๐Ÿง‘โ€๐Ÿซ Lecture 4

pruning
lecture
Pruning and Sparsity (Part II)
Author

Jung Yeon Lee

Published

February 18, 2024

์ด์ „ ํฌ์ŠคํŒ…์—์„œ Pruning์— ๋Œ€ํ•ด์„œ ๋ฐฐ์› ์—ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” Pruning์— ๋Œ€ํ•œ ๋‚จ์€ ์ด์•ผ๊ธฐ์ธ Pruning Ratio๋ฅผ ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•, Fine-tuning ๊ณผ์ •์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ , ๋งˆ์ง€๋ง‰์œผ๋กœ Sparsity๋ฅผ ์œ„ํ•œ System Support์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ ์ž ํ•œ๋‹ค.

1. Pruning Ratio

Pruning์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์–ด๋Š ์ •๋„ Pruning์„ ํ•ด์•ผ ํ• ์ง€ ์–ด๋–ป๊ฒŒ ์ •ํ•ด์•ผ ํ• ๊นŒ?

์ฆ‰, ๋‹ค์‹œ ๋งํ•ด์„œ ๋ช‡ % ์ •๋„ ๊ทธ๋ฆฌ๊ณ  ์–ด๋–ป๊ฒŒ Pruning์„ ํ•ด์•ผ ์ข‹์„๊นŒ?

Comparison of pruning approaches

When pruning per channel, applying the same (uniform) pruning ratio to every channel does not perform well. In the graph on the right, the direction we want is low latency and high accuracy, i.e. the upper-left region, so pruning should push the model toward that corner. The conclusion is that channels must be treated differently: some channels should get a high pruning ratio, others a low one.

1.1 Sensitivity Analysis

The basic idea behind pruning each part of the network differently is as follows.

  • Layers that strongly affect accuracy should be pruned less.
  • Layers that barely affect accuracy should be pruned more.

This is a natural idea, since the goal is to make the model lighter through pruning while degrading accuracy as little as possible relative to the original model. A layer that strongly affects accuracy can also be described as a sensitive layer. So we measure each layer's sensitivity and design a lower pruning ratio for the sensitive layers.

To measure each layer's sensitivity, let us run a sensitivity analysis. Naturally, the higher a layer's pruning ratio, the more of its weights are pruned away, so accuracy drops.
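The scan can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's code: `evaluate` is a placeholder for a validation-accuracy function, and pruning here is plain magnitude pruning as in the previous lecture.

```python
import numpy as np

def magnitude_prune(weight, ratio):
    """Return a copy with the `ratio` fraction of smallest-|w| entries zeroed."""
    w = weight.copy()
    k = int(w.size * ratio)
    if k > 0:
        idx = np.argsort(np.abs(w).ravel())[:k]  # indices of the k smallest magnitudes
        w.ravel()[idx] = 0.0
    return w

def sensitivity_scan(layers, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Prune one layer at a time (all others stay dense) and record accuracy.
    `layers` maps layer name -> weight array; `evaluate(layers)` returns accuracy."""
    results = {}
    for name in layers:
        original = layers[name]
        curve = []
        for r in ratios:
            layers[name] = magnitude_prune(original, r)
            curve.append((r, evaluate(layers)))
        layers[name] = original  # restore before scanning the next layer
        results[name] = curve
    return results
```

Sweeping each layer in isolation like this is exactly what produces the per-layer accuracy curves discussed below.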

Pruning-rate curve for layer L0

Which weights get pruned at a given pruning ratio is decided by the "importance" (the magnitude of each weight) covered in the previous lecture.

์œ„์˜ ๊ทธ๋ฆผ์—์„œ ์ฒ˜๋Ÿผ Layer 0(L0)๋งŒ์„ ๊ฐ€์ง€๊ณ  Pruning Ratio๋ฅผ ๋†’์—ฌ๊ฐ€๋ฉด์„œ ๊ด€์ฐฐํ•ด๋ณด๋ฉด, ์•ฝ 70% ์ดํ›„๋ถ€ํ„ฐ๋Š” Accuracy๊ฐ€ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ๋–จ์–ด์ง€๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. L0์—์„œ Ratio๋ฅผ ๋†’์—ฌ๊ฐ€๋ฉฐ Accuracy์˜ ๋ณ€ํ™”๋ฅผ ๊ด€์ฐฐํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ๋‹ค๋ฅธ Layer๋“ค๋„ ๊ด€์ฐฐํ•ด๋ณด์ž.

Per-layer sensitivity comparison

L1์€ ๋‹ค๋ฅธ Layer๋“ค์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ Pruning Ratio๋ฅผ ๋†’์—ฌ๊ฐ€๋„ Accuracy์˜ ๋–จ์–ด์ง€๋Š” ์ •๋„๊ฐ€ ์•ฝํ•œ ๋ฐ˜๋ฉด, L0๋Š” ๋‹ค๋ฅธ Layer๋“ค์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ Pruning Ratio๋ฅผ ๋†’์—ฌ๊ฐ€๋ฉด Accuracy์˜ ๋–จ์–ด์ง€๋Š” ์ •๋„๊ฐ€ ์‹ฌํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ L1์€ Sensitivity๊ฐ€ ๋†’๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฉฐ Pruning์„ ์ ๊ฒŒํ•ด์•ผ ํ•˜๊ณ , L0์€ Sensitivity๊ฐ€ ๋‚ฎ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฉฐ Pruning์„ ๋งŽ๊ฒŒํ•ด์•ผ ํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Before moving on, a few points worth keeping in mind about sensitivity analysis.

  1. Sensitivity analysis assumes the layers act independently, i.e. pruning L0 does not change the effect of pruning L1.
  2. The same pruning ratio in two layers does not mean the same number of pruned weights.
    • A 10% pruning ratio removes 10 weights from a layer with 100 weights, but 50 weights from a layer with 500 weights.
    • The effect of a given pruning ratio therefore depends on the layer's overall size.

After the sensitivity analysis, a human usually sets a threshold on the acceptable accuracy drop and derives each layer's pruning ratio from it.

Setting the threshold

The graph above shows an example: using the horizontal threshold line \(T\) where accuracy stays around 75%, we decide to prune L0 to about 74%, L4 to about 80%, L3 to about 82%, and L2 up to 90%. The sensitive layer L0 is pruned relatively little, while the less sensitive layer L2 is pruned heavily.
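Given the per-layer curves from the scan, the threshold rule is simply "the largest ratio whose accuracy stays at or above \(T\)". A minimal sketch (the curve numbers below are made up for illustration):

```python
def pick_ratios(sensitivity, threshold):
    """For each layer, choose the largest pruning ratio whose accuracy stays
    at or above `threshold` (0.0 if even the smallest ratio drops below it).
    `sensitivity` maps layer name -> list of (ratio, accuracy) pairs."""
    chosen = {}
    for name, curve in sensitivity.items():
        ok = [r for r, acc in curve if acc >= threshold]
        chosen[name] = max(ok) if ok else 0.0
    return chosen
```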

Of course, a hand-picked threshold leaves room for improvement. Let us look at ways to find the pruning ratio more automatically.

1.2 AMC

AMC stands for AutoML for Model Compression; it uses reinforcement learning to find the optimal pruning ratios.

Overall structure of AMC

AMC์˜ ๊ตฌ์กฐ๋Š” ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™๋‹ค. ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ณ„์—ด ์ค‘, Actor-Critic ๊ณ„์—ด์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ Deep Deterministic Policy Gradient(DDPG)์„ ํ™œ์šฉํ•˜์—ฌ Pruning Ratio๋ฅผ ์ •ํ•˜๋Š” Action์„ ์„ ํƒํ•˜๋„๋ก ํ•™์Šตํ•œ๋‹ค. ์ž์„ธํ•œ MDP(Markov Decision Process) ์„ค๊ณ„๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

AMC์˜ MDP

The reward function, which steers the agent's learning, is designed not only to reduce error (accounting for model accuracy) but also to push down FLOPs, the model's computation count, which indirectly accounts for latency. The operations-vs-Top-1-accuracy plot on the right shows accuracy growing roughly logarithmically with computation, and the reward design reflects that observation.
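The FLOP-aware reward from the AMC paper, \(R = -\text{Error} \cdot \log(\text{FLOPs})\), can be written down directly; the log term mirrors the logarithmic accuracy-vs-computation trend just mentioned:

```python
import math

def amc_reward(error, flops):
    """FLOP-constrained reward from the AMC paper: R = -Error * log(FLOPs).
    Lower error and fewer FLOPs both raise the reward, and the log keeps the
    FLOP term from dominating the accuracy term."""
    return -error * math.log(flops)
```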

AMC์˜ Reward Function

Let us compare pruning with AMC against pruning by a human expert. In the per-section density histogram below, looking at the Total bars: when both are pruned to the same accuracy, the AMC-pruned model (orange) ends up with lower density than the human-expert-pruned model (blue). In other words, AMC pruned more weights, maintaining accuracy with an even lighter model.

AMC์˜ Density Graph

Let us look more closely at the second line plot, where AMC alternates pruning and fine-tuning over several steps. Plotting each iteration (pruning + fine-tuning) as stages 1, 2, 3, 4 shows that density is lower in the 3x3 convs than in the 1x1 convs; that is, AMC pruned the 3x3 convs more than the 1x1 convs. One interpretation: pruning a 3x3 conv removes 9 weights at a time, versus 1 weight for a 1x1 conv, so AMC leaned on 3x3 pruning because it removes more weights per action.

AMC Result

In the AMC results table, both AMC models that cut FLOPs and time to 50% lose only about 0.1-0.4% Top-1 accuracy relative to the original 1.0 MobileNet, while latency and speedup are tuned efficiently.

From the table, one might expect 0.75 MobileNet, which removes 25% of the weights, to give a speedup of \(\frac{4}{3} \simeq 1.3\)x. But since computation shrinks quadratically, the speedup comes out to \(\frac{4}{3} \cdot \frac{4}{3} \simeq 1.7\)x.
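A quick check of that arithmetic, assuming a conv layer's FLOPs scale with the product of input and output channel counts:

```python
def speedup(width_multiplier):
    """A conv layer's FLOPs scale with in_channels * out_channels, so shrinking
    both by `width_multiplier` cuts compute quadratically, not linearly.
    Returns (naive linear expectation, quadratic estimate)."""
    linear = 1.0 / width_multiplier          # ~1.33x at width 0.75
    quadratic = 1.0 / width_multiplier ** 2  # ~1.78x at width 0.75
    return linear, quadratic
```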

1.3 NetAdapt

Another technique for choosing pruning ratios is NetAdapt. It applies pruning layer by layer under a latency constraint. For example, if the per-step latency reduction target is set to 1 ms, each layer's pruning ratio is raised until latency drops from, say, 10 ms to 9 ms.

NetAdapt

NetAdapt์˜ ์ „์ฒด์ ์ธ ๊ณผ์ •์€ ์•„๋ž˜์™€ ๊ฐ™์ด ์ง„ํ–‰๋œ๋‹ค. ๊ธฐ์กด ๋ชจ๋ธ์—์„œ ๊ฐ layer๋ฅผ Latency Constraint์— ๋„๋‹ฌํ•˜๋„๋ก Pruningํ•˜๋ฉด์„œ Accuracy(\(Acc_A\)๋“ฑ)์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ธก์ •ํ•œ๋‹ค.

  1. Adjust each layer's pruning ratio.
  2. Run short-term fine-tuning.
  3. Check whether the latency constraint has been reached.
  4. Once the latency constraint is reached, take that as the layer's optimal pruning ratio.
  5. After every layer's optimal pruning ratio is fixed, finish with long-term fine-tuning.

The NetAdapt process
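One iteration of the loop above can be sketched roughly as follows. Everything here is a simplified stand-in: `latency`, `prune_more`, `short_finetune`, and `accuracy` are hypothetical callbacks, and a real implementation prunes filter by filter rather than by a fixed ratio increment.

```python
def netadapt_step(layers, latency, prune_more, short_finetune, accuracy, budget):
    """One NetAdapt iteration: for every layer, raise that layer's pruning
    ratio until the model meets the reduced latency budget, short-term
    fine-tune, then keep the candidate with the best accuracy.
    `layers` maps layer name -> current pruning ratio."""
    best = None
    for name in layers:
        candidate = dict(layers)
        while latency(candidate) > budget:      # prune this layer until budget met
            candidate = prune_more(candidate, name)
        short_finetune(candidate)
        acc = accuracy(candidate)
        if best is None or acc > best[1]:
            best = (candidate, acc)
    return best[0]
```

Running this step repeatedly, with the budget lowered each time, yields the per-layer ratios; the long-term fine-tuning happens once at the end.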

Following this procedure yields the experimental results below: compared with the uniformly pruned multipliers baseline, NetAdapt is 1.7x faster, with accuracy that is actually about 0.3% higher.

NetAdapt์˜ Latency / Top-1 Accuracy ๊ทธ๋ž˜ํ”„

2. Fine-tuning/Training

To recover the performance of a pruned model, a fine-tuning stage is needed after pruning.

2.1 Iterative Pruning

When fine-tuning a pruned model, a learning rate smaller than the one used for the original training is typically used, e.g. \(1/100\) or \(1/10\) of the original learning rate. Also, rather than pruning and fine-tuning just once, it is better to alternate pruning and fine-tuning over several rounds while gradually increasing the pruning ratio.

Comparison of iterative pruning + fine-tuning
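A rough sketch of the iterative schedule, with a placeholder `finetune` callback standing in for the reduced-learning-rate training rounds:

```python
import numpy as np

def iterative_prune(w, finetune, target_sparsity=0.9, num_rounds=5):
    """Alternate pruning and fine-tuning rather than pruning once: each round
    raises the pruning ratio a little, zeroes the smallest-magnitude weights,
    then fine-tunes (typically at 1/10 to 1/100 of the original learning rate)."""
    for step in range(1, num_rounds + 1):
        ratio = target_sparsity * step / num_rounds  # gradually raise the ratio
        k = int(w.size * ratio)
        idx = np.argsort(np.abs(w).ravel())[:k]
        w.ravel()[idx] = 0.0
        w = finetune(w)  # fine-tune; pruned entries should stay zero
    return w
```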

2.2 Regularization

One way to view the goal of TinyML is to drive as many weights as possible to zero, since that is what makes the model light. So regularization techniques are used to push the model's weights to zero, or to small values close to zero. The reason small values help is that weights near zero are the most likely to be pruned to exactly zero in subsequent rounds. The mechanics are no different from the regularization used to prevent overfitting in ordinary deep-learning models, but it is worth noting that the intent and purpose differ.

Regularization for pruning
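As a concrete (if minimal) example, an L1 penalty added to the task loss pulls every weight toward zero by a constant gradient step, which is what makes more weights prunable; `lam` is a hypothetical coefficient:

```python
import numpy as np

def l1_regularized_loss(task_loss, weights, lam=1e-4):
    """Add an L1 penalty lam * sum(|w|) to the task loss. Its gradient is a
    constant-magnitude pull toward zero on every weight, so after training
    more weights sit near zero and survive magnitude pruning as exact zeros."""
    return task_loss + lam * sum(np.abs(w).sum() for w in weights)
```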

2.3 The Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis (LTH), introduced by Jonathan Frankle and Michael Carbin in a paper presented at ICLR 2019, proposes an intriguing idea about training deep neural networks (DNNs): inside a large, randomly initialized network there exists a smaller subnetwork (a "winning ticket") that, when trained separately from scratch, can match or exceed the original network's performance. The hypothesis assumes these winning tickets have initial weights that are particularly well suited to learning.

Illustration of the Lottery Ticket Hypothesis
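The procedure used in the paper, iterative magnitude pruning with rewinding to the original initialization, can be sketched as follows; `train` is a placeholder for full training of the masked subnetwork:

```python
import numpy as np

def find_winning_ticket(init_weights, train, prune_fraction=0.2, rounds=3):
    """Iterative magnitude pruning as in the LTH paper: train, prune the
    smallest surviving weights, then REWIND survivors to their original
    initialization and repeat. The final mask plus the initial weights is
    the candidate 'winning ticket'."""
    mask = np.ones_like(init_weights)
    for _ in range(rounds):
        trained = train(init_weights * mask) * mask  # train the subnetwork
        alive = np.abs(trained[mask == 1])
        k = int(alive.size * prune_fraction)
        if k == 0:
            break
        threshold = np.sort(alive)[k - 1]
        mask[np.abs(trained) <= threshold] = 0  # prune; weights rewind to init
    return mask, init_weights * mask
```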

3. System Support for Sparsity

DNN์„ ๊ฐ€์†ํ™” ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์€ ํฌ๊ฒŒ 3๊ฐ€์ง€, Sparse Weight, Sparse Activation, Weight Sharing์ด ์žˆ๋‹ค. Sparse Weight, Sparse Activation์€ Pruning์ด๊ณ  Weight Sharing์€ Quantization์˜ ๋ฐฉ๋ฒ•์ด๋‹ค.

DNN์„ ๊ฐ€์†ํ™” ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•
  1. Sparse Weight: Weight๋ฅผ Pruningํ•˜์—ฌ Computation์€ Pruning Ratio์— ๋Œ€์‘ํ•˜์—ฌ ๋นจ๋ผ์ง„๋‹ค. ํ•˜์ง€๋งŒ Memory๋Š” Pruning๋œ weight์˜ ์œ„์น˜๋ฅผ ๊ธฐ์–ตํ•˜๊ธฐ ์œ„ํ•œ memory ์šฉ๋Ÿ‰์ด ํ•„์š”ํ•˜๋ฏ€๋กœ Pruning Ratio์— ๋น„๋ก€ํ•˜์—ฌ ์ค„์ง„ ์•Š๋Š”๋‹ค.
  2. Sparse Activation: Weight๋ฅผ Pruningํ•˜๋Š” ๊ฒƒ๊ณผ ๋‹ค๋ฅด๊ฒŒ Activation์€ Test Input์— ๋”ฐ๋ผ dynamic ํ•˜๋ฏ€๋กœ Weight๋ฅผ Pruningํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค Computation์ด ๋œ ์ค„์–ด๋“ ๋‹ค.
  3. Weight Sharing: Quantization ๋ฐฉ๋ฒ•์œผ๋กœ 32-bit data๋ฅผ 4-bit data๋กœ ๋ณ€๊ฒฝํ•จ์œผ๋กœ์จ 8๋ฐฐ์˜ memory ์ ˆ์•ฝ์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.

3.1 EIE

EIE (Efficient Inference Engine) is a hardware accelerator architecture optimized for running compressed machine-learning models in real time.

Structure of the Processing Elements (PE)

Logical vs. physical view of the PE computation

In the figure below, the per-input (\(\vec{a}\)) computation is skipped entirely when the input is 0; when it is nonzero, it is multiplied only with the weights that were not pruned.

Per-input computation flow
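The skip-zero idea can be sketched as a sparse matrix-vector product: the surviving weights are stored column-wise in compressed form, and a whole column of work is skipped when the corresponding activation is zero. This is a software sketch of the dataflow, not the EIE hardware itself:

```python
import numpy as np

def eie_matvec(csc_values, csc_rows, csc_colptr, a, num_rows):
    """EIE-style sparse mat-vec: iterate only over nonzero activations a[j]
    (zeros are skipped entirely, exploiting dynamic activation sparsity) and,
    for each, only over the surviving weights of column j, stored in
    compressed-sparse-column form (values, row indices, column pointers)."""
    out = np.zeros(num_rows)
    for j, aj in enumerate(a):
        if aj == 0:  # input is zero: skip the whole column
            continue
        for p in range(csc_colptr[j], csc_colptr[j + 1]):
            out[csc_rows[p]] += csc_values[p] * aj
    return out
```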

The EIE experiments used the data type with the least loss, 16-bit integers (0.5% loss). Models that heavily use ReLU activations, such as AlexNet and VGG, compress well, whereas the NeuralTalk models built on RNNs and LSTMs do not use ReLU, leaving nothing to sparsify, so their activation density is 100%.

EIE experimental results

3.2 M:N Weight Sparsity

This method requires support from Nvidia hardware and typically uses 2:4 weight sparsity. The sparse matrix on the left is repacked into a non-zero data matrix plus a separate index matrix that stores each value's position.

2:4 weight sparsity
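The repacking can be illustrated in a few lines: in every group of four weights, keep the two largest magnitudes and store their values together with 2-bit in-group indices. This is a sketch of the storage format, not NVIDIA's actual kernels:

```python
import numpy as np

def compress_2_4(dense_row):
    """2:4 structured sparsity: in each group of 4 weights keep the 2 largest
    magnitudes, storing only those values plus their positions in the group
    (each position fits in 2 bits, 0..3)."""
    assert dense_row.size % 4 == 0
    values, indices = [], []
    for g in range(0, dense_row.size, 4):
        group = dense_row[g:g + 4]
        keep = np.argsort(np.abs(group))[-2:]  # two largest magnitudes
        for i in sorted(keep):
            values.append(group[i])
            indices.append(i)
    return np.array(values), np.array(indices, dtype=np.uint8)
```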

A dense GEMM without M:N weight sparsity and a sparse GEMM with it proceed as shown in the figure below.

Dense VS. Sparse GEMM

3.3 Sparse Convolution

Submanifold Sparse Convolutional Networks (SSCN) are a form of neural-network architecture that enables efficient computation on high-dimensional data. The technique matters especially when processing large, high-dimensional data such as 3D point clouds or high-resolution images. The core idea of SSCN is to exploit the sparsity of the data to drastically reduce computation and memory use.

Source: Submanifold Sparse Convolutional Networks

Compared with ordinary convolution, sparse convolution can be depicted as in the figure below.

Conventional VS. Sparse Convolution

To compare the computations, suppose we have an input point cloud (\(P\)), a feature map (\(W\)), and an output point cloud (\(Q\)) as below. Comparing ordinary convolution with sparse convolution, the operation counts come out 9:2, so sparse convolution needs far fewer operations.

Operation-count comparison: conventional vs. sparse convolution

Taking the feature map (\(W\)) as the reference, each weight needs a different amount of input data. For example, \(W_{-1, 0}\) is multiplied only with \(P1\), so only \(P1\) is fetched for that computation.

Sparse-convolution computation process
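This per-weight gathering is usually implemented by building, for each kernel offset, the list of (input, output) index pairs it touches. A minimal 2D sketch under the submanifold convention (outputs exist only at active input sites):

```python
def build_kernel_maps(in_coords, offsets):
    """Sparse-convolution bookkeeping: for each kernel offset, collect the
    (input_index, output_index) pairs it must multiply. With submanifold
    convolution the output sites coincide with the input sites, so
    input + offset must itself be an active coordinate to contribute."""
    coord_index = {c: i for i, c in enumerate(in_coords)}
    maps = {}
    for off in offsets:
        pairs = []
        for i, (x, y) in enumerate(in_coords):
            out = (x + off[0], y + off[1])
            if out in coord_index:  # only active sites produce output
                pairs.append((i, coord_index[out]))
        maps[off] = pairs
    return maps
```

The lengths of these per-offset lists are exactly the uneven workloads that the grouping strategy below has to balance.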

๋”ฐ๋ผ์„œ Feature Map์˜ \(W\)์— ๋”ฐ๋ผ ํ•„์š”ํ•œ Input data๋ฅผ ํ‘œํ˜„ํ•˜๊ณ  ๋”ฐ๋กœ computation์„ ์ง„ํ–‰ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ๊ณ ๋ฅด์ง€ ๋ชปํ•œ ์—ฐ์‚ฐ๋Ÿ‰ ๋ถ„๋ฐฐ๊ฐ€ ์ง„ํ–‰๋˜๋Š”๋ฐ(์™ผ์ชฝ ๊ทธ๋ฆผ) ์ด๋Š” computation์— overhead๋Š” ์—†์ง€๋งŒ regularity๊ฐ€ ์ข‹์ง€ ์•Š๋‹ค. ๋˜๋Š” ๊ฐ€์žฅ computation์ด ๋งŽ์€ ๊ฒƒ์„ ๊ธฐ์ค€์œผ๋กœ Batch ๋‹จ์œ„๋กœ ๊ณ„์‚ฐํ•˜๊ฒŒ ๋œ๋‹ค๋ฉด(๊ฐ€์šด๋ฐ ๊ทธ๋ฆผ) ์ ์€ computation weight์—์„œ์˜ ๋น„ํšจ์œจ์ ์ธ ๊ณ„์‚ฐ ๋Œ€๊ธฐ์‹œ๊ฐ„์ด ์ƒ๊ธฐ๋ฏ€๋กœ overhead๊ฐ€ ์ƒ๊ธด๋‹ค. ๋”ฐ๋ผ์„œ ์ ์ ˆํžˆ ๋น„์Šทํ•œ ์—ฐ์‚ฐ๋Ÿ‰์„ ๊ฐ€์ง€๋Š” grouping์„ ์ง„ํ–‰ํ•œ ๋’ค batch๋กœ ๋ฌถ์œผ๋ฉด ์ ์ ˆํžˆ computation์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.(์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ)

Grouping Computation

Applying this grouping and then running the sparse convolution with adaptive grouping proceeds as shown below.

Example of sparse convolution

This wraps up the summary of the 2023 lecture's final section on sparse convolution. Since the lecture omits many details, for more depth I recommend the YouTube presentation video or the 2022 edition of the course.

4. References