๐Ÿง‘โ€๐Ÿซ Lecture 9

lecture
knowledge distillation
Knowledge Distillation (KD)
Author

Seunghyun Oh

Published

March 19, 2024

์ด๋ฒˆ ์‹œ๊ฐ„์€ Knowledge Distillation ๊ธฐ๋ฒ•์— ๋Œ€ํ•ด์„œ ์ด์•ผ๊ธฐ ํ•ด๋ณผ๊นŒ ํ•ด์š”. ์ง€๊ธˆ๊นŒ์ง€ ์ž‘์€ ํฌ๊ธฐ์˜ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ดค์ง€๋งŒ, ์—ฌ์ „ํžˆ ์ž‘์€ ๋ชจ๋ธ์€ ์„ฑ๋Šฅ์ ์œผ๋กœ ๋ถ€์กฑํ•œ ์ ์ด ๋งŽ์ฃ . ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ค๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด์„œ ๊ณ ๋ฏผํ•˜๋‹ค๊ฐ€ โ€œํฌ๊ธฐ๊ฐ€ ํฐ ๋ชจ๋ธ์„ ์ด์šฉํ•ด๋ณด์ž.โ€ ์—์„œ ๋‚˜์˜จ ์•„์ด๋””์–ด๊ฐ€ ๋ฐ”๋กœ Knowledge Distillation ์ž…๋‹ˆ๋‹ค.

1. What is Knowledge Distillation?

Knowledge Distillation์€ ๊ฐ„๋‹จํ•˜๊ฒŒ Teach Network๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” ํฌ๊ธฐ๊ฐ€ ํฐ ๋ชจ๋ธ์ด ์žˆ์–ด์š”. ์ด Teacher Network๊ฐ€ ๋จผ์ € Training์„ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์˜ค๋Š˜์˜ ์ฃผ์ธ๊ณต Student Network๋กœ ๋ถˆ๋ฆฌ๋Š” ํฌ๊ธฐ๊ฐ€ ์ž‘์€ ๋ชจ๋ธ์ด ์žˆ์ฃ . ์ด ๋ชจ๋ธ์€ ๋‘ ๊ฐ€์ง€ ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต์„ ํ•˜๋Š”๋ฐ, ์ฒซ ๋ฒˆ์งธ๋Š” ๊ธฐ์กด์— ํ•™์Šตํ•˜๋˜ ๋Œ€๋กœ Target ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ํ•™์Šต์ด ์žˆ๊ตฌ์š”. ๋‹ค๋ฅธ ํ•œ ๊ฐ€์ง€๋Š” Teacher Network๋ฅผ ๋”ฐ๋ผ๊ฐ€๋Š” ํ•™์Šต์ด ์žˆ์Šต๋‹ˆ๋‹ค.

Reference. Knowledge Distillation: A Survey [Gou et al., IJCV 2020]
  • The goal of knowledge distillation is to align the class probability distributions from teacher and student networks.

๊ทธ๋Ÿผ ๊ถ๊ธˆํ•œ ์ ์ด Teacher Network์— ์–ด๋–ค ์ ์„ ๋ฐฐ์›Œ์•ผํ• ๊นŒ์š”? ๊ฐ•์˜์—์„œ๋Š” ์ด 6๊ฐœ๋กœ Output logit, Intermediate weight, Intermediate feature, Gradient, Sparsity pattern, Relational information์œผ๋กœ ๋‚˜๋ˆ ์„œ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์„ค๋ช…ํ•˜๊ธฐ์— ์•ž์„œ์„œ ๊ฐœ๋…ํ•˜๋‚˜ ์†Œ๊ฐœํ•˜๊ณ  ๋„˜์–ด๊ฐˆ๊ป˜์š”.

Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai

Suppose the trained Teacher Network above (at T=1) assigns a probability of 0.982 to the picture being a cat and 0.017 to it being a dog. If the Student Network learns from the output logits, it will try to follow these two probabilities. Depending on the student, though, matching such extreme values can be difficult. This is where the notion of "Temperature (T)" comes in: it makes the teacher's probabilities for Cat and Dog smoother.

\[ p(z_i, T) = \dfrac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \]

Written out as a formula it looks like the above; the lecture mentions that T is usually set to 1. Why explain it here? So that the concept feels familiar if it shows up later ๐Ÿ™‚ Now, let's see which parts of the Teacher Network we can have the Student Network learn.
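To make the idea concrete, here is a minimal PyTorch sketch of temperature scaling; the logits are made-up values chosen so that T=1 roughly reproduces the 0.982 / 0.017 cat-vs-dog example above.

```python
import torch
import torch.nn.functional as F

# Made-up teacher logits for [cat, dog]; at T=1 the softmax is roughly [0.982, 0.018].
logits = torch.tensor([4.0, 0.0])

for T in [1.0, 2.0, 4.0]:
    # p(z_i, T) = exp(z_i / T) / sum_j exp(z_j / T)
    probs = F.softmax(logits / T, dim=0)
    print(f"T={T}: {probs.tolist()}")  # larger T -> smoother ("softer") distribution
```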

2. What to match between Teacher and Student Network?

2.1 Output logits

The first is the output logits. The representative losses here are the cross-entropy loss and the L2 loss, as sketched below.

Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai
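Below is a minimal sketch of output-logit distillation in the Hinton-style form (a softened KL term plus a hard cross-entropy term); the temperature T, the weight alpha, and the toy shapes are assumptions, not values from the lecture.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Hard-label term: ordinary cross entropy against the ground-truth targets.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label term: KL divergence between softened teacher and student distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kl

# Toy usage with random logits for a batch of 8 samples and 10 classes.
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y))
```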

2.2 Intermediate weights

The second is the weights of each layer. The Student Model's weight dimensions are bound to be different, though, in which case we can use a linear transformation to align the dimensions and then train against the teacher's weights.

Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai


A question that came up during the study session was: "Doesn't that add extra layers to the Student Network, defeating the purpose of making it small?" My view is that the layer performing the linear transformation that aligns the weight dimensions can be dropped at inference time, so it is a fair way to raise the student's accuracy; it is like assembling only the parts you actually need at inference.
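As a concrete illustration of the idea (not the exact recipe from the slides), here is a sketch that matches one student weight matrix to a teacher weight matrix through learnable linear maps; the layer sizes and the L2 objective are assumptions, and the projection parameters would be thrown away at inference.

```python
import torch
import torch.nn as nn

# Teacher and student layers whose weight shapes differ (assumed sizes).
teacher_fc = nn.Linear(128, 256)   # teacher weight: (256, 128)
student_fc = nn.Linear(64, 128)    # student weight: (128, 64)

# Learnable linear maps that lift the student weight into the teacher's shape.
# They exist only for this training loss and are dropped at inference time.
proj_in = nn.Parameter(torch.randn(64, 128) * 0.02)    # input dim 64 -> 128
proj_out = nn.Parameter(torch.randn(256, 128) * 0.02)  # output dim 128 -> 256

def weight_match_loss():
    w_s = student_fc.weight                   # (128, 64)
    w_lifted = proj_out @ w_s @ proj_in       # (256, 128), same shape as the teacher weight
    return ((w_lifted - teacher_fc.weight.detach()) ** 2).mean()

print(weight_match_loss())
```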

2.3 Intermediate features

The third is features. If the previous case was about weights, this one is about each layer's outputs. There are several ways to make the Teacher Network's and Student Network's features agree; here the lecture introduces a method that trains on the cosine of the angle between them (Like What You Like: Knowledge Distill via Neuron Selectivity Transfer [Huang and Wang, arXiv 2017]) and a method that reduces the dimensionality before matching (Paraphrasing Complex Network: Network Compression via Factor Transfer [Kim et al., NeurIPS 2018]); a small sketch follows the notes below.

Reference. Like What You Like: Knowledge Distill via Neuron Selectivity Transfer [Huang and Wang, arXiv 2017]

Reference. Paraphrasing Complex Network: Network Compression via Factor Transfer [Kim et al., NeurIPS 2018]
  • The paraphraser shrinks the teacher's output feature map from m dimensions to m x k dimensions (called the factor; typically k=0.5) and then expands the dimensionality back to m.
  • The output of paraphraser is supervised with a reconstruction loss against the original m-dimensional output.
  • Student uses one layer of MLP to obtain a factor with the same dimensionality of m x k.
  • FT minimizes the distance between teacher and student factors.
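Here is a minimal sketch of the factor-transfer idea summarized in the bullets above: a paraphraser compresses the teacher feature map, a translator maps the student feature map to the same factor size, and the two factors are matched with an L1 loss. The channel counts, layer choices, and module definitions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C_t, C_s, k = 64, 32, 0.5
factor_ch = int(C_t * k)                         # factor dimensionality (m x k)

# Teacher-side paraphraser: an autoencoder trained with a reconstruction loss.
paraphraser = nn.Sequential(nn.Conv2d(C_t, factor_ch, 3, padding=1), nn.ReLU())
reconstructor = nn.Conv2d(factor_ch, C_t, 3, padding=1)

# Student-side translator: maps student features to the same factor dimensionality.
translator = nn.Conv2d(C_s, factor_ch, 3, padding=1)

f_t = torch.randn(4, C_t, 8, 8)                  # assumed teacher feature map
f_s = torch.randn(4, C_s, 8, 8)                  # assumed student feature map

# 1) Train the paraphraser to reconstruct the teacher feature map.
recon_loss = F.mse_loss(reconstructor(paraphraser(f_t)), f_t)

# 2) Match normalized teacher and student factors with an L1 loss.
factor_t = F.normalize(paraphraser(f_t).flatten(1), dim=1)
factor_s = F.normalize(translator(f_s).flatten(1), dim=1)
ft_loss = F.l1_loss(factor_s, factor_t.detach())

print(recon_loss.item(), ft_loss.item())
```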

2.4 Gradients

The fourth is gradients. One way to visualize gradients is the attention map, which picks out the distinctive parts of an image; a sketch follows the notes below.

  • Reference: Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer [Zagoruyko and Komodakis, ICLR 2017]
  • Gradients of feature maps are used to characterize attention of DNNs
  • The attention of a CNN feature map \(x\) is defined as \(\dfrac{\partial L}{\partial x}\), where \(L\) is the learning objective.
  • Intuition: If \(\dfrac{\partial L}{\partial x_{i,j}}\) is large, a small perturbation at \(i,j\) will significantly impact the final output. As a result, the network is putting more attention on position \(i, j\)

Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai
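Below is a minimal sketch of attention transfer. The bullets above describe the gradient-based definition; the same paper also uses an activation-based attention map (built from squared channel activations), which is what this sketch matches because it needs only a forward pass. The toy conv blocks and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_map(feature):                     # feature: (N, C, H, W)
    amap = feature.pow(2).mean(dim=1)           # collapse channels: mean of squared activations
    return F.normalize(amap.flatten(1), dim=1)  # normalize the flattened map per sample

teacher_conv = nn.Conv2d(3, 64, 3, padding=1)   # assumed teacher block
student_conv = nn.Conv2d(3, 16, 3, padding=1)   # assumed smaller student block

x = torch.randn(4, 3, 32, 32)
at_loss = F.mse_loss(attention_map(student_conv(x)),
                     attention_map(teacher_conv(x)).detach())
print(at_loss)
```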

์•„๋ž˜ ๊ทธ๋ฆผ์€ โ€œAttention Map์ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋†’๋‹ค๋ฉด ๋น„์Šทํ•œ ํŒจํ„ด์œผ๋กœ ๋‚˜์˜จ๋‹ค.โ€ ๋Š” ์˜ˆ์‹œ๋กœ ๋‚˜์˜ต๋‹ˆ๋‹ค. Resnet34์™€ ResNet101์˜ Attention Map์€ ์œ ์‚ฌํ•˜๊ฒŒ ๋ณด์ด๋Š” ๋ฐ˜๋ฉด NIN์ธ ๊ฒฝ์šฐ๋Š” ๋งŽ์ด ๋‹ค๋ฅธ ๊ฒƒ์„ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Performant models have similar attention maps

    Attention maps of performant ImageNet models (ResNets) are indeed similar to each other, but the less performant model (NIN) has quite different attention maps

    Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai

2.5 Sparsity patterns

The fifth is sparsity patterns. The idea is to make the output activations of each layer match, which looks similar to intermediate-feature matching; see the sketch after the notes below.

Reference. Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons [Heo et al., AAAI 2019]
  • Intuition: the teacher and student networks should have similar sparsity patterns after the ReLU activation. A neuron is activated after ReLU if its value is larger than 0, denoted by the indicator function \(\rho(x) = 1 [x>0]\).

  • We want to minimize \(\mathscr{L}(I) = \lVert \rho(T(I))-\rho(S(I)) \rVert_1\), where \(S\) and \(T\) correspond to the student and teacher networks, respectively
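A minimal sketch of the sparsity-pattern mismatch defined above is shown here. Note that the indicator \(\rho(x) = 1[x>0]\) has zero gradient almost everywhere, so the paper actually optimizes a differentiable surrogate; this sketch only measures the mismatch itself, on assumed toy layers.

```python
import torch
import torch.nn as nn

teacher_layer = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
student_layer = nn.Sequential(nn.Linear(32, 64), nn.ReLU())

x = torch.randn(8, 32)
rho_t = (teacher_layer(x) > 0).float()      # activation indicator of the teacher
rho_s = (student_layer(x) > 0).float()      # activation indicator of the student
mismatch = (rho_t - rho_s).abs().sum(dim=1).mean()   # ||rho(T(I)) - rho(S(I))||_1 per sample
print(mismatch)
```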

2.6.1 Relational information: Different Layers

Finally, two methods are presented for also matching the relational information among tensors inside the model. The first: taking the inner product of a layer's input and output tensors yields a matrix, and the idea is to train the student so that this matrix matches the teacher's (sketched after the notes below).

Reference: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning [Yim et al., CVPR 2017]
  1. Use inner product to extract relational information (a matrix of shape \(C_{in} \times C_{out}\), reduction on the spatial dimensions) for both student and teacher networks. *Note: the student and teacher networks only differ in number of layers, not number of channels

  2. Then match the resulting dot products between teacher and student networks \((G_1^T, G_1^S)\)
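Here is a minimal sketch of that inner-product relation: the spatial dimensions are reduced away, leaving a \(C_{in} \times C_{out}\) matrix per sample, and the teacher's and student's matrices are matched with an L2 loss. The feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def relation_matrix(f_in, f_out):
    # f_in: (N, C_in, H, W), f_out: (N, C_out, H, W) -> (N, C_in, C_out)
    n, c_in, h, w = f_in.shape
    return torch.einsum("nihw,nohw->nio", f_in, f_out) / (h * w)

# Toy features; student and teacher share channel counts (only depth differs, per the note above).
f1_t, f2_t = torch.randn(4, 16, 8, 8), torch.randn(4, 32, 8, 8)
f1_s, f2_s = torch.randn(4, 16, 8, 8), torch.randn(4, 32, 8, 8)

loss = F.mse_loss(relation_matrix(f1_s, f2_s), relation_matrix(f1_t, f2_t).detach())
print(loss)
```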

2.6.2 Relational information: Different Samples

๋‘ ๋ฒˆ์งธ๋Š” ์ด์ „๊นŒ์ง€ ์ €ํฌ๋Š” ํ•™์Šต๋ฐ์ดํ„ฐ ํ•˜๋‚˜ํ•˜๋‚˜๋งˆ๋‹ค ๋‚˜์˜ค๋Š” ๊ฒฐ๊ณผ๋ฅผ Teach์™€ Student๋ฅผ ๊ฐ™๊ฒŒ๋” ํ•™์Šต์‹œ์ผฐ๋Š”๋ฐ, ์ด๋ฒˆ์—” ์—ฌ๋Ÿฌ ํ•™์Šต๋ฐ์ดํ„ฐ์—์„œ ๋‚˜์˜จ ์—ฌ๋Ÿฌ Output์„ ํ•˜๋‚˜์˜ Matrix ํ˜•ํƒœ๋กœ ๋‹ฎ๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

  • Conventional KD focuses on matching features / logits for one input. Relational KD looks at the relations between intermediate features from multiple inputs.

Reference. Relational Knowledge Distillation [Park et al., CVPR 2019]
  • Relation between different samples

    Reference. Relational Knowledge Distillation [Park et al., CVPR 2019]
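A minimal sketch of the distance-wise variant of relational KD follows: instead of matching embeddings one sample at a time, the pairwise-distance structure of a whole batch is matched. The embedding dimensions are assumptions, and the normalization here is a simplification of the paper's distance loss.

```python
import torch
import torch.nn.functional as F

def pairwise_distances(e):                      # e: (N, D)
    d = torch.cdist(e, e, p=2)                  # (N, N) Euclidean distances within the batch
    mean = d[d > 0].mean()                      # normalize by the mean non-zero distance
    return d / (mean + 1e-8)

emb_t = torch.randn(16, 128)                    # teacher embeddings for one batch
emb_s = torch.randn(16, 64)                     # student embeddings (dimension may differ)

rkd_d = F.smooth_l1_loss(pairwise_distances(emb_s), pairwise_distances(emb_t).detach())
print(rkd_d)
```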

So far we have looked at which of the Teacher Network's outputs the Student Network can be trained on. But wasn't our whole purpose in doing TinyML to "use a smaller network"? In other words, is there a way to do this with only a Student Network, without a Teacher Network? The idea born from that question is Self and Online Distillation.

3. Self and Online Distillation

  • What is the disadvantage of fixed large teachers? Does it have to be the case that we need a fixed large teacher in KD?

3.1 Self Distillation

The first, Self Distillation, keeps copying a network with the same architecture. Each copy is trained so that it can also learn from the previously trained network; after training k copies this way, the final output is the ensemble of the copies' outputs. Accuracy increases as k grows, right? (A toy sketch follows the notes below.)

Born-Again Neural Networks [Furlanello et al., ICML 2018]
  • Born-Again Networks generalize defensive distillation by adding iterative training stages and using both a classification objective and a distillation objective in the subsequent stages.

  • Network architecture \(T = S_1=S_2=\dots=S_k\)

  • Network accuracy \(T < S_1 < S_2 < \dots < S_k\)

  • Can also ensemble \(T,S_1, S_2, \dots, S_k\) to get even better performance
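A toy sketch of the born-again procedure is given here: each generation shares the architecture, learns from the labels plus the previous generation, and the generations can be ensembled at the end. The model, data, and number of generations are toy assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))

x, y = torch.randn(256, 20), torch.randint(0, 5, (256,))
teacher = make_net()                                   # generation 0 (its own training omitted)
generations = [teacher]

for gen in range(2):                                   # train S_1, S_2 with the same architecture
    student = make_net()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(100):
        s_logits, t_logits = student(x), generations[-1](x).detach()
        loss = F.cross_entropy(s_logits, y) + F.kl_div(
            F.log_softmax(s_logits, dim=1), F.softmax(t_logits, dim=1),
            reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    generations.append(student)

# Optionally ensemble all generations by averaging their logits.
ensemble_logits = torch.stack([g(x) for g in generations]).mean(dim=0)
print(ensemble_logits.shape)
```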

3.2 Online Distillation

The second is Online Distillation, where the idea is "let's use the same network architecture." The teacher network and the student network are trained together from the very beginning, and one extra term is added to the loss: a KL divergence (see the sketch after the notes below).

Reference: Deep Mutual Learning [Zhang et al., CVPR 2018]
  • Idea: for both the teacher and student networks, we want to add a distillation objective that minimizes the KL divergence to the output distribution of the other party.

  • \(\mathscr{L}(S) = CrossEntropy(S(I), y)+KL(S(I), T(I))\)

  • \(\mathscr{L}(T) = CrossEntropy(T(I), y)+KL(S(I), T(I))\)

  • It is not necessary to pre-train \(T\) first, and \(S=T\) (the same architecture) is allowed

    Reference. Deep Mutual Learning [Zhang et al., CVPR 2018]
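Below is a toy sketch of deep mutual learning with two small networks, each taking a cross-entropy term plus a KL term toward the other's (detached) prediction; the architectures, optimizer, and data are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net_a = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
net_b = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)

x, y = torch.randn(64, 20), torch.randint(0, 5, (64,))

def mutual_step(net, opt, peer):
    logits, peer_logits = net(x), peer(x).detach()     # the peer acts as a fixed target this step
    loss = F.cross_entropy(logits, y) + F.kl_div(
        F.log_softmax(logits, dim=1), F.softmax(peer_logits, dim=1),
        reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for _ in range(5):                                     # the two updates alternate every iteration
    la = mutual_step(net_a, opt_a, net_b)
    lb = mutual_step(net_b, opt_b, net_a)
print(la, lb)
```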

3.3 Combined Distillation

๋งˆ์ง€๋ง‰์€ Self ์™€ Online Distillation์„ ํ•ฉ์นœ ์—ฐ๊ตฌ๋“ค์„ ์†Œ๊ฐœํ• ๊ฒŒ์š”.

The first is the On-the-Fly Native Ensemble. Looking at the architecture, the way it splits into Branch 0, Branch 1, ..., Branch m, each with the same model structure, feels like Self Distillation, and all the branches are trained at the same time (a toy sketch follows the notes below).

Reference. Knowledge Distillation by On-the-Fly Native Ensemble [Lan et al., NeurIPS 2018]
  • Idea: generating multiple output probability distributions and ensemble them as the target distribution for knowledge distillation.

  • Similar to DML (Deep Mutual Learning), ONE allows the teacher model to be exactly the same as the student model, and it does not require retraining the teacher network first. It is also not necessary to train two separate models as in DML.

  • Result

    Reference. Knowledge Distillation by On-the-Fly Native Ensemble [Lan et al., NeurIPS 2018]
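Here is a toy sketch of the on-the-fly ensemble idea: several branches on a shared trunk produce logits, their ensemble (a plain average here; ONE uses a learned gate) serves as the soft target, and each branch distills from it. The architecture and data are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
branches = nn.ModuleList([nn.Linear(32, 5) for _ in range(3)])

x, y = torch.randn(64, 20), torch.randint(0, 5, (64,))
h = shared(x)
branch_logits = [b(h) for b in branches]
ensemble = torch.stack(branch_logits).mean(dim=0)      # on-the-fly teacher

loss = 0.0
for logits in branch_logits:
    loss = loss + F.cross_entropy(logits, y)           # every branch learns the labels...
    loss = loss + F.kl_div(F.log_softmax(logits, dim=1),
                           F.softmax(ensemble.detach(), dim=1),
                           reduction="batchmean")       # ...and mimics the ensemble target
loss = loss + F.cross_entropy(ensemble, y)             # the ensemble itself is also supervised
print(loss)
```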

The second work is called Be Your Own Teacher, which applies Self Distillation by attaching extra layers to the feature map coming out of each stage. The losses used are cross entropy (on the output logits), a KL divergence for each of the additionally attached classifiers, and an intermediate-feature loss. What I found interesting in the results is that I expected the first and second shallow classifiers to barely work, yet from the second stage onward they, and their ensemble, already reach respectable accuracy.

Reference. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation [Zhang et al., ICCV 2019]
  • Use deeper layers to distill shallower layers.

  • Intuition: Labels at later stages are more reliable, so the authors use them to supervise the predictions from the previous stages.

  • Result

    Reference. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation [Zhang et al., ICCV 2019]

4. Distillation for different tasks

So where can the Knowledge Distillation we've covered be applied? The lecture goes through Object Detection, Semantic Segmentation, GANs, and Transformer models. For each one I will only touch on which problem is solved or which idea is used (see the papers for details!).

4.1 Object Detection

Object Detection์€ ์„ธ ๊ฐ€์ง€๋กœ ํ•ด๊ฒฐํ•ด์•ผํ•  ๋ฌธ์ œ๊ฐ€ ๋Š˜์–ด๋‚ฌ์Šต๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” Classification, ๊ทธ๋™์•ˆ ํ•ด์™”๋˜ ๋ถ€๋ถ„์ด๊ตฌ์š”, ๋‹ค๋ฅธ ๋‘ ๊ฐœ๋Š” Background์™€ Foreground์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฒƒ๊ณผ Bounding block ๋ฌธ์ œ ์ž…๋‹ˆ๋‹ค.

Reference. Object Detection: Learning Efficient Object Detection Models with Knowledge Distillation [Chen et al., NeurIPS 2017]

This work uses three losses to handle classification and the background/foreground issue: one on the features, a cross entropy on the output logits that weights background and foreground differently, and finally a bounded regression loss, sketched below.
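As an illustration of the bounded regression term (a sketch under assumed shapes and margin, not the paper's exact formulation), the student's box regression is penalized only when it is worse than the teacher's by more than a margin:

```python
import torch

def bounded_regression_loss(reg_s, reg_t, reg_gt, margin=0.05):
    err_s = ((reg_s - reg_gt) ** 2).sum(dim=1)         # student error vs. ground truth
    err_t = ((reg_t - reg_gt) ** 2).sum(dim=1)         # teacher error vs. ground truth
    penalized = (err_s + margin > err_t).float()       # only where the student is clearly worse
    return (penalized * err_s).mean()

reg_s = torch.randn(32, 4, requires_grad=True)         # predicted box offsets, e.g. (x, y, w, h)
reg_t = torch.randn(32, 4)
reg_gt = torch.randn(32, 4)
print(bounded_regression_loss(reg_s, reg_t, reg_gt))
```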

Reference. Localization: Localization Distillation for Dense Object Detection [Zheng et al., CVPR 2022]

Then how is the bounding box handled? In this paper, each axis (X and Y) is discretized into regions (six in the figure), and the box is pinned down by two points over those regions. The Student Network then learns the distribution the teacher predicts over these bounding-box positions.

Reference. Localization: Localization Distillation for Dense Object Detection [Zheng et al., CVPR 2022]
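A minimal sketch of that idea: each box edge is predicted as a distribution over discrete bins, and the student matches the teacher's per-edge distribution with a temperature-scaled KL divergence. The number of bins, the temperature, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

n_bins, T = 6, 2.0
t_edge_logits = torch.randn(32, 4, n_bins)                     # teacher logits per box edge
s_edge_logits = torch.randn(32, 4, n_bins, requires_grad=True) # student logits per box edge

ld_loss = F.kl_div(
    F.log_softmax(s_edge_logits / T, dim=-1).reshape(-1, n_bins),
    F.softmax(t_edge_logits / T, dim=-1).reshape(-1, n_bins),
    reduction="batchmean") * (T * T)
print(ld_loss)
```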

4.2 Semantic Segmentation

๋‘ ๋ฒˆ์งธ Task์ธ Semantic Segmentation์—์„œ๋Š” Feature์™€ Output Logit์—์„œ Pixel ๋‹จ์œ„๋กœ Loss๋ฅผ ๊ตฌํ•œ ๋‹ค๋Š” ์ , ๊ทธ๋ฆฌ๊ณ  Discriminator ๋ชจ๋ธ์„ ๊ฐ€์ง€๊ณ  ํ•™์Šต์„ ์‹œํ‚จ๋‹ค๋Š” ์ ์ด ๋”ํ•ด์กŒ์Šต๋‹ˆ๋‹ค.

Reference. Semantic Segmentation: Structured Knowledge Distillation for Semantic Segmentation [Liu et al., CVPR 2019]
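A minimal sketch of the pixel-wise term is below: class distributions are matched at every spatial position of the logit maps. The shapes are assumptions, and the paper adds pair-wise and adversarial (discriminator) terms on top of this.

```python
import torch
import torch.nn.functional as F

t_logits = torch.randn(2, 19, 64, 64)                  # teacher logits: (N, classes, H, W)
s_logits = torch.randn(2, 19, 64, 64, requires_grad=True)

pixel_kd = F.kl_div(
    F.log_softmax(s_logits, dim=1),                    # per-pixel class log-probabilities
    F.softmax(t_logits, dim=1),
    reduction="none").sum(dim=1).mean()                # sum over classes, average over pixels
print(pixel_kd)
```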

4.3 GAN

The third task is GANs. As before, the output logits are distilled in the same way, and each stage's feature map is also carried into a KD loss. In addition, this work fine-tunes only the best-performing channel configuration found for each layer.

Reference. GAN: GAN Compression: Efficient Architectures for Interactive Conditional GANs [Li et al., CVPR 2020]

4.4 Transformer

๋งˆ์ง€๋ง‰์€ Transformer ๋ชจ๋ธ์—์„œ Knowledge Distillation ์ž…๋‹ˆ๋‹ค. Transformer๋Š” Feature Map, Attention Map์„ ์•ˆ ๋ณผ ์ˆ˜๊ฐ€ ์—†๋Š”๋ฐ์š”, ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ ๋ณด๋ฉด attention transfer๋ฅผ ํ•˜๊ณ  ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐํ•˜๊ณ  ํ™•์‹คํžˆ Teacher์™€ Attention map๊ฐ€ ๋น„๊ต๊ฐ€ ๋˜๋„ค์š”.

Reference. NLP: MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices [Sun et al., ACL 2020]

5. Network Augmentation, a training technique for tiny machine learning models.

We have covered tasks so far, but wouldn't tiny models have an overfitting problem too? The usual remedy for overfitting is data augmentation, with methods such as Cutout, Mixup, AutoAugment, and Dropout, as in the figures below.

Reference. Data Augmentation(AutoAugment: Learning Augmentation Policies from Data [Cubuk et al., CVPR 2019])

Reference. Dropout(DropBlock: A regularization method for convolutional networks [Ghiasi et al., NeurIPS 2018])

However, if you look at the performance of tiny models with data augmentation applied, it drops for every one of these methods. The idea proposed to address this is "Network Augmentation".

Reference. MIT-TinyML-lecture10-Knowledge-Distillation in https://efficientml.ai
  • Tiny Neural Network lacks capacity! โ†’ NetAug

5.2 Network Augmentation

Network Augmentation์€ ๊ธฐ์กด์— ๋””์ž์ธํ•œ ๋ชจ๋ธ์„ ๊ฐ€์ง€๊ณ  ํ•™์Šต์„ ์‹œํ‚จ ํ›„, ์› ๋ชจ๋ธ๊ณผ ๊ฐ ๋ ˆ์ด์–ด๋งˆ๋‹ค ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ณ€๊ฒฝํ•œ ๋ชจ๋ธ์„ ํ•จ๊ป˜ ์žฌํ•™์Šต์„ ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ณ€๊ฒฝํ•˜๋Š” ๋ชจ๋ธ๊ฐ™์€ ๊ฒฝ์šฐ ์ด์ „์‹œ๊ฐ„ ์‹ค์Šต์— ์žˆ์œผ๋‹ˆ ๊ถ๊ธˆํ•˜์‹œ๋ฉด ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”. ์‹คํ—˜๊ฒฐ๊ณผ๋Š” 1.3 ~ 1.8 % Tiny ๋ชจ๋ธ์ด ์„ฑ๋Šฅ ๊ฐœ์„ ์ด ์ด๋ค„์ง„ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์–ด์š”. ์—ฌ๊ธฐ์„œ ์› ๋ชจ๋ธ(ResNet50)์ด Evaluation์—์„œ๋Š” ์ด๋ฏธ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ์„ ์ถฉ๋ถ„ํžˆ ํ›ˆ๋ จ์‹œ์ผฐ๊ธฐ ๋•Œ๋ฌธ์— ๋”์ด์ƒ ๋Š˜์–ด๋‚˜์ง€ ์•Š๋Š” ๊ฒƒ๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๊ฒ ๋„ค์š”.

  • Training Process

    Reference. Network Augmentation for Tiny Deep Learning [Cai et al., ICLR 2022]

    \[ \mathscr{L}_{aug} = \mathscr{L}(W_{base}) + \alpha \mathscr{L}([W_{base}, W_{aug}]) \]

    • \(\mathscr{L}_{aug}\) = base supervision + \(\alpha \cdot\)auxiliary supervision
  • Learning Curve

    Reference. Network Augmentation for Tiny Deep Learning [Cai et al., ICLR 2022]
  • Result

    Reference. Network Augmentation for Tiny Deep Learning [Cai et al., ICLR 2022]
  • Result for Transfer Learning

    Reference. Network Augmentation for Tiny Deep Learning [Cai et al., ICLR 2022]
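Here is a toy sketch of the NetAug loss above, where the base model's weights are a slice of a wider augmented model so that the auxiliary supervision flows back into the shared base weights. The slicing scheme, layer sizes, and alpha are assumptions; the real method augments the width of convolutional layers inside a supernet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableLinear(nn.Module):
    """A linear layer whose base model uses only the first `base_out` output units."""
    def __init__(self, in_f, base_out, aug_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(aug_out, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(aug_out))
        self.base_out = base_out
    def forward(self, x, augmented=False):
        k = self.weight.shape[0] if augmented else self.base_out
        return F.linear(x, self.weight[:k], self.bias[:k])

layer1 = SliceableLinear(20, 16, 32)     # base width 16, augmented width 32
head_base = nn.Linear(16, 5)
head_aug = nn.Linear(32, 5)

x, y, alpha = torch.randn(64, 20), torch.randint(0, 5, (64,)), 1.0
loss_base = F.cross_entropy(head_base(torch.relu(layer1(x))), y)
loss_aug = F.cross_entropy(head_aug(torch.relu(layer1(x, augmented=True))), y)
loss = loss_base + alpha * loss_aug      # base supervision + alpha * auxiliary supervision
print(loss)
```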

So far we have covered Knowledge Distillation techniques and the applications that use them. Next time I will be back with optimization techniques for TinyEngine ๐Ÿ™‚