๐Ÿง‘โ€๐Ÿซ Lecture 5-6

Quantization
Author

Seunghyun Oh

Published

March 5, 2024

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” MIT HAN LAB์—์„œ ๊ฐ•์˜ํ•˜๋Š” TinyML and Efficient Deep Learning Computing์— ๋‚˜์˜ค๋Š” Quantization ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•˜๋ ค ํ•œ๋‹ค. Quantization(์–‘์žํ™”) ์‹ ํ˜ธ์™€ ์ด๋ฏธ์ง€์—์„œ ์•„๋‚ ๋กœ๊ทธ๋ฅผ ๋””์ง€ํ„ธ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฐœ๋…์ด๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ์—ฐ์†์ ์ธ ์„ผ์„œ๋กœ ๋ถ€ํ„ฐ ๋“ค์–ด์˜ค๋Š” ์•„๋‚ ๋กœ๊ทธ ๋ฐ์ดํ„ฐ ๋‚˜ ์ด๋ฏธ์ง€๋ฅผ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์œ„ ์‹œ๊ฐ„์— ๋Œ€ํ•ด์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•œ๋‹ค.

Reference. MIT-TinyML-lecture5-Quantization-1

๋””์ง€ํ„ธ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์ •ํ•˜๋ฉด์„œ ์ด๋ฅผ ํ•˜๋‚˜์”ฉ ์–‘์žํ™”ํ•œ๋‹ค. ์–‘์ˆ˜์™€ ์Œ์ˆ˜๋ฅผ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด Unsigned Integer ์—์„œ Signed Integer, Signed์—์„œ๋„ Sign-Magnitude ๋ฐฉ์‹๊ณผ Twoโ€™s Complement๋ฐฉ์‹์œผ๋กœ, ๊ทธ๋ฆฌ๊ณ  ๋” ๋งŽ์€ ์†Œ์ˆซ์  ์ž๋ฆฌ๋ฅผ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด Fixed-point์—์„œ Floating point๋กœ ๋ฐ์ดํ„ฐ ํƒ€์ž…์—์„œ ์ˆ˜์˜ ๋ฒ”์ฃผ๋ฅผ ํ™•์žฅ์‹œํ‚จ๋‹ค. ์ฐธ๊ณ ๋กœ Device์˜ Computationality์™€ ML ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ง€ํ‘œ์ค‘ ํ•˜๋‚˜์ธ FLOP์ด ๋ฐ”๋กœ floating point operations per second์ด๋‹ค.

Reference. MIT-TinyML-lecture5-Quantization-1

์ด ๊ธ€์—์„œ floating point๋ฅผ ์ดํ•ดํ•˜๋ฉด, fixed point๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋งค๋ชจ๋ฆฌ์—์„œ, ๊ทธ๋ฆฌ๊ณ  ์—ฐ์‚ฐ์—์„œ ๋” ํšจ์œจ์ ์ผ ๊ฒƒ์ด๋ผ๊ณ  ์˜ˆ์ƒํ•ด๋ณผ ์žˆ ์ˆ˜ ์žˆ๋‹ค. ML๋ชจ๋ธ์„ ํด๋ผ์šฐ๋“œ ์„œ๋ฒ„์—์„œ ๋Œ๋ฆด ๋•Œ๋Š” ํฌ๊ฒŒ ๋ฌธ์ œ๋˜์ง€ ์•Š์•˜์ง€๋งŒ ์•„๋ž˜ ๋‘ ๊ฐ€์ง€ ํ‘œ๋ฅผ ๋ณด๋ฉด ์—๋„ˆ์ง€์†Œ๋ชจ, ์ฆ‰ ๋ฐฐํ„ฐ๋ฆฌ ํšจ์œจ์—์„œ ํฌ๊ฒŒ ์ฐจ์ด๊ฐ€ ๋ณด์ธ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์—์„œ Floating point๋ฅผ fixed point๋กœ ๋” ๋งŽ์ด ๋ฐ”๊พธ๋ ค๊ณ  ํ•˜๋Š”๋ฐ ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ๋‚˜์˜จ ๊ฒƒ์ด ๋ฐ”๋กœ Quatization์ด๋‹ค.

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” Quntization ์ค‘์—์„œ Quantization ๋ฐฉ๋ฒ•๊ณผ ๊ทธ ์ค‘ Linearํ•œ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ๋” ์ž์„ธํ•˜๊ฒŒ, ๊ทธ๋ฆฌ๊ณ  Post-training Quantization๊นŒ์ง€ ๋‹ค๋ฃจ๊ณ , ๋‹ค์Œ ๊ธ€์—์„œ๋Š” Quantization-Aware Training, Binary/Tenary Quantization, Mixed Precision Quantization๊นŒ์ง€ ๋‹ค๋ฃจ๋ ค๊ณ  ํ•œ๋‹ค.

1. Common Network Quantization

As introduced above, quantization for neural networks can be divided as follows. Let's go through the quantization methods one by one.

Reference. MIT-TinyML-lecture5-Quantization-1

Reference. MIT-TinyML-lecture5-Quantization-1 in https://efficientml.ai

1.1 K-Means-based Quantization

The first is K-means-based quantization. Introduced in Deep Compression [Han et al., ICLR 2016], this method clusters the weights around centroid values. Let's look at an example.

Reference. MIT-TinyML-lecture5-Quantization-1

The example above partitions the weights into the codebook values -1, 0, 1.5, and 2 and stores each weight as the index of its cluster. Doing so shrinks weights that used to take 64 bytes down to 20 bytes. The example codebook uses 2 bits, but reducing to N bits in general saves about 32/N× memory. However, the example also shows that this process introduces quantization error, the difference between the weights before and after quantization. Cutting memory is good, but to keep accuracy from degrading it is just as important to keep this error small.
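A rough sketch of the idea (not the lecture's exact implementation; it assumes scikit-learn's KMeans and a 2-bit codebook as in the example):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights: np.ndarray, n_bits: int = 2):
    """Cluster weights into 2**n_bits centroids; store only small indices."""
    km = KMeans(n_clusters=2**n_bits, n_init=10).fit(weights.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()        # fp32 centroids (the codebook)
    indices = km.labels_.reshape(weights.shape)   # n_bits-wide code per weight
    return codebook, indices

def kmeans_dequantize(codebook, indices):
    """Reconstruction: look each index up in the codebook."""
    return codebook[indices]
```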

Reference. MIT-TinyML-lecture5-Quantization-1

To compensate, the quantized weights are fine-tuned as in the figure above. Think of it as fine-tuning the centroids: the gradients of the weights belonging to each centroid are aggregated and used to tune that centroid. The paper proposing this method reports no accuracy drop with centroids down to 4 bits in convolutional layers and down to 2 bits in fully-connected layers.
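A minimal sketch of that centroid update, assuming (as the figure suggests) that the gradients of all weights assigned to a cluster are summed into one step on the shared centroid:

```python
import numpy as np

def update_centroids(codebook, indices, weight_grad, lr=0.01):
    """SGD step on each centroid using the summed gradients of its weights."""
    for k in range(len(codebook)):
        mask = (indices == k)
        if mask.any():
            codebook[k] -= lr * weight_grad[mask].sum()
    return codebook
```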

Reference. Deep Compression [Han et al., ICLR 2016]

์ด๋ ‡๊ฒŒ Quantization ๋œ Weight๋Š” ์œ„์ฒ˜๋Ÿผ ์—ฐ์†์ ์ธ ๊ฐ’์—์„œ ์•„๋ž˜์ฒ˜๋Ÿผ Discreteํ•œ ๊ฐ’์œผ๋กœ ๋ฐ”๋€๋‹ค.

Reference. Deep Compression [Han et al., ICLR 2016]

๋…ผ๋ฌธ์€ ์ด๋ ‡๊ฒŒ Quantizationํ•œ weight๋ฅผ ํ•œ ๋ฒˆ ๋” Huffman coding๋ฅผ ์ด์šฉํ•ด ์ตœ์ ํ™”์‹œํ‚จ๋‹ค. ์งง๊ฒŒ ์„ค๋ช…ํ•˜์ž๋ฉด, ๋นˆ๋„์ˆ˜๊ฐ€ ๋†’์€ ๋ฌธ์ž๋Š” ์งง์€ ์ด์ง„์ฝ”๋“œ๋ฅผ, ๋นˆ๋„ ์ˆ˜๊ฐ€ ๋‚ฎ์€ ๋ฌธ์ž์—๋Š” ๊ธด ์ด์ง„์ฝ”๋“œ๋ฅผ ์“ฐ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์••์ถ• ๊ฒฐ๊ณผ๋กœ Generalํ•œ ๋ชจ๋ธ๊ณผ ์••์ถ• ๋น„์œจ์ด ๊ฝค ํฐ SqueezeNet์„ ์˜ˆ๋กœ ๋“ ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋…ผ๋ฌธ์„ ์ฐธ๊ณ ํ•˜๋Š” ๊ฑธ๋กœ.

Reference. Deep Compression [Han et al., ICLR 2016]

Reference. Deep Compression [Han et al., ICLR 2016]

Decoding the weights for inference means using the stored cluster indices to look up the corresponding values in the codebook. This reduces storage, but it is inherently limited in that the centroids are still floating-point values for computation and memory access.

Reference. MIT-TinyML-lecture5-Quantization-1

Reference. Deep Compression [Han et al., ICLR 2016]

1.2 Linear Quantization

๋‘ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ Linear Quatization์ด๋‹ค. floating-point์ธ weight๋ฅผ N-bit์˜ ์ •์ˆ˜๋กœ affine mapping์„ ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๊ฐ„๋‹จํ•˜๊ฒŒ ์‹์œผ๋กœ ๋ณด๋Š” ๊ฒŒ ๋” ์ดํ•ด๊ฐ€ ์‰ฝ๋‹ค.

Reference. MIT-TinyML-lecture5-Quantization-1

Here S (the scale of the linear quantization) and Z (its zero point) are the quantization parameters, the two values that can be tuned.

Reference. MIT-TinyML-lecture5-Quantization-1

1.3 Scale and Zero point

Reference. MIT-TinyML-lecture5-Quantization-1

The affine mapping using these two parameters, scale and zero point, is shown in the figure above. The lower the bit width, the fewer floating-point values can be represented. So how are the scale and the zero point each computed?

First, set up two equations matching the maximum and minimum of the floating-point range, and solve the system for the scale and the zero point.

  • Scale

    $r_{max} = S(q_{max} - Z), \quad r_{min} = S(q_{min} - Z)$

    $r_{max} - r_{min} = S(q_{max} - q_{min})$

    $S = \dfrac{r_{max} - r_{min}}{q_{max} - q_{min}}$

  • Zero point

    $r_{min} = S(q_{min} - Z)$

    $Z = q_{min} - \dfrac{r_{min}}{S}$

    $Z = \mathrm{round}\!\left(q_{min} - \dfrac{r_{min}}{S}\right)$

For example, in the case below, $r_{max}$ is 2.12 and $r_{min}$ is -1.08; computing the scale gives the figure below, and the zero point comes out to -1.

Reference. MIT-TinyML-lecture5-Quantization-1
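Following the figure's example, a small sketch assuming a signed 2-bit integer range $[q_{min}, q_{max}] = [-2, 1]$:

```python
import numpy as np

def linear_quant_params(r_min, r_max, n_bits):
    """Solve r = S(q - Z) at both ends of the range for S and Z."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

def linear_quantize(r, S, Z, n_bits):
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return np.clip(np.round(r / S) + Z, q_min, q_max).astype(np.int8)

S, Z = linear_quant_params(-1.08, 2.12, n_bits=2)
print(S, Z)  # S ≈ 1.07, Z = -1, matching the figure
```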

Are there other linear quantizations, for example one that restricts the range of r symmetrically? Before that: among the values involved in a matrix multiplication of quantized tensors, some (the quantized weights, scales, zero points) are known ahead of time, so are there terms we can pre-compute to cut the computation at inference time?

1.4 Quantized Matrix Multiplication

Let the input X, weight W, and output Y be related by matrix multiplication, and work through the equations.

$Y = WX$

$S_Y(q_Y - Z_Y) = S_W(q_W - Z_W) \cdot S_X(q_X - Z_X)$

$\vdots$

$q_Y = \dfrac{S_W S_X}{S_Y}\left(q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X\right) + Z_Y$

Reference. MIT-TinyML-lecture5-Quantization-1

Looking at the final rearranged equation,

the terms built only from $q_W$, $Z_W$, and $Z_X$ (such as $Z_X q_W$) can be pre-computed. Also, $S_W S_X / S_Y$ always falls in the range (0, 1), so rewriting it as $2^{-n} M_0$ with $M_0 \in [0.5, 1)$ lets it be represented as an N-bit integer in fixed-point form. Now, what if $Z_W$ were 0? Yet another term could be pre-computed.
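A sketch of the resulting integer pipeline (here the rescale $S_W S_X / S_Y$ is left as a float multiply, rather than the fixed-point $2^{-n} M_0$ form a real kernel would use):

```python
import numpy as np

def quantized_matmul(q_W, q_X, S_W, S_X, S_Y, Z_W, Z_X, Z_Y):
    """Y = WX on quantized tensors: int32 accumulation, one final rescale."""
    acc = (q_W.astype(np.int32) - Z_W) @ (q_X.astype(np.int32) - Z_X)
    q_Y = np.round(S_W * S_X / S_Y * acc) + Z_Y
    return np.clip(q_Y, -128, 127).astype(np.int8)
```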

1.5 Symmetric Linear Quantization

Reference. MIT-TinyML-lecture5-Quantization-1

Zw=0 ์ด๋ผ๊ณ  ํ•จ์€ ๋ฐ”๋กœ ์œ„์™€ ๊ฐ™์€ Weight ๋ถ„ํฌ์ธ๋ฐ, ๋ฐ”๋กœ Symmetricํ•œ Linear Quantization์œผ๋กœ Zw๋ฅผ 0์œผ๋กœ ๋งŒ๋“ค์–ด Zwqxํ•ญ์„ 0์œผ๋กœ ๋‘˜ ์ˆ˜ ์žˆ์–ด ์—ฐ์‚ฐ์„ ๋˜ ์ค„์ผ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

Symmetric Linear Quantization์€ ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์—์„œ Full range mode์™€ Restrict range mode๋กœ ๋‚˜๋‰œ๋‹ค.

The first, full range mode, fits the scale to whichever side of the real-number range (data, weights) is wider. In the example below, |r_min| is larger than |r_max|, so the scale is computed by matching r_min to q_min. The lecture notes this mode is used by PyTorch native quantization and ONNX.

Reference. MIT-TinyML-lecture5-Quantization-1

The second, restricted range mode, fits the scale to the narrower side. In the example below, |r_min| is larger than |r_max|, so the scale is computed by matching r_min against q_max (i.e., the integer range is restricted symmetrically). The lecture notes this mode is used by TensorFlow, NVIDIA TensorRT, and Intel DNNL.

Reference. MIT-TinyML-lecture5-Quantization-1

Then why use the symmetric scheme at all? What distinguishes the asymmetric and symmetric methods? (feat. Neural Network Distiller) The figure below is the reference; the biggest difference reads as computation cost versus how compactly the quantized range is used.

Reference. MIT-TinyML-lecture5-Quantization-1

1.6 Linear Quantization examples

Having covered the quantization method, let's apply it to a fully-connected layer and a convolutional layer and see what effect it has.

1.6.1 Fully-Connected Layer

์•„๋ž˜์ฒ˜๋Ÿผ ์‹์„ ์ „๊ฐœํ•ด๋ณด๋ฉด ๋ฏธ๋ฆฌ ์—ฐ์‚ฐํ•  ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” ํ•ญ๊ณผ N-bit integer๋กœ ํ‘œํ˜„ํ•  ์žˆ๋Š” ํ•ญ์œผ๋กœ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋‹ค(์ „๊ฐœํ•˜๋Š” ์ด์œ ๋Š” ์•„๋งˆ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” ํ•ญ์„ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•จ์ด ์•„๋‹๊นŒ ์‹ถ๋‹ค).

$Y = WX + b$

$\downarrow$

$S_Y(q_Y - Z_Y) = S_W(q_W - Z_W) \cdot S_X(q_X - Z_X) + S_b(q_b - Z_b)$

$\downarrow \quad Z_W = 0$

$S_Y(q_Y - Z_Y) = S_W S_X(q_W q_X - Z_X q_W) + S_b(q_b - Z_b)$

$\downarrow \quad Z_b = 0,\ S_b = S_W S_X$

$S_Y(q_Y - Z_Y) = S_W S_X(q_W q_X - Z_X q_W + q_b)$

$\downarrow$

$q_Y = \dfrac{S_W S_X}{S_Y}(q_W q_X + q_b - Z_X q_W) + Z_Y$

$\downarrow \quad q_{bias} = q_b - Z_X q_W$

$q_Y = \dfrac{S_W S_X}{S_Y}(q_W q_X + q_{bias}) + Z_Y$

For brevity of notation we assume $Z_W = 0$, $Z_b = 0$, and $S_b = S_W S_X$.

Reference. MIT-TinyML-lecture5-Quantization-1
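A sketch of the resulting fully-connected forward pass under those assumptions; note that $q_{bias} = q_b - Z_X q_W$ involves only constants, so it can be folded in ahead of inference:

```python
import numpy as np

def quantized_linear(q_X, q_W, q_b, S_W, S_X, S_Y, Z_X, Z_Y):
    """q_W: [out, in] int8, q_X: [in] int8, q_b: [out] int32."""
    # Precomputable at conversion time: q_bias = q_b - Z_X * q_W
    q_bias = q_b.astype(np.int32) - Z_X * q_W.astype(np.int32).sum(axis=1)
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32) + q_bias
    q_Y = np.round(S_W * S_X / S_Y * acc) + Z_Y
    return np.clip(q_Y, -128, 127).astype(np.int8)
```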

1.6.2 Convolutional Layer

Convolution Layer์˜ ๊ฒฝ์šฐ๋Š” Weight์™€ X์˜ ๊ณฑ์˜ ๊ฒฝ์šฐ๋ฅผ Convolution์œผ๋กœ ๋ฐ”๊ฟ”์„œ ์ƒ๊ฐํ•ด๋ณด๋ฉด ๋œ๋‹ค. ๊ทธ๋„ ๊ทธ๋Ÿด ๊ฒƒ์ด Convolution์€ Kernel๊ณผ Input์˜ ๊ณฑ์˜ ํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Full-Connected์™€ ๊ฑฐ์˜ ์œ ์‚ฌํ•˜๊ฒŒ ์ „๊ฐœ๋  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

Reference. MIT-TinyML-lecture5-Quantization-1

2. Post-training Quantization (PTQ)

Can we fine-tune the layers quantized above? Around the question "How should we get the optimal linear quantization parameters (S, Z)?", let's look at the three targets, weights, activations, and bias, along with the results the cited papers report for each.

2.1 Weight quantization

TL;DR. ์ด ๊ฐ•์˜์—์„œ ์†Œ๊ฐœํ•˜๋Š” Weight quantization์€ Grandularity์— ๋”ฐ๋ผ Whole(Per-Tensor), Channel, ๊ทธ๋ฆฌ๊ณ  Layer๋กœ ๋“ค์–ด๊ฐ„๋‹ค.

2.1.1 Granularity

Weight quantization์—์„œ Granularity์— ๋”ฐ๋ผ์„œ Per-Tensor, Per-Channel, Group, ๊ทธ๋ฆฌ๊ณ  Generalized ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ํ™•์žฅ์‹œ์ผœ Shared Micro-exponent(MX) data type์„ ์ฐจ๋ก€๋กœ ๋ณด์—ฌ์ค€๋‹ค. Scale์„ ๋ช‡ ๊ฐœ๋‚˜ ๋‘˜ ๊ฒƒ์ด๋ƒ, ๊ทธ Scale์„ ์ ์šฉํ•˜๋Š” ๋ฒ”์œ„๋ฅผ ์–ด๋–ป๊ฒŒ ๋‘˜ ๊ฒƒ์ด๋ƒ, ๊ทธ๋ฆฌ๊ณ  Scale์„ ์–ผ๋งˆ๋‚˜ ๋””ํ…Œ์ผํ•˜๊ฒŒ(e.g. floating-point)ํ•  ๊ฒƒ์ด๋ƒ์— ์ดˆ์ ์„ ๋‘”๋‹ค.

Reference. MIT-TinyML-lecture5-Quantization-2

First is per-tensor quantization: nothing special to explain, just the linear quantization described so far with a single scale. Characteristically, accuracy holds up for large models but drops sharply for small ones. The lecture attributes the post-quantization accuracy drop to weight ranges that vary widely across channels and to outlier weights.

Reference. MIT-TinyML-lecture5-Quantization-2

The fix is the second method, per-channel quantization. In the example above, each channel keeps its own maximum and a matching scale. In the resulting figure below, per-channel stays closer to the original floating-point weights than per-tensor does. Keep in mind, though, that if the hardware does not support per-channel quantization, the extra computation makes it a poor fit (this ties into the cache-based per-channel optimizations in the earlier TinyEngine post). So is there yet another option?

Reference. MIT-TinyML-lecture5-Quantization-2
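A sketch of the difference, assuming symmetric scales computed from the maximum absolute weight:

```python
import numpy as np

def per_tensor_scale(W, n_bits=8):
    """One symmetric scale shared by the whole tensor."""
    return np.abs(W).max() / (2 ** (n_bits - 1) - 1)

def per_channel_scales(W, n_bits=8):
    """One symmetric scale per output channel (axis 0 of W)."""
    r_max = np.abs(W).reshape(W.shape[0], -1).max(axis=1)
    return r_max / (2 ** (n_bits - 1) - 1)
```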

์„ธ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ Group Quantization์œผ๋กœ ์†Œ๊ฐœํ•˜๋Š” Per-vector Scaled Quantization์™€ Shared Micro-exponent(MX) data type ์ด๋‹ค. Per-vector Scaled Quantization์€ 2023๋…„๋„ ๊ฐ•์˜๋ถ€ํ„ฐ ์†Œ๊ฐœํ•˜๋Š”๋ฐ, ์ด ๋ฐฉ๋ฒ•์€ Scale factor๋ฅผ ๊ทธ๋ฃน๋ณ„๋กœ ํ•˜๋‚˜, Per-Tensor๋กœ ํ•˜๋‚˜๋กœ ๋‘๊ฐœ๋ฅผ ๋‘๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์•„๋ž˜์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด,

Reference. MIT-TinyML-lecture5-Quantization-2

$r = S(q - Z) \;\rightarrow\; r = \gamma \cdot S_q(q - Z)$

$S_q$ applies per-vector scaling while $\gamma$ scales the whole tensor, and $\gamma$ is kept in floating point. Scaling at vector granularity looks easier than per-channel for tuning the accuracy trade-off to a given hardware platform.

์—ฌ๊ธฐ์„œ ๊ฐ•์˜๋Š” ์ง€ํ‘œ์ธ Memory Overhead๋กœ โ€œEffective Bit Widthโ€๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. ์ด๋Š” Microsoft์—์„œ ์ œ๊ณตํ•˜๋Š” Quantization Approach MX4, MX6, MX9๊ณผ ์—ฐ๊ฒฐ๋ผ ์žˆ๋Š”๋ฐ, ์ด ๋ฐ์ดํ„ฐํƒ€์ž…์€ ์กฐ๊ธˆ ์ดํ›„์— ๋” ์ž์„ธํžˆ ์„ค๋ช…ํ•  ๊ฒƒ์ด๋‹ค. Effective Bit Width? ์˜ˆ์‹œ ํ•˜๋‚˜๋ฅผ ๋“ค์–ด ์ดํ•ดํ•ด๋ณด์ž. ๋งŒ์•ฝ 4-bit Quatization์„ 4-bit per-vector scale์„ 16 elements(4๊ฐœ์˜ weight๊ฐ€ ๊ฐ๊ฐ 4bit๋ฅผ ๊ฐ€์ง„๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด 16 element๋กœ ๊ณ„์‚ฐ๋œ๋‹ค ์œ ์ถ”ํ•  ์žˆ๋‹ค) ๋ผ๋ฉด, Effective Bit Width๋Š” 4(Scale bit) + 4(Vector Scale bit) / 16(Vector Size) = 4.25๊ฐ€ ๋œ๋‹ค. Element๋‹น Scale bit๋ผ๊ณ  ๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒ๊ฐํ•  ์ˆ˜๋„ ์žˆ์„ ๋“ฏ ์‹ถ๋‹ค.

Working through per-vector scaled quantization, the earlier per-tensor and per-channel schemes differ only in how much gets grouped under one scale, which suggests they can be generalized. That generalization is exactly the multi-level scaling scheme the lecture introduces next. Let's start from per-channel quantization and per-vector scaled quantization (VSQ).

Reference. With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]

Per-channel quantization has a single scale factor, so its effective bit width is 4. And VSQ, as computed before, comes to 4.25 (note that the per-channel scale, perhaps because its element count is so large, is not counted toward the effective bit width). With VSQ in view, the effective bit width is:

Effective bit width = element bits + Group 0 scale bits / Group 0 size + ...
e.g. VSQ data type int4 = element bits (4) + Group 0 scale bits (4) / Group 0 size (16) = 4.25

That is the computation. Then come MX4, MX6, and MX9. For reference, S means the sign bit, M the mantissa bits, and E the exponent bits (see the floating point vs fixed point post for details on mantissa and exponent). Below is the table for Microsoft's MX4, MX6, and MX9 quantization approaches.

Reference. MIT-TinyML-lecture5-Quantization-1
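A small helper for the metric above, assuming each scaling level contributes its scale bits divided by its group size:

```python
def effective_bit_width(element_bits, levels):
    """levels: (scale_bits, group_size) pairs, one per scaling level."""
    return element_bits + sum(bits / size for bits, size in levels)

print(effective_bit_width(4, [(4, 16)]))  # VSQ int4 -> 4.25
```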

2.1.2 Weight Equalization

So far we've covered weight quantization methods organized by how many weights share a scale (granularity, in the lecture's terms). The next method is weight equalization, introduced in the 2022 lecture: scale the output channels of layer i down while scaling the input channels of layer i+1 up, shrinking the range differences between layers that drive quantization error.

Reference. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]

For example, as in the figure above, take layer i's output channels and layer i+1's input channels. Expanding the equations gives:

$y^{(i+1)} = f\!\left(W^{(i+1)} x^{(i+1)} + b^{(i+1)}\right) = f\!\left(W^{(i+1)} \cdot f\!\left(W^{(i)} x^{(i)} + b^{(i)}\right) + b^{(i+1)}\right) = f\!\left(W^{(i+1)} S \cdot f\!\left(S^{-1} W^{(i)} x^{(i)} + S^{-1} b^{(i)}\right) + b^{(i+1)}\right)$

where $S = \mathrm{diag}(s)$ and $s_j$ is the weight equalization scale factor of output channel j.

Here, when the scale S is applied to layer i+1's weights and $S^{-1}$ to layer i's, the expression stays equivalent to the unscaled one. That is,

$r_{oc=j}^{(i)} / s_j = r_{ic=j}^{(i+1)} \cdot s_j$

$s_j = \dfrac{1}{r_{ic=j}^{(i+1)}}\sqrt{r_{oc=j}^{(i)} \cdot r_{ic=j}^{(i+1)}}$

$\hat{r}_{oc=j}^{(i)} = r_{oc=j}^{(i)} / s_j = \sqrt{r_{oc=j}^{(i)} \cdot r_{ic=j}^{(i+1)}}$

$\hat{r}_{ic=j}^{(i+1)} = r_{ic=j}^{(i+1)} \cdot s_j = \sqrt{r_{oc=j}^{(i)} \cdot r_{ic=j}^{(i+1)}}$

This way, scaling layer i's output channel j by $1/s_j$ and layer i+1's input channel j by $s_j$ equalizes the two ranges and narrows the gap between the weights.
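A rough sketch for a pair of fully-connected layers, assuming the per-channel range r is measured as the maximum absolute weight:

```python
import numpy as np

def equalize_pair(W1, b1, W2):
    """Cross-layer equalization: W1 is [out, in] of layer i,
    W2 is [out, in] of layer i+1 (its input channels are axis 1)."""
    r1 = np.abs(W1).max(axis=1)       # range of layer i output channels
    r2 = np.abs(W2).max(axis=0)       # range of layer i+1 input channels
    s = np.sqrt(r1 * r2) / r2         # s_j = (1/r2_j) * sqrt(r1_j * r2_j)
    return W1 / s[:, None], b1 / s, W2 * s[None, :]
```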

2.1.3 Adaptive rounding

Reference. MIT-TinyML-lecture5-Quantization-1
The last method introduced is adaptive rounding (AdaRound). Besides the usual round-to-nearest, one can imagine an adaptive rounding that learns which way to round. The lecture argues that round-to-nearest is not optimal; AdaRound instead adds a learned value between 0 and 1 to each floored weight,

$\tilde{w} = \lfloor \lfloor w \rfloor + \delta \rceil, \quad \delta \in [0, 1]$

and finds the optimal rounding by solving

$\underset{V}{\mathrm{argmin}}\ \left\lVert Wx - \tilde{W}x \right\rVert_F^2 + \lambda f_{reg}(V) \;\rightarrow\; \underset{V}{\mathrm{argmin}}\ \left\lVert Wx - \lfloor \lfloor W \rfloor + h(V) \rceil\, x \right\rVert_F^2 + \lambda f_{reg}(V)$

2.2 Activation quantization

Second is activation quantization. Since activations directly shape the model's outputs, the lecture introduces methods that consider two things: one uses an Exponential Moving Average (EMA) during training so the observed activation statistics have a smoothed distribution, and the other calibrates against the FP32 model with batch samples to account for varied inputs.

Exponential Moving Average (EMA)์€ ์•„๋ž˜ ์‹์—์„œ ฮฑ ๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. r^max,min(t)=ฮฑrmax,min(t)+(1โˆ’ฮฑ)r^max,min(t) Calibration์˜ ์ปจ์…‰์€ ๋งŽ์€ input์˜ min/max ํ‰๊ท ์„ ์ด์šฉํ•˜์ž๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ž˜์„œ trained FP32 model๊ณผ sample batch๋ฅผ ๊ฐ€์ง€๊ณ  quantizedํ•œ ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ์™€ calibration์„ ๋Œ๋ฆฌ๋ฉด์„œ ๊ทธ ์ฐจ์ด๋ฅผ ์ตœ์†Œํ™” ์‹œํ‚ค๋Š”๋ฐ, ์—ฌ๊ธฐ์— ์ด์šฉํ•˜๋Š” ์ง€ํ‘œ๋Š” loss of information์™€ Newton-Raphson method๋ฅผ ์‚ฌ์šฉํ•œ Mean Square Error(MSE)๊ฐ€ ์žˆ๋‹ค. MSE=min|r|max E[(Xโˆ’Q(X))2] KL divergence=DKL(P||Q)=โˆ‘iNP(xi)logP(xi)Q(xi) ### 2.3 Quanization Bias Correction

2.3 Quantization Bias Correction

Finally, correcting the biased error introduced by quantization. Let $\epsilon = Q(W) - W$ and expand as below; the $-\epsilon\,\mathbb{E}[x]$ term in the last line removes the bias that quantization adds (the 2023 lecture skips this part; whether because it is considered obvious or because the effect is small, I don't know. Looking at one MobileNetV2 layer's output after bias correction, the bias does look mostly removed).

$\mathbb{E}[y] = \mathbb{E}[Wx] + \mathbb{E}[\epsilon x] - \mathbb{E}[\epsilon x], \quad \mathbb{E}[Q(W)x] = \mathbb{E}[Wx] + \mathbb{E}[\epsilon x]$

$\mathbb{E}[y] = \mathbb{E}[Q(W)x] - \epsilon\,\mathbb{E}[x]$

Reference. MIT-TinyML-lecture5-Quantization-2

2.4 Post-Training INT8 Linear Quantization Result

The results of applying the post-training quantization above are shown. All the models are image models, and the accuracy drop is reported as the metric. Relatively large models hold up well, but small models such as MobileNetV1 and V2 lose more accuracy to quantization than you might expect (-11.8%, -2.1%). So how should small models be trained?

Reference. MIT-TinyML-lecture5-Quantization-2

3. Quantization-Aware Training(QAT)

3.1 Quantization-Aware Training

Reference. MIT-TinyML-lecture06-Quantization-2
  • Usually, fine-tuning a pre-trained floating point model provides better accuracy than training from scratch.

์ด์ „์— K-mean Quantization์—์„œ Fine-tuning๋•Œ Centroid์— gradient๋ฅผ ๋ฐ˜์˜ํ–ˆ์—ˆ๋‹ค. Quantization-Aware Training์€ ์ด์™€ ์œ ์‚ฌํ•˜๊ฒŒ Quantization - Reconstruction์„ ํ†ตํ•ด ๋งŒ๋“ค์–ด์ง„ Weight๋กœ Training์„ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋งํ•œ๋‹ค. ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด์„œ ์ž์„ธํžˆ ์‚ดํŽด๋ณด์ž.

Reference. MIT-TinyML-lecture06-Quantization-2
  • A full precision copy of the weights W is maintained throughout the training.
  • The small gradients are accumulated without loss of precision
  • Once the model is trained, only the quantized weights are used for inference

์œ„ ๊ทธ๋ฆผ์—์„œ Layer N์ด ๋ณด์ธ๋‹ค. ์ด Layer N์€ weights๋ฅผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๊ฐ€์ง€์ง€๋งŒ, ์‹ค์ œ๋กœ Training ๊ณผ์ •์—์„œ ์“ฐ์ด๋Š” weight๋Š” โ€œweight quantizationโ€์„ ํ†ตํ•ด Quantization - Reconstruction์„ ํ†ตํ•ด ๋งŒ๋“ค์–ด์ง„ Weight๋ฅผ ๊ฐ€์ง€๊ณ  ํ›ˆ๋ จ์„ ํ•  ๊ฒƒ์ด๋‹ค.

3.2 Straight-Through Estimator(STE)

Reference. MIT-TinyML-lecture06-Quantization-2

๊ทธ๋Ÿผ ํ›ˆ๋ จ์—์„œ gradient๋Š” ์–ด๋–ป๊ฒŒ ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์„๊นŒ? Quantization์˜ ๊ฐœ๋…์ƒ, weight quantization์—์„œ weight๋กœ ๋„˜์–ด๊ฐ€๋Š” gradient๋Š” ์—†์„ ์ˆ˜ ๋ฐ–์— ์—†๋‹ค. ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ์‚ฌ์‹ค์ƒ weight๋กœ back propagation์ด ๋  ์ˆ˜ ์—†๊ฒŒ ๋˜๊ณ , ๊ทธ๋ž˜์„œ ์†Œ๊ฐœํ•˜๋Š” ๊ฐœ๋…์ด Straight-Through Estimator(STE) ์ž…๋‹ˆ๋‹ค. ๋ง์ด ๊ฑฐ์ฐฝํ•ด์„œ ๊ทธ๋ ‡์ง€, Q(W)์—์„œ ๋ฐ›์€ gradient๋ฅผ ๊ทธ๋Œ€๋กœ weights ๋กœ ๋„˜๊ฒจ์ฃผ๋Š” ๋ฐฉ์‹์ด๋‹ค.

  • Quantization is discrete-valued, and thus the derivative is 0 almost everywhere โ†’ NN will learn nothing!

  • Straight-Through Estimator(STE) simply passes the gradients through the quantization as if it had been the identity function.

    $g_W = \dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial Q(W)}$

Reference. MIT-TinyML-lecture06-Quantization-2
  • Reference
    • Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
    • Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
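A minimal PyTorch sketch of STE, assuming plain rounding as the quantizer:

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Quantize in forward; pass the gradient through unchanged in backward."""
    @staticmethod
    def forward(ctx, w):
        return torch.round(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output      # dL/dW := dL/dQ(W), as if Q were the identity

w = torch.randn(4, requires_grad=True)
RoundSTE.apply(w).sum().backward()
print(w.grad)  # all ones: the gradient skipped the non-differentiable round()
```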

์ด ํ›ˆ๋ จ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๊ถ๊ธˆํ•˜์‹œ๋‹ค๋ฉด ์ด ๋…ผ๋ฌธ์„ ์ฐธ๊ณ ํ•˜์ž. ์ฐธ๊ณ ๋กœ ๋…ผ๋ฌธ์—์„œ๋Š” MobileNetV1, V2 ๊ทธ๋ฆฌ๊ณ  NASNet-Mobile์„ ์ด์šฉํ•ด Post-Training Quantization๊ณผ Quantization-Aware Training์„ ๋น„๊ตํ•˜๊ณ  ์žˆ๋‹ค.

4. Binary and Ternary Quantization

์ž, ๊ทธ๋Ÿผ Quantization์„ ๊ถ๊ทน์ ์œผ๋กœ 2bit๋กœ ํ•  ์ˆ˜๋Š” ์—†์„๊นŒ? ๋ฐ”๋กœ Binary(1, -1)๊ณผ Tenary(1, 0, -1) ์ด๋‹ค.

  • Can we push the quantization precision to 1 bit?

    Reference. MIT-TinyML-lecture06-Quantization-2
  • Reference

    • BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
    • XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

First, binarizing the weights cuts memory 32×, from 32 bits to 1 bit, and roughly halves computation: (8×5)+(-3×2)+(5×0)+(-1×1) becomes 5-2+0-1, with the multiplications gone.

4.1 Binarization: Deterministic Binarization

๊ทธ๋Ÿผ Binarization์—์„œ +1๊ณผ -1์„ ์–ด๋–ค ๊ธฐ์ค€์œผ๋กœ ํ•ด์•ผํ• ๊นŒ? ๊ฐ€์žฅ ์‰ฌ์šด ๋ฐฉ๋ฒ•์€ threhold๋ฅผ ๊ธฐ์ค€์œผ๋กœ +-1๋กœ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์ด๋‹ค.

Directly computes the bit value based on a threshold, usually 0, resulting in a sign function.

$q = \mathrm{sign}(r) = \begin{cases} +1, & r \geq 0 \\ -1, & r < 0 \end{cases}$

4.2 Binarization: Stochastic Binarization

The other approach emits ±1 stochastically, with probability given by passing the output through a hard-sigmoid function. The lecture mentions it is not used in practice, though, since hardware that generates random bits during quantization is hard to build.

  • Use global statistics or the value of input data to determine the probability of being -1 or +1

  • In BinaryConnect (BC), the probability is determined by the hard sigmoid function ฯƒ(r)

    $q = \begin{cases} +1, & \text{with probability } p = \sigma(r) \\ -1, & \text{with probability } 1 - p \end{cases} \quad \text{where } \sigma(r) = \min\!\left(\max\!\left(\dfrac{r+1}{2}, 0\right), 1\right)$

    Reference. MIT-TinyML-lecture06-Quantization-2
  • Harder to implement as it requires the hardware to generate random bits when quantizing.

4.3 Binarization: Use Scale

์•ž์„  ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•ด์„œ ImageNet Top-1 ์„ ํ‰๊ฐ€ํ•ด๋ณด๋ฉด Quantization์ดํ›„ -21.2%๋‚˜ ์„ฑ๋Šฅ์ด ํ•˜๋ฝํ•˜๋Š” ๊ฑธ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. โ€œ์–ด๋–ป๊ฒŒ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ์„๊นŒ?โ€ ํ•œ ๊ฒƒ์ด linear qunatization์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ Scale ๊ฐœ๋…์ด๋‹ค.

  • Using Scale, Minimizing Quantization Error in Binarization

    Reference. MIT-TinyML-lecture06-Quantization-2

์—ฌ๊ธฐ์„œ Scale์€ 1n||W||1 ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ณ , ์„ฑ๋Šฅ์€ ํ•˜๋ฝ์ด ๊ฑฐ์˜ ์—†๋Š” ๊ฒƒ๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์™œ 1n||W||1์ธ์ง€๋Š” ์•„๋ž˜ ์ฆ๋ช…๊ณผ์ •์„ ์ฐธ๊ณ ํ•˜์ž!

  • Why is $\alpha = \frac{1}{n}\lVert W \rVert_1$?

    $J(B, \alpha) = \lVert W - \alpha B \rVert^2, \quad \alpha^*, B^* = \underset{\alpha, B}{\mathrm{argmin}}\ J(B, \alpha)$

    $J(B, \alpha) = \alpha^2 B^T B - 2\alpha W^T B + W^T W$

    since $B \in \{+1, -1\}^n$, $B^T B = n$ (constant) and $W^T W$ is a constant (a known variable), so

    $J(B, \alpha) = \alpha^2 n - 2\alpha W^T B + C$

    $B^* = \underset{B}{\mathrm{argmax}}\ W^T B \quad \text{s.t. } B \in \{+1, -1\}^n$

    $\alpha^* = \dfrac{W^T B^*}{n} = \dfrac{W^T \mathrm{sign}(W)}{n} = \dfrac{\sum |W_i|}{n} = \dfrac{1}{n}\lVert W \rVert_{\ell 1}$

    • Reference. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
    • Since B* must minimize $J(B, \alpha)$, $W^T B$ must be maximal; for that, B must be positive where W is positive and negative where W is negative, so that $W^T B = \sum |W|$, its maximum.
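A one-line sketch of the result (scaled binarization, XNOR-Net style):

```python
import numpy as np

def binarize(W):
    """W ≈ alpha * B with B = sign(W), alpha = mean |w| (= ||W||_1 / n)."""
    alpha = np.abs(W).mean()
    B = np.where(W >= 0, 1, -1)
    return alpha, B
```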

4.4 Binarization: Activation

Now let's quantize the activations as well.

4.4.1 Activation


์—ฌ๊ธฐ์„œ ์กฐ๊ธˆ ๋” ์—ฐ์‚ฐ์„ ์ตœ์ ํ™” ํ•  ์ˆ˜ ์žˆ์–ด๋ณด์ด๋Š” ๊ฒƒ์ด Matrix Muliplication์ด XOR ์—ฐ์‚ฐ๊ณผ ๋น„์Šทํ•˜๊ฒŒ ๋ณด์ธ๋‹ค.

4.4.2 XNOR bit count

Reference. MIT-TinyML-lecture06-Quantization-2
  • $y_i = -n + \left(\mathrm{popcount}(W_i\ \mathrm{xnor}\ x) \ll 1\right)$ → popcount returns the number of 1 bits

So popcount and XNOR optimize the computation one step further. With this optimization, memory shrinks about 32× and computation about 58×, as shown below.

Reference. MIT-TinyML-lecture06-Quantization-2
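A toy sketch of the XNOR-popcount dot product, assuming ±1 values packed into integer bit masks (bit 1 encodes +1, bit 0 encodes -1):

```python
def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """y = -n + 2 * popcount(w xnor x) for n packed ±1 values."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # keep only the n valid bits
    return -n + (bin(xnor).count("1") << 1)

# (+1, -1, +1) · (+1, +1, +1) = 1 - 1 + 1 = 1
print(binary_dot(0b101, 0b111, 3))  # 1
```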

So binary quantization breaks down into these four steps: weights, scale factor, activations, and XNOR-bitcount. Next, let's look at ternary quantization.

Reference. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
    1. For binarizing the input, the average is applied to every channel alike, so that factor c can be applied all at once with an average filter.

4.5 Ternary Weight Networks(TWN)

Ternary follows the same steps as binary quantization but adds 0 as a representable value. The scheme below uses a scale to reduce the quantization error:

$q = \begin{cases} r_t, & r > \Delta \\ 0, & |r| \leq \Delta \\ -r_t, & r < -\Delta \end{cases} \quad \text{where } \Delta = 0.7 \times \mathbb{E}(|r|),\ r_t = \underset{|r| > \Delta}{\mathbb{E}}(|r|)$

Reference. Trained Ternary Quantization [Zhu et al., ICLR 2017]
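A sketch of that ternarization rule:

```python
import numpy as np

def ternary_quantize(W):
    """TWN rule: Delta = 0.7 * E|w|; r_t = mean |w| over weights above Delta."""
    delta = 0.7 * np.abs(W).mean()
    r_t = np.abs(W[np.abs(W) > delta]).mean()
    return np.where(W > delta, r_t, np.where(W < -delta, -r_t, 0.0))
```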

Tenary Quantization์—์„œ ๋˜ ํ•œ๊ฐ€์ง€ ๋‹ค๋ฅด๊ฒŒ ์„ค๋ช…ํ•˜๋Š” ๊ฒƒ์€ 1๊ณผ -1๋กœ๋งŒ ์ •ํ•ด์ ธ ์žˆ๋˜ Binary Quantization๊ณผ ๋‹ค๋ฅด๊ฒŒ Tenary๋Š” 1, 0, -1๋กœ Quantization์„ ํ•œ ํ›„, ์ถ”๊ฐ€์ ์ธ ํ›ˆ๋ จ์„ ํ†ตํ•ด wt์™€ โˆ’wt๋กœ fine-tuning์„ ํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ์ œ์•ˆํ•œ๋‹ค(ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ๋ฒ•์„ ์ด์šฉํ•ด์„œ ํ•œ ๊ฒฐ๊ณผ๋ฅผ CIFAR-10 ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ResNets, AlexNet, ImageNet์—์„œ ๋ณด์—ฌ์ค€๋‹ค). q={wt,r>ฮ”0,|r|โ‰คฮ”โˆ’wt,r<โˆ’ฮ” Reference. Trained Ternary Quantization [Zhu et al., ICLR 2017]

4.7 Accuracy Degradation

Binary, Ternary Quantization์„ ์‚ฌ์šฉํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค(Resnet-18 ๊ฒฝ์šฐ์—๋Š” Ternary ๊ฐ€ ์˜คํžˆ๋ ค Binary๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋” ๋–จ์–ด์ง„๋‹ค!)

  • Binarization

    Reference. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or โˆ’1. [Courbariaux et al., Arxiv 2016], XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
  • Ternary Weight Networks (TWN)

    Reference. Ternary Weight Networks [Li et al., Arxiv 2016]
  • Trained Ternary Quantization (TTQ)

    Reference. Trained Ternary Quantization [Zhu et al., ICLR 2017]

5. Low Bit-Width Quantization

๋‚จ์€ ๋ถ€๋ถ„๋“ค์€ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์‹คํ—˜ / ์—ฐ๊ตฌ๋“ค์„ ์†Œ๊ฐœํ•˜๊ณ  ์žˆ๋‹ค.

  • Binary Quantization์€ Quantization Aware Training์„ ํ•  ์ˆ˜ ์žˆ์„๊นŒ?
  • 2,3 bit๊ณผ 8bit ๊ทธ ์ค‘๊ฐ„์œผ๋กœ๋Š” Quantization์„ ํ•  ์ˆ˜ ์—†์„๊นŒ?
  • ๋ ˆ์ด์–ด์—์„œ Quantization์„ ํ•˜์ง€ ์•Š๋Š” ๋ ˆ์ด์–ด, ์˜ˆ๋ฅผ ๋“ค์–ด ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ์˜ˆ๋ฏผํ•˜๊ฒŒ ๋ฏธ์น˜๋Š” ์ฒซ ๋ฒˆ์งธ ๋ ˆ์ด์–ด๊ฐ€ ๊ฐ™์€ ๊ฒฝ์šฐ Quantization์„ ํ•˜์ง€ ์•Š์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?
  • Activation ํ•จ์ˆ˜๋ฅผ ๋ฐ”๊พธ๋ฉด ์–ด๋–จ๊นŒ?
  • ์˜ˆ๋ฅผ ๋“ค์–ด ์ฒซ๋ฒˆ์งธ ๋ ˆ์ด์–ด์˜ N๋ฐฐ ๋„“๊ฒŒ ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์ด ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋ฐ”๊พธ๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?
  • ์กฐ๊ธˆ์”ฉ Quantization์„ ํ•  ์ˆ˜ ์—†์„๊นŒ? (20% โ†’ 40% โ†’ โ€ฆ โ†’ 100%)

๊ฐ•์˜์—์„œ๋Š” ํฌ๊ฒŒ ์–ธ๊ธ‰ํ•˜์ง€ ์•Š๊ณ  ๊ฐ„ ๋‚ด์šฉ๋“ค์ด๋ผ ์„ค๋ช…์„ ํ•˜์ง€๋Š” ์•Š๊ฒ ๋‹ค. ํ•ด๋‹น ๋‚ด์šฉ๋“ค์€ ์ž์„ธํ•œ ๋‚ด์šฉ์„ ์•Œ๊ณ ์‹ถ์œผ๋ฉด ๊ฐ ํŒŒํŠธ์— ์–ธ๊ธ‰๋œ ๋…ผ๋ฌธ์„ ์ฐธ์กฐํ•˜๊ธธ!

5.1 Train Binarized Neural Networks From Scratch

  • Straight-Through Estimator(STE)

Reference. MIT-TinyML-lecture06-Quantization-2
  • Gradients pass straight through to the floating-point weights
  • The floating-point weights are kept within [-1, 1]
  • Reference. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or โˆ’1. [Courbariaux et al., Arxiv 2016]

5.2 Quantization-Aware Training: DoReFa-Net With Low Bit-Width Gradients

Reference. MIT-TinyML-lecture06-Quantization-2
  • Gradient Quantization

    $Q(g) = 2 \cdot \max(|G|) \cdot \left[\mathrm{quantize}_k\!\left(\dfrac{g}{2 \cdot \max(|G|)} + \dfrac{1}{2} + N(k)\right) - \dfrac{1}{2}\right] \quad \text{where } N(k) = \dfrac{\sigma}{2^k - 1} \text{ and } \sigma \sim \mathrm{Uniform}(-0.5, 0.5)$

    • The noise function N(k) is added to compensate for the potential bias introduced by gradient quantization.
  • Result

    Reference. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]

5.3 Replace the Activation Function: Parameterized Clipping Activation Function

  • The most common activation function ReLU is unbounded. The dynamic range of inputs becomes problematic for low bit-width quantization due to very limited range and resolution.

  • ReLU is replaced with hard-coded bounded activation functions: ReLU6, ReLU1, etc

  • The clipping value per layer can be learned as well: PACT(Parametrized Clipping Activation Function)

    Reference. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]

    $y = \mathrm{PACT}(x; \alpha) = 0.5\left(|x| - |x - \alpha| + \alpha\right) = \begin{cases} 0, & x \in (-\infty, 0) \\ x, & x \in [0, \alpha) \\ \alpha, & x \in [\alpha, +\infty) \end{cases}$

    The upper clipping value of the activation function is trainable. With STE, the gradient is computed as

    $\dfrac{\partial Q(y)}{\partial \alpha} = \dfrac{\partial Q(y)}{\partial y} \cdot \dfrac{\partial y}{\partial \alpha} = \begin{cases} 0, & x \in (-\infty, \alpha) \\ 1, & x \in [\alpha, +\infty) \end{cases}$

    $\rightarrow \dfrac{\partial L}{\partial \alpha} = \dfrac{\partial L}{\partial Q(y)} \cdot \dfrac{\partial Q(y)}{\partial \alpha} = \begin{cases} 0, & x \in (-\infty, \alpha) \\ \dfrac{\partial L}{\partial Q(y)}, & x \in [\alpha, +\infty) \end{cases}$

    The larger $\alpha$ is, the more the parameterized clipping function resembles a ReLU function (see the sketch after the results below).

    • To avoid large quantization errors due to a wide dynamic range [0, α], an L2-regularizer for α is included in the training loss function.
  • Result

    Reference. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
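A small PyTorch sketch of PACT with a learnable clipping level (the starting value α = 6.0 is an assumption, not from the paper):

```python
import torch

def pact(x, alpha):
    """PACT: y = 0.5 * (|x| - |x - alpha| + alpha), i.e. clamp(x, 0, alpha)."""
    return 0.5 * (x.abs() - (x - alpha).abs() + alpha)

alpha = torch.tensor(6.0, requires_grad=True)   # clipping level, learned per layer
x = torch.linspace(-2.0, 10.0, 7, requires_grad=True)
pact(x, alpha).sum().backward()
print(alpha.grad)  # counts the inputs with x >= alpha, matching dy/dalpha above
```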

5.4 Modify the Neural Network Architecture

  1. Widen the neural network to compensate for the loss of information due to quantization

    ex. Double the channels, reduce the quantization precision

    Reference. WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
  2. Replace a single floating-point convolution with multiple binary convolutions.

    • Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
    • Quantization [Neural Network Distiller]

5.5 No Quantization on First and Last Layer

  • Because these layers are more sensitive to quantization and make up a small portion of the overall computation
  • Quantizing these layers to 8-bit integers does not reduce accuracy

5.6 Iterative Quantization: Incremental Network Quantization

  • Reference. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]

Reference. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
  • Setting
    • Weight quantization only
    • Quantize weights to $2^n$ for faster computation (bit shift instead of multiply)
  • Algorithm
    • Start from a pre-trained fp32 model
    • For the remaining fp32 weights
      • Partition into two disjoint groups(e.g., according to magnitude)
      • Quantize the first group (higher magnitude), and re-train the other group to recover accuracy
    • Repeat until all the weights are quantized (a popular stride is {50%, 75%, 87.5%, 100%})

    Reference. MIT-TinyML-lecture06-Quantization-2
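A sketch of one iteration of the algorithm above (magnitude-based partition, powers-of-two levels; rounding the exponent is a simplification of the paper's level set):

```python
import numpy as np

def inq_step(W, frac_quantized):
    """Freeze the largest-magnitude fraction of weights at powers of two;
    the rest stay floating point and keep training to recover accuracy."""
    k = max(1, int(frac_quantized * W.size))
    thresh = np.sort(np.abs(W).ravel())[-k]
    mask = np.abs(W) >= thresh                     # group to quantize/freeze
    W_q = W.copy()
    W_q[mask] = np.sign(W[mask]) * 2.0 ** np.round(np.log2(np.abs(W[mask])))
    return W_q, mask  # repeat with growing fraction: 50%, 75%, 87.5%, 100%
```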

6. Mixed-precision quantization

๋งˆ์ง€๋ง‰์œผ๋กœ ๋ ˆ์ด์–ด๋งˆ๋‹ค Quantization bit๋ฅผ ๋‹ค๋ฅด๊ฒŒ ๊ฐ€์ ธ๊ฐ€๋ฉด ์–ด๋–จ์ง€์— ๋Œ€ํ•ด์„œ ์ด์•ผ๊ธฐํ•œ๋‹ค. ํ•˜์ง€๋งŒ ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ 8bit ๋ณด๋‹ค ์ž‘๊ฑฐ๋‚˜ ๊ฐ™๊ฒŒ Quantization์„ ํ•  ์‹œ, weight์™€ activation๋กœ ๊ฒฝ์šฐ์˜ ์ˆ˜๋ฅผ ๊ณ ๋ ค๋ฅผ ํ•œ๋‹ค๋ฉด N๊ฐœ ๋ ˆ์ด์–ด์— ๋Œ€ํ•ด์„œ (8ร—8)N๋ผ๋Š” ์–ด๋งˆ์–ด๋งˆํ•œ ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ ๋‚˜์˜จ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด์— ๋Œ€ํ•ด์„œ๋Š” ๋‹ค์Œ ํŒŒํŠธ์— ๋‚˜๊ฐˆ Neural Architecture Search(NAS) ์—์„œ ๋‹ค๋ฃฐ ๋“ฏ ์‹ถ๋‹ค.

6.1 Uniform Quantization

6.2 Mixed-precision Quantization

6.3 Huge Design Space and Solution: Design Automation

  • Design space: 8 × 8 = 64 choices per layer → $64^N$ overall

    Reference. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
  • Result in Mixed-Precision Quantized MobileNetV1

    Reference. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
    • This paper compares model size, latency, and energy

The last point mentions that on edge versus cloud hardware, the convolution layers that end up more or less quantized differ: depthwise and pointwise, respectively. Understanding this in more depth probably has to wait until we move on to NAS.

  • Quantization Policy for Edge and Cloud

    Reference. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

7. Reference