๐Ÿง‘โ€๐Ÿซ Lecture 14

lecture
transformer
vision transformer
Vision Transformer for TinyML
Author

Seunghyun Oh

Published

April 26, 2024

Vision Transformer (ViT) for TinyML

์ด๋ฒˆ ์‹œ๊ฐ„์€ Transformer ๋ชจ๋ธ์—์„œ๋„ Vision์— ์ฃผ๋กœ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉํ•˜๋Š”์ง€ ์•Œ์•„๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ๊ฒฝ๋Ÿ‰ํ™”ํ•˜๊ฑฐ๋‚˜ ๊ฐ€์†ํ™”ํ•˜๋Š” ๊ธฐ๋ฒ•, ๊ทธ๋ฆฌ๊ณ  ์ œํ•œ๋œ ๋ฆฌ์†Œ์Šค์—์„œ ์–ด๋–ป๊ฒŒ ์ž˜ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์„์ง€ ์•Œ์•„๋ด…์‹œ๋‹ค.

1. Basics of Vision Transformer (ViT)

What is a Vision Transformer? In the language models widely used as LLMs, tokens come in as input and pass through a Transformer, and the encoder/decoder composition defines the family: BERT (encoder-only), GPT (decoder-only), and BART or T5 (encoder-decoder). So how do we use this architecture for vision?

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

์ƒ๊ฐ๋ณด๋‹ค ๊ฐ„๋‹จํ•ด์š”. ์ด๋ฏธ์ง€๊ฐ€ ๋งŒ์•ฝ 96x96์ด ์žˆ๋‹ค๋ฉด ์ด๋ฆ„ 32x32 ์ด๋ฏธ์ง€ 9๊ฐœ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ๋‚˜๋ˆˆ ์ด๋ฏธ์ง€๋ฅผ Patch๋ผ๊ณ  ๋ถ€๋ฅผ๊ฒŒ์š”. ๊ทธ๋Ÿผ ์ด Patch๋ฅผ Linear Projection์„ ํ†ตํ•ด์„œ ํ† ํฐ์ฒ˜๋Ÿผ 768๊ฐœ์˜ Vision Transformer(ViT)์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ์•„์ด๋””์–ด๋ฅผ ๊ตฌํ˜„ํ•  ๋•Œ๋Š” 32x32 Convolution ๋ ˆ์ด์–ด์— stride 32, padding 0, ์ž…๋ ฅ ์ฑ„๋„ 3, ์ถœ๋ ฅ ์ฑ„๋„ 768๋กœ ์—ฐ์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๋‹ค์Œ์€, ์ž…๋ ฅ์ด ๋™์ผํ•ด ์กŒ์œผ๋‹ˆ ๋ชจ๋ธ ๊ตฌ์กฐ๋Š” ํ”ํžˆ ๋ณด๋Š” ์•„๋ž˜ ๊ทธ๋ฆผ์˜ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
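The strided-convolution patch embedding described above can be sketched in a few lines of PyTorch (shapes follow the 96x96 / 32x32 example; the variable names are mine):

```python
# Sketch of ViT patch embedding as a strided convolution:
# a 96x96 RGB image with 32x32 patches yields 9 tokens of dimension 768.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=32, stride=32, padding=0)

img = torch.randn(1, 3, 96, 96)             # one 96x96 RGB image
tokens = patch_embed(img)                   # -> (1, 768, 3, 3): a 3x3 patch grid
tokens = tokens.flatten(2).transpose(1, 2)  # -> (1, 9, 768): 9 patch tokens
```

Each of the 9 rows is one patch's 768-dimensional embedding, ready to be fed into the Transformer encoder (after adding a class token and position embeddings).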

Here the patch size is itself a tunable parameter. Whenever we talk about a ViT from now on, keep a close eye on the patch size.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

๊ทธ๋Ÿฐ๋ฐ ์™œ Vision Transformer๋ฅผ ์“ธ๊นŒ์š”? ๊ธฐ์กด์— CNN ๊ตฌ์กฐ์— ResNet์ด๋‚˜ MobileNet์˜ ๊ตฌ์กฐ๋„ ์ถฉ๋ถ„ํžˆ ์„ฑ๋Šฅ์ด ๊ดœ์ฐฎ์ง€ ์•Š๋‚˜์š”? CNN๊ณผ Transformer๋ฅผ Vision task์—์„œ ๋น„๊ตํ•ด๋ณด๋ฉด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์ˆ˜๊ฐ€ ์ ์„ ๋•Œ๋Š” ํ™•์‹คํžˆ CNN์ด ๊ฐ•์„ธ๋ฅผ ๋ณด์ด์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

But. Look at the case with 300M training samples: the more data there is, the more decisively ViT outperforms CNNs. This is why we couldn't help falling for ViT.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

Vision์— ๋Œ€ํ•œ Application์œผ๋กœ Medical Image Segmentation, Super Resolution, Autonomous Driving, Segmentation๋กœ์จ ์ฃผ๋กœ ์‚ฌ์šฉํ•ด์š”.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

๋ฌธ์ œ๋Š” ์ด ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜๋“ค์€ ๋ชจ๋‘ High resolution์— prediction์„ ์š”๊ตฌํ•˜์ง€๋งŒ, ViT๋Š” Input resolution์ด ๋†’์•„์งˆ ๋•Œ๋งˆ๋‹ค ์—ฐ์‚ฐ๋Ÿ‰์ด ์–ด๋งˆ์–ด๋งˆํ•ด์ง‘๋‹ˆ๋‹ค. ์„ ํ˜•์ ์ด๋ผ๊ธฐ ๋ณด๋‹จ ์ง€์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ ์ฒ˜๋Ÿผ ๋ณด์ด๋„ค์š”. ๋ฐ”๋กœ ์ด ๋ฌธ์ œ ๋•Œ๋ฌธ์—, ์šฐ๋ฆฌ๊ฐ€ โ€œEfficient and Accelerationโ€์— ๋Œ€ํ•ด์„œ ๊ณ ๋ฏผํ•  ์ˆ˜ ๋ฐ–์— ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

*mIoU: mean Intersection over Union; GMACs: giga multiply-accumulate operations

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
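The scaling argument above can be made concrete with a rough cost model (an illustration with assumed patch size and dimension, not the lecture's exact numbers):

```python
# Rough self-attention cost model: with patch size P, an HxW input yields
# n = (H/P)*(W/P) tokens, and the attention matmuls do on the order of
# n^2 * d work. Doubling the resolution therefore multiplies the cost ~16x.
def attention_cost(h, w, patch=16, dim=768):
    n = (h // patch) * (w // patch)   # number of patch tokens
    return n * n * dim                # QK^T and attention-V matmuls ~ n^2 * d

ratio = attention_cost(448, 448) / attention_cost(224, 224)
# 2x resolution -> 4x tokens -> 16x attention compute
```

This quadratic-on-quadratic blowup is what the following efficiency techniques attack.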

2. Efficient ViT & acceleration techniques

2.1 Window attention

์ฒ˜์Œ์œผ๋กœ ์†Œ๊ฐœํ•  ๊ธฐ์ˆ ์€ Window attention ์ž…๋‹ˆ๋‹ค. ๊ธฐ์กด์— attention์€ ๋ ˆ์ด์–ด๋งˆ๋‹ค patch์˜ ํฌ๊ธฐ๊ฐ€ ๋™์ผํ•˜๊ฒŒ ๋“ค์–ด๊ฐ€๊ฒ ์ฃ . ํ•˜์ง€๋งŒ Window attention์€ ๋ ˆ์ด์–ด๋งˆ๋‹ค patch์˜ ํฌ๊ธฐ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ทธ.๋ฆฌ.๊ณ . ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•œ ๊ฑด Window attention์€ ๊ทธ Patch์•ˆ์— ๋‹ค์‹œ Patch๋ฅผ ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ํ†ตํ•ด ๋ณ‘๋ ฌ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฑฐ์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
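A minimal sketch of the window partitioning step (sizes are hypothetical): split an (H x W) grid of patch tokens into non-overlapping windows so that attention can run independently, and in parallel, per window.

```python
# Partition a (B, H, W, C) grid of patch tokens into non-overlapping
# windows of size ws x ws, producing (B * num_windows, ws*ws, C) so that
# attention is computed independently inside each window.
import torch

def window_partition(x, ws):
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
    return x

tokens = torch.randn(1, 8, 8, 96)         # 8x8 grid of 96-dim tokens
windows = window_partition(tokens, ws=4)  # 4 windows of 16 tokens each
```

Attention over a ws*ws window costs (ws*ws)^2 per window, so total cost grows linearly with the number of windows rather than quadratically with the full token count.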

The problem with computing attention this way, though, is that no information is exchanged between windows. This is solved by shifting the windows from layer to layer ("shifted windows").

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

2.2 Linear attention

The second technique to introduce is linear attention. Remember that standard attention contains a softmax? If you ever implement the exponential yourself, you'll find it is surprisingly costly. So the softmax is replaced with a ReLU-based kernel. Once the softmax is gone, the matrix products become associative across the whole expression, and the part with O(\(n^2\)) algorithmic complexity can be reorganized to run in O(\(n\)). If it's not obvious why the complexity drops, note that the output after scaling is \(n \times d\), and compare the token count n against the feature dimension d: you'll quickly be convinced!

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
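The associativity trick can be sketched as follows (shapes are illustrative; this is a minimal version of the idea, not any particular paper's exact formulation): instead of forming the n x n matrix softmax(QK^T), compute phi(K)^T V first, which is only d x d.

```python
# Sketch of ReLU linear attention: replace softmax(QK^T)V with
# phi(Q) @ (phi(K)^T @ V) where phi = ReLU. By associativity the cost is
# O(n * d^2) instead of O(n^2 * d) -- no n x n matrix is ever formed.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, n, d) tensors."""
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(1, 2) @ v             # (B, d, d): the small matrix
    # Row-wise normalizer replacing the softmax denominator:
    z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + eps)
    return (q @ kv) * z                    # (B, n, d)

q = k = v = torch.randn(2, 196, 64)
out = linear_attention(q, k, v)            # (2, 196, 64)
```

Since n (number of tokens, e.g. 196) is usually much larger than d (head dimension, e.g. 64), computing the d x d product first is the win.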

ํ•˜์ง€๋งŒ ์—ฌ๊ธฐ์„œ๋„ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. attention์—์„œ ์„ฑ๋Šฅ์„ ๋ณด๋Š” ๋ฐฉ๋ฒ•์ค‘์— attention map์„ ํ†ตํ•ด ๋ณด๋ฉด Linear Attention์ด ์‚ฌ์ง„์˜ ํŠน์ง•์„ ์ž˜ ๋ชป์žก์•„ ๋ƒ…๋‹ˆ๋‹ค.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

Naturally, since its distribution is less sharp than softmax's, the salient regions don't stand out, accuracy suffers, and "multi-scale" learning becomes difficult.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

So we need a fix, right? The remedy is simply to add one more layer. And the performance actually ends up much better than before.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

2.3 Sparse attention

์„ธ ๋ฒˆ์งธ ๊ธฐ์ˆ ์„ ์†Œ๊ฐœ๋“œ๋ฆฌ๊ธฐ ์ „์—, Vision Application์„ ํ•˜๋‹ค ๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ์ƒํ™ฉ์ด ๋งŽ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€์˜ ํ•ด์ƒ๋„๋ฅผ ์ค„์ด๊ฑฐ๋‚˜, Pruning์„ ํ†ตํ•ด ์ด๋ฏธ์ง€๊ฐ€ ํŠน์ •๋ถ€๋ถ„๋งŒ ๋“ค์–ด์˜ค๊ฒŒ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

That is where a technique called sparse attention comes in. Much like pruning, it computes an importance score for each patch and ranks them.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
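A hypothetical sketch of the score-and-rank step: the scoring rule here (average attention each token receives) is my own assumption for illustration; the actual method ranks patches by its own importance metric.

```python
# Importance-based token pruning sketch: score each patch token (here by
# the mean attention it receives -- an assumed metric) and keep only the
# top-k, so later layers process a shorter sequence.
import torch

def prune_tokens(tokens, attn, keep_ratio=0.5):
    """tokens: (B, n, d); attn: (B, n, n) attention map."""
    score = attn.mean(dim=1)                  # (B, n): avg attention received
    k = int(tokens.size(1) * keep_ratio)
    idx = score.topk(k, dim=1).indices        # indices of the important tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)              # (B, k, d)

tokens = torch.randn(1, 196, 768)
attn = torch.rand(1, 196, 196)
kept = prune_tokens(tokens, attn)             # half the tokens survive
```

Halving the sequence length cuts the quadratic attention cost to roughly a quarter.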

๊ทธ๋ฆฌ๊ณ  ์ค„์„ ์„ธ์šด Patch์—์„œ N๋ฒˆ์˜ ๋ฐ˜๋ณตํ•˜๋Š” fine-tuning์„ ํ†ตํ•ด ๋ชจ๋ธ์„ ์žฌํ•™์Šต์‹œํ‚ค์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

Here, evolutionary search is used to find a combination that satisfies the constraints (see Lab 3 for evolutionary search).

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

์„ฑ๋Šฅ์ด ๊ถ๊ธˆํ•˜๋‹ค๋ฉด ์ด ๋…ผ๋ฌธ์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š” :)

3. Self-supervised learning for ViT

That wraps up the efficiency and acceleration techniques. Now, going back to the beginning: do you remember ViT's first performance graph, the part where it needed an enormous amount of data (reproduced below)? Realistically, collecting that much data is often extremely hard; medical data is the classic example. So how can we deal with this shortage of data?

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

3.1 Contrastive learning

์ฒซ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ Contrastive learning ์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ TinyML์—์„œ ๋ฟ ์•„๋‹ˆ๋ผ ๋งŽ์ด ์“ฐ์ด๋Š” ๋ฐฉ๋ฒ•์ธ๋ฐ, Positive Sample๊ณผ Negative Sample์„ ๊ฐ€์ง€๊ณ  embedding vector๋ฅผ ๋ฉ€๊ฒŒ ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๊ณ ์–‘์ด ์‚ฌ์ง„์„ ๊ตฌ๋ถ„ํ•˜๋Š” ํ…Œ์Šคํฌ์—์„œ Positive Sample์€ ๊ณ ์–‘์ด ์‚ฌ์ง„์ด ๋  ๊ฒƒ์ด๊ณ , Negative Sample์€ ์—ฌ๊ธฐ์„œ ๊ฐ•์•„์ง€ ์‚ฌ์ง„์ด ๋  ๊ฒ๋‹ˆ๋‹ค.

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai
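The pull-together/push-apart objective can be sketched as a minimal InfoNCE-style loss (an illustrative version, not the exact loss of any one paper):

```python
# Minimal InfoNCE-style contrastive loss: each anchor in z1 is pulled
# toward its positive in z2 (same row) and pushed away from the other
# rows of the batch, which act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, d) embeddings of two views; z2[i] is the positive for z1[i]."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature     # (B, B) cosine-similarity matrix
    labels = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```

Minimizing this cross-entropy maximizes each anchor's similarity to its positive relative to all in-batch negatives.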

Looking at the experimental results, the purely supervised approach performs poorly on the CIFAR-100, Oxford Flowers-102, and Oxford-IIIT Pets datasets, whereas a model trained with this method reaches reasonable performance.

Reference. An Empirical Study of Training Self-Supervised Vision Transformers [Chen et al., 2021]

Contrastive learning can also be used for multi-modal training. The paper below is designed to take both text and images as input.

Reference. Learning Transferable Visual Models From Natural Language Supervision [Radford et al., 2021]

3.2 Masked image modeling

๋‘๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ Mask ์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋ณด์‹œ์ฃ .

Reference. MIT-TinyML lecture14 Vision Transformer in https://efficientml.ai

BERT ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. Mask๋ฐฉ๋ฒ•์€ ์ž…๋ ฅ ํ† ํฐ์— ๋งˆ์Šคํฌ๋ฅผ ์”Œ์›Œ ์ถœ๋ ฅ์—์„œ ์ด๋ฅผ ๋งž์ถ”๋Š” ํ…Œ์Šคํฌ๋กœ ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿผ Vision์€ ์–ด๋–ป๊ฒŒ ํ• ๊นŒ์š”?

Reference. Masked Autoencoders Are Scalable Vision Learners [He et al., 2022]

Training works much like in LLMs: mask part of the input and train the model to predict it. For ViT, though, there is both an encoder and a decoder, and the paper emphasizes that the decoder is much smaller than the encoder. The interesting part is the masking ratio: BERT masks 15%, while in the experiments below ViT masks as much as 75%. The lecture notes explain this as "images have lower information density than language."

Reference. Masked Autoencoders Are Scalable Vision Learners [He et al., 2022]
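The 75% random masking described above can be sketched as follows (a minimal version of the masking step, with assumed shapes):

```python
# MAE-style random masking sketch: keep only a random 25% of the patch
# tokens for the encoder; a lightweight decoder would later reconstruct
# the masked 75%.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, n, d) -> (visible tokens (B, n_keep, d), kept indices)."""
    B, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(B, n)                # one random score per token
    keep = noise.argsort(dim=1)[:, :n_keep] # lowest-noise tokens survive
    visible = tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep

tokens = torch.randn(2, 196, 768)
visible, keep = random_masking(tokens)      # only a quarter of the tokens remain
```

Because the encoder only ever sees the visible 25% of tokens, pre-training compute drops sharply as well.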


4. Multi-modal LLM

Finally, the lecture mentions multi-modal LLMs, but the details aren't covered; for those who are curious, the papers are left as links and I'll skip the explanation. That's it for Vision Transformers. The heart of this lecture was how, for a model similar in size to an LLM, to deal with the high compute cost and the shortage of data. Next time we'll be back with GANs, video, and point clouds :D