🧑‍🏫 Lecture 12-13

lecture
Transformer
LLM
Transformer and LLM
Author

Gijeong Seong

Published

April 19, 2024

This post introduces the transformer and the LLM, the core tools of modern NLP, along with several ways to optimize them. For optimizing the transformer itself, it covers changes to positional embeddings and techniques such as the grouped-query attention used in llama2. It then covers algorithms and systems for efficient LLM inference and fine-tuning: vLLM and StreamingLLM on the inference side, and LoRA, QLoRA, and Adapter-style methods on the fine-tuning side.

1. Transformer basics

๊ฐ•์˜์—์„œ๋Š” transformer ๊ตฌ์กฐ์˜ tokenizer, encoding, attention๋“ฑ์— ๋Œ€ํ•ด ๊ฐ„๋žตํžˆ ์†Œ๊ฐœํ•˜๊ณ  ์žˆ์ง€๋งŒ ์ด ๊ธ€์€ transformer/LLM์˜ ์ตœ์ ํ™”์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ  ์žˆ๊ณ , transformer ์ž์ฒด์˜ ๊ตฌ์กฐ๋ฅผ ๋‹ค๋ฃจ๋ฉด ๊ธ€์ด ๋„ˆ๋ฌด ๊ธธ์–ด์งˆ ๊ฒƒ ๊ฐ™์•„์„œ ์ž์„ธํ•œ ๊ตฌ์กฐ๋Š” ์•„๋ž˜์˜ ๋‘ ๋งํฌ๋ฅผ ์ฐธ๊ณ  ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค

https://wikidocs.net/31379

https://blogs.nvidia.co.kr/blog/what-is-a-transformer-model/

2. Transformer design variants

This section covers several techniques that have improved the transformer since the original model ("Attention Is All You Need").

Absolute positional encoding -> Relative positional encoding

The original transformer uses sinusoidal embeddings as its positional embeddings. These produce embedding vectors that are distinct for each position yet vary continuously with position.
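To make this concrete, here is a small numpy sketch of the sinusoidal encoding (my own illustration, not code from the lecture): each position gets a vector whose dimension pairs oscillate at different frequencies.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal position embeddings."""
    positions = np.arange(num_positions)[:, None]            # (P, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)    # one frequency per dim pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=256, d_model=64)
print(pe.shape)  # (256, 64); added to the input token embeddings
```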

However, this index-dependent absolute positional encoding struggles with sequence lengths never seen during training: for example, when the model was trained only up to 250 tokens and an input of 251 tokens arrives.

Relative positional encoding makes "train short, test long" possible. In addition, absolute positional encoding adds the position information to the input embedding and therefore affects all of Q/K/V, whereas relative positional encoding adds a bias to Q and K and thus only affects the attention scores (V is untouched).

Attention with Linear Biases (ALiBi)

The simplest approach is ALiBi. It just adds an offset to the attention matrix based on the distance between the query and the key.
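A rough sketch of that idea (illustrative only; the slope schedule follows the ALiBi paper for head counts that are powers of two):

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Build the (num_heads, seq_len, seq_len) linear bias added to the attention logits."""
    # Geometric per-head slopes, as in the ALiBi paper (assumes num_heads is a power of 2).
    slopes = np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]          # (L, L): query index minus key index
    bias = -slopes[:, None, None] * distance[None]  # farther keys get a larger penalty
    return bias                                     # in a decoder this is combined with the causal mask

# Usage: logits = q @ k.T / sqrt(d) + alibi_bias(L, H); then softmax over the last axis.
print(alibi_bias(seq_len=8, num_heads=4).shape)  # (4, 8, 8)
```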

Rotary Positional Embedding (RoPE)

Another approach is RoPE, which is widely used (it is even used in llama2). The idea of RoPE is to encode position information through rotation: the d-dimensional word embedding is split into d/2 pairs, each pair is treated as a 2D coordinate, and that coordinate is rotated by an angle proportional to the position. In the figure above, x1 and x2 are rotated by an angle of m·θ. Written out as an equation, RoPE looks like the formula above.
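Roughly in code (a simplified sketch under my own naming, applied to one head's query or key vectors):

```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) dimension pair of x by an angle that grows with position.

    x: (seq_len, d) query or key vectors; d must be even.
    """
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)         # one frequency per dimension pair
    m = np.arange(seq_len)[:, None]                   # positions (seq_len, 1)
    angles = m * theta[None, :]                       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # standard 2D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = np.random.randn(16, 64)
print(apply_rope(q).shape)  # (16, 64) -- applied to Q and K before the dot product
```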

LLMs are usually trained with a limited context length; for example, llama is trained on 2k-token data and llama2 on 4k. Thanks to RoPE, however, longer contexts can still be handled: using a smaller θ interpolates the positions more densely and stretches the usable context length. In the figure above, naively decoding at a 4096-token context fails because that length is unseen, but in the lower graph, where θ is halved, the positions fall back inside the original context range and decoding succeeds.
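One simple way to realize this "smaller θ / denser interpolation" idea is to scale the positions so that a longer input maps back into the trained range (position interpolation). A sketch with made-up lengths, under my own naming:

```python
import numpy as np

def rope_angles(positions: np.ndarray, d: int, base: float = 10000.0, scale: float = 1.0):
    """RoPE angles; scale < 1 compresses positions (train short, test long)."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return (positions * scale)[:, None] * theta[None, :]

train_len, test_len = 2048, 4096
plain = rope_angles(np.arange(test_len), d=64)                              # unseen angle range
interp = rope_angles(np.arange(test_len), d=64, scale=train_len / test_len) # squeezed back in
print(plain[:, 0].max(), interp[:, 0].max())  # ~4095 vs ~2047.5: back inside the trained range
```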

KV cache optimizations (Multi-Head Attention (MHA) -> Multi-Query Attention (MQA) -> Grouped-Query Attention (GQA))

The KV cache stores the Keys and Values of the attention mechanism. When decoding with a transformer (GPT-style generation with a decoder model), computing the attention of the current token requires the Key and Value of every previous token, so they all have to be kept around. In the figure above, generating the token "trainium" requires the K and V of the earlier tokens "I" and "love" (the Queries are not needed). Even a quick estimate shows that the memory needed to hold the KV cache is huge: for llama2-7B the KV cache takes batch_size × 32 (layers) × 4096 (hidden dim = 32 heads × 128 head dim) × N (sequence length) × 2 (both K and V) × 2 bytes (fp16) = 512 KB × batch_size × N. Doing the same calculation for llama2-70B, with batch size 16 the KV cache reaches about 160 GB by the 4096th token. The KV cache size therefore needs to be reduced, and multi-query attention (MQA) and grouped-query attention (GQA) are the ways to do it; GQA in particular is widely used and is applied in llama2. Looking at each method:

MQA: all Key/Value heads are reduced to a single shared head (the existing heads are mean-pooled into one).

GQA: the Key/Value heads are mean-pooled into G groups (typically G = N/8, where N is the number of query heads).

As the figure above shows, MQA and GQA can shrink the KV cache considerably; a small size calculator follows below.
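To make the memory math above concrete, here is a small calculator (my own sketch; the layer/head counts are the published llama2 configurations) comparing MHA and GQA KV-cache sizes:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Total KV-cache size: 2 tensors (K and V) per layer, per KV head, per token, in fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

GB = 1024 ** 3
# llama2-7B (MHA): 32 layers, 32 KV heads x 128 dims -> 512 KB per token per sequence
print(kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1) / 1024, "KB per token")   # 512.0
# llama2-70B without GQA (64 KV heads): ~160 GB at batch 16 and 4096 tokens
print(kv_cache_bytes(80, 64, 128, seq_len=4096, batch_size=16) / GB, "GB")           # 160.0
# llama2-70B with GQA (8 KV heads): 8x smaller
print(kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=16) / GB, "GB")            # 20.0
```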

FFN -> GLU

Replacing the original FFN layer (an inverted bottleneck with ReLU) with a GLU (Gated Linear Unit) using the Swish activation, i.e., SwiGLU, is reported to improve quality, measured here by perplexity (PPL).
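A minimal SwiGLU FFN sketch in PyTorch (the names w1/w2/w3 are my own, following common llama-style implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: out = W2( silu(W1 x) * (W3 x) ), as used in llama-style models."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))     # silu == swish

ffn = SwiGLUFFN(d_model=512, d_hidden=1376)  # hidden is often ~(8/3)*d to keep params comparable
print(ffn(torch.randn(2, 10, 512)).shape)    # torch.Size([2, 10, 512])
```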

3. Large language models (LLMs)

There is already plenty of writing about what LLMs can do and what kinds exist, so Section 3 only briefly describes some of their characteristics. One of the more surprising properties of LLMs is that, as the model size grows, abilities for particular tasks emerge at some point: the model can do arithmetic that fits the input context, or unscramble shuffled letters in a word.

Also, in the earlier NLP era a model had to be fine-tuned for each downstream task, but an LLM can solve downstream tasks zero-shot or few-shot, without fine-tuning.

To briefly summarize some prominent LLMs and what makes each distinctive: llama applies SwiGLU, llama2 greatly increases the number of training tokens, falcon stands out for its huge 180B model size, and mistral for its sliding-window attention.

The Chinchilla Law

The Chinchilla law says that to find the best computation-accuracy trade-off you have to scale not just the model size but also the amount of training data. (It does not mean that more data is always better; it means there is an optimal model size for a given amount of data, and vice versa.) llama2 achieves strong performance with relatively few parameters and a large number of training tokens.
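As a rough sanity check of the arithmetic, the commonly cited rule of thumb derived from the Chinchilla paper is on the order of 20 training tokens per parameter (a simplification of the fitted scaling laws, used here only for illustration):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Very rough compute-optimal token count under the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(70e9) / 1e12)  # ~1.4T tokens for a 70B model (Chinchilla itself)
print(chinchilla_optimal_tokens(7e9) / 1e12)   # ~0.14T; llama2-7B trains on ~2T tokens, far past
                                               # "optimal", trading training compute for a small, strong model
```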

4. Advanced topics, multi-modal LLM

This is covered in more detail in a later lecture, so I skip it in this summary.

5. Efficient inference algorithms for LLMs

Earlier lectures covered quantization and pruning as ways to make inference efficient, and these methods can be applied to LLMs as well.

5.1. Quantization: SmoothQuant, AWQ, TinyChat

However, naively quantizing to W8A8 causes a large drop in quality. The reason is that in LLMs the activation outliers play an important role: as shown on the right of the figure above, activations contain very large outliers while the weights are comparatively flat. So if we divide the activations by 10 and multiply the weights by 10, the result of the matrix product is unchanged, but the activations become much easier to quantize (right-hand figure). Because this flattens (smooths) the activations, the method is called SmoothQuant. SmoothQuant also works very well on llama models.
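A toy version of the smoothing step (my own sketch; the per-channel scale formula and alpha follow the SmoothQuant paper, but this is not the official implementation):

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """SmoothQuant-style migration: Y = X @ W == (X / s) @ (s[:, None] * W).

    X: (tokens, c_in) calibration activations, W: (c_in, c_out) weights.
    Per input channel, the scale s moves quantization difficulty from X to W.
    """
    act_max = np.abs(X).max(axis=0)            # per-channel activation range (has outliers)
    w_max = np.abs(W).max(axis=1)              # per-channel weight range (flat)
    s = (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    return X / s, W * s[:, None]               # activations get easier; weights absorb the scale

X, W = np.random.randn(64, 512), np.random.randn(512, 512)
X[:, 7] *= 50.0                                # inject an activation outlier channel
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))             # True: the math is unchanged
print(np.abs(X).max(), np.abs(Xs).max())       # the outlier magnitude shrinks after smoothing
```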

The x-axis of the figure above, compute intensity, is FLOPs per byte of memory moved, i.e., a measure of how much computation you get out of each piece of data fetched. The reason TFLOPS is low at batch size 1 is memory: generating every token in an LLM requires a large memory fetch (fetching the parameters). Between activations and weights, the weights are far larger, so reducing the weights is where the effort should go.

์œ„์—์„œ ์‚ดํŽด๋ณธ W8A8 ๋ฐฉ์‹์˜ quantization์€ batch serving(ํ•œ๋ฒˆ์— ์—ฌ๋Ÿฌ batch๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ผ)์—์„œ๋Š” ์ž˜ ๋™์ž‘ํ•œ๋‹ค. ํ•˜๋‚˜๋งŒ ์ฒ˜๋ฆฌํ•˜๋Š” ์ž‘์—…์€(single-batch) memory-bounded(๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•˜๋ฉด bottleneck์ด ๋œ๋‹ค)์ด๋‹ค. ๋‹น์—ฐํžˆ weight๋ฅผ ๋ฐ”๋กœ quantize ํ•˜๋ฉด ์œ„ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค ์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ์„ ์‚ดํŽด๋ณด๋ฉด, RTN๋ฐฉ์‹์„ ๋‹จ์ˆœํžˆ ์ ์šฉํ•œ ๊ฒฝ์šฐ Perplexity๊ฐ€ ๋งŽ์ด ์ƒ์Šนํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋” ๋‚˜์€ ๋ฐฉ๋ฒ•์€, ์ค‘์š”ํ•œ(salient) weight๋“ค๋งŒ quantize ํ•˜์ง€ ์•Š๊ณ  ๋‘๋Š” ๊ฒƒ์ธ๋ฐ, salient ํ•˜๋‹ค๊ณ  ํŒ๋‹จํ•˜๋Š” ๊ธฐ์ค€์„ โ€œactivationโ€๊ฐ’์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒ๋‹จํ•  ๋•Œ(magnitude-base, ๋‹จ์ˆœํžˆ ์ ˆ๋Œ“๊ฐ’์ด ํฌ๋ฉด ์ค‘์š”ํ•˜๋‹ค๊ณ  ํŒ๋‹จ) ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์ด๋Ÿฐ ๋ฐฉ์‹์„ AWQ(Activation-aware Weight Quantization) ๋ผ๊ณ  ํ•œ๋‹ค.

SmoothQuant and AWQ are both widely used today.

5.2. Pruning/sparsity: SpAtten, H2O, MoE

Once you have quantized, you should also prune. Several LLM pruning/sparsity methods (a minimal MoE router sketch follows at the end of this list):

Wanda: like AWQ, prunes weights by considering both the weights and the activations.

SpAtten: removes unimportant tokens entirely; based on the attention map on the right, the token with the lowest cumulative attention is pruned.

H2O: keeps the Heavy Hitter tokens (H2) and prunes the rest, where the heavy hitters are selected based on attention. As I understand it, this is similar in spirit to SpAtten.

DejaVu: hypothesizes that some attention heads (and MLP neurons) are simply not used for a given input, calls this pattern contextual sparsity, and predicts it with a small MLP; DejaVu then skips the parts predicted to be inactive.

MoE (Mixture of Experts): splits the FFN into N expert FFNs and picks one of them per token; the Router in the middle of the figure stochastically decides which expert to use. MoE is reported to be used in GPT-4.
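A minimal top-1 MoE router sketch (illustrative only; real MoE layers add load-balancing losses, capacity limits, and usually top-2 routing):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy mixture-of-experts FFN: a router picks one expert FFN per token."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)               # (tokens, n_experts)
        top1 = probs.argmax(dim=-1)                             # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # gate value keeps the router trainable; only the chosen expert runs per token,
                # so compute stays roughly one FFN per token regardless of n_experts
                out[mask] = probs[mask, e : e + 1] * expert(x[mask])
        return out

moe = Top1MoE(d_model=256, d_hidden=1024, n_experts=8)
print(moe(torch.randn(32, 256)).shape)  # torch.Size([32, 256])
```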

6. Efficient inference systems for LLMs

This section looks at how to run LLM inference more efficiently from a systems point of view.

6.1. vLLM (Paged Attention)

What becomes the problem when many users share an LLM? As in the figure above, we do not know in advance how long the LLM's output will be, so we cannot know how much memory to allocate. This leads to internal fragmentation (the <resv> reserved slots) and to external fragmentation from the gaps between different requests. It looks just like memory in a real operating system. Interestingly, the problem can then be solved the same way operating systems solved it: with pages.

Just as the OS manages memory across processes in page units, an LLM server can manage the KV cache across requests in page units. As shown above, different requests can be served page by page.

๋” ๋†€๋ผ์šด ์ ์€, ํ•˜๋‚˜์˜ KV Cache๋ฅผ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค. ์•ž ๋ฌธ์žฅ์„ ๊ณต์œ ํ•˜๊ฑฐ๋‚˜, ์•„๋‹ˆ๋ฉด Prompt๊ฐ™์ด ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๋ฌธ์žฅ์˜ KV cache๋ฅผ ๊ณต์œ ํ•ด ํšจ์œจ์ ์œผ๋กœ ๋Œ€๋Ÿ‰์˜ inference๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

This scheme is called Paged Attention, and vLLM is the system built on it.
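A toy sketch of the block-table bookkeeping behind this (all names are my own, not vLLM's API): KV tensors live in fixed-size physical blocks, each request keeps a list of block ids, so memory is allocated on demand and blocks can be shared between requests.

```python
BLOCK_SIZE = 16  # tokens per KV block

class KVBlockManager:
    """Toy block table in the spirit of PagedAttention (sketch only)."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # pool of physical KV blocks
        self.block_tables = {}                       # request id -> list of physical block ids
        self.lengths = {}                            # request id -> number of cached tokens

    def append_token(self, req_id: str) -> int:
        """Reserve space for one more token's K/V; allocate a new block only when needed."""
        n = self.lengths.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if n % BLOCK_SIZE == 0:                      # current block is full (or first token)
            table.append(self.free_blocks.pop())     # grab a free physical block
        self.lengths[req_id] = n + 1
        return table[-1]                             # physical block holding this token's K/V

    def fork(self, src_id: str, dst_id: str):
        """Share a prefix: the new request reuses the same physical blocks (copy-on-write idea)."""
        self.block_tables[dst_id] = list(self.block_tables[src_id])
        self.lengths[dst_id] = self.lengths[src_id]

mgr = KVBlockManager(num_blocks=64)
for _ in range(40):                                  # a 40-token request uses ceil(40/16) = 3 blocks
    mgr.append_token("request-A")
mgr.fork("request-A", "request-B")                   # request-B shares A's prompt KV blocks
print(mgr.block_tables["request-A"], mgr.block_tables["request-B"])
```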

6.2. StreamingLLM

Another problem when deploying LLMs is length. Handling extremely long inputs, or remembering things said far back in a chatbot conversation, takes a huge amount of memory. With plain dense attention (the yellow curve) the memory grows linearly with length, and the perplexity explodes once the input passes 4K tokens, because those lengths were never seen in training. Windowed attention (keep only a fixed window of context, the green curve) uses constant memory, but the moment the input exceeds the window length (around 1K in the figure) the perplexity shoots up, because the first few tokens turn out to be very important. In the figure, (a) is the plain dense attention just described and (b) is windowed attention. (c) is sliding-window attention with recomputation, which recomputes the K/V of earlier tokens instead of caching them; its perplexity is fine, but the recomputation takes far too long.

The idea for solving this came from the Attention Sink phenomenon. In the figure above, the attention scores on the first token are very high, even when that token is not semantically important. Why does this happen? Attention uses a softmax, and during decoding the first token participates in every decoding step, so it naturally keeps accumulating some attention mass. The conclusion is that if the first token, where this attention sink forms, is always kept, windowed attention works much better. I could not find a fully rigorous explanation, but the first token seems to act as a "sink" where attention drains even when it is not semantically important. The ablation study reports that keeping 4 sink tokens, rather than one, works best on average.
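A toy sketch of the resulting cache policy, keeping a few sink tokens plus a sliding window (eviction logic only; the real StreamingLLM also re-assigns positions inside the cache when applying rotary embeddings):

```python
from collections import deque

class SinkKVCache:
    """Toy StreamingLLM-style cache: the first few 'sink' tokens plus a sliding window."""
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sink = []                          # kept forever (attention sinks)
        self.recent = deque(maxlen=window)      # the deque evicts the oldest token by itself

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def tokens(self):
        return self.sink + list(self.recent)

cache = SinkKVCache(n_sink=4, window=8)
for t in range(20):
    cache.append(t)                             # stand-in for the (K, V) of token t
print(cache.tokens())                           # [0, 1, 2, 3, 12, 13, ..., 19]: sinks + recent window
```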

6.3. FlashAttention

FlashAttention is a more hardware-oriented approach: the idea is to reduce the number of accesses to HBM (High Bandwidth Memory). Instead of loading the whole matrices for the attention computation, blocks are loaded one at a time (the "Copy Block to SRAM" part) and as much of the computation as possible is finished inside the GPU's SRAM. There is also a follow-up paper, FlashAttention-2, which incorporates the MQA/GQA techniques covered earlier.
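The trick that makes block-wise computation possible is the online softmax: attention can be accumulated key-block by key-block with running statistics, so the full attention matrix never has to be materialized. A plain numpy illustration (mathematically exact, but of course without the kernel fusion that makes the real thing fast):

```python
import numpy as np

def flash_attention_like(Q, K, V, block: int = 64):
    """Exact attention computed key-block by key-block with an online softmax."""
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)                 # unnormalized output accumulator
    row_max = np.full(L, -np.inf)          # running max of the logits per query row
    row_sum = np.zeros(L)                  # running softmax denominator per query row

    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        logits = (Q @ Kb.T) * scale                          # (L, block)
        new_max = np.maximum(row_max, logits.max(axis=1))
        correction = np.exp(row_max - new_max)               # rescale the old partial sums
        p = np.exp(logits - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

Q, K, V = (np.random.randn(128, 64) for _ in range(3))
logits = Q @ K.T / np.sqrt(64)
ref = np.exp(logits - logits.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
print(np.allclose(flash_attention_like(Q, K, V, block=32), ref))  # True: same result, fewer HBM reads
```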

6.4. Speculative decoding

LLM decoding is extremely memory-bound: every single generated token requires a huge memory fetch. Speculative decoding tackles this by having a small model draft K tokens, and then having the large model check those drafts and produce a correction where needed (for the large model, processing 1 token or K tokens in one pass costs roughly the same). To generate K tokens, instead of calling the large model K times you call the small model K times and the large model only once, which saves decoding time (roughly 2-3x).
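A toy skeleton of the control flow (the models here are stand-in functions of my own; the real acceptance rule compares draft and target token probabilities via rejection sampling so the output distribution matches the large model exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def draft_model(context):        # cheap model: called K times, one token at a time
    return int(rng.integers(0, VOCAB))

def target_model(context, k):    # expensive model: scores all k draft tokens in ONE forward pass
    return [rng.random() < 0.7 for _ in range(k)]   # stand-in for "does the large model agree?"

def speculative_decode(prompt, n_new_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new_tokens:
        drafts = []
        for _ in range(k):                               # K cheap draft calls
            drafts.append(draft_model(tokens + drafts))
        accepted = target_model(tokens, k)               # 1 expensive call verifies all K drafts
        for tok, ok in zip(drafts, accepted):
            if not ok:
                tokens.append(int(rng.integers(0, VOCAB)))   # large model supplies a correction
                break                                        # discard the rest of the draft
            tokens.append(tok)
    return tokens[len(prompt):][:n_new_tokens]

print(speculative_decode(prompt=[1, 2, 3], n_new_tokens=16))
```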

7. Efficient fine-tuning for LLMs

This section looks at how to fine-tune LLMs efficiently.

7.1. LoRA/QLoRA

LoRA does not update the whole model; it only updates the weights of a small bypass branch. Let W be the pretrained LLM weights and let delta W be the change those weights would undergo under full fine-tuning. The idea is to represent that delta W as the product of two low-rank matrices A and B (the orange matrices in the figure above).
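A minimal LoRA layer sketch in PyTorch (my own naming; the r and alpha values are just examples):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)               # pretrained W stays frozen
        self.A = nn.Linear(base.in_features, r, bias=False)  # down-projection A
        self.B = nn.Linear(r, base.out_features, bias=False) # up-projection B
        nn.init.zeros_(self.B.weight)                        # so training starts from the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable LoRA params vs ~16.8M frozen params in the base layer
```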

QLoRA is, simply put, LoRA plus quantization. It uses NF4 (NormalFloat4), a data type optimized for normally distributed weights, along with double quantization and paged optimizers.

7.2. Adapter

An Adapter inserts a small learnable block into the transformer block. In the figure above, the adapter is the yellow structure on the right. Because new layers are added, however, inference can become slightly slower.
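A minimal bottleneck-adapter sketch (my own naming; the bottleneck width is just an example):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a sublayer: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)                     # starts as an identity mapping

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))     # residual keeps the original path intact

adapter = Adapter(d_model=768)
print(adapter(torch.randn(2, 16, 768)).shape)  # only the small adapter weights are trained
```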

7.3. Prompt Tuning

์œ„์˜ ๋ฐฉ๋ฒ•๋“ค๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, tuning์—†์ด prompt๋งŒ ์ž…๋ ฅํ•ด์„œ ํŠน์ •ํ•œ task์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œ๋’ค์— ๋ฌธ์žฅ์„ ์š”์•ฝํ•ด์ค˜ :โ€ ๋ผ๋Š” ๋ฌธ์žฅ์„ ์ž…๋ ฅ์— ์ถ”๊ฐ€ํ•˜๋ฉด ์š”์•ฝ task์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด ์˜ฌ๋ผ๊ฐ„๋‹ค. ์ด๋ฅผ ํ™œ์š”ํ•˜๋ฉด, ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋กœ ์—ฌ๋Ÿฌ๊ฐ€์ง€ task์— ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜๋ฉฐ, ๋ชจ๋ธ์ด ์ปค์งˆ์ˆ˜๋ก ํ•ด๋‹น task์— ๋Œ€ํ•ด์„œ๋งŒ fine-tuningํ•œ ๋ชจ๋ธ์ด๋ž‘ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋‚ด๊ฒŒ ๋œ๋‹ค.