Transformer hyperparameter questions you might have had in 224n..
And other model setting questions
\[FFN(x) = max(0, xW_1 + b_1)W_2 + b_2\]
There are two dimensions:
convention: \(d_{ff} = 4 d_{ff}\)
Exception #1 - GLU varient
Remember that GLU variants scale down by \(2/3^{\text{rd}}\). This means most GLU variants have \(d_{ff} = \frac{8}{3}d_{model}\). This is mostly what happens. Some notable such examples.
| Model | \(d_{ff}/d_{model}\) |
|---|---|
| PaLM | 4 |
| Mistral 7B | 3.5 |
| LLaMA-2 70B | 3.5 |
| LLaMA 70B | 2.68 |
| Qwen 14B | 2.67 |
| DeepSeek 67B | 2.68 |
| Yi 34B | 2.85 |
Should my model be deep or wide? How deep and how wide?
Most models are surprisingly consistent on this too.
| Model | \(d_{model} / n_{layer}\) |
|---|---|
| BLOOM | 205 |
| T5 v1.1. | 171 |
| PaLM | 156 |
| GPT3/OPT/Mistral/Qwen | 128 |
| LLama/LLama2 | 102 |
将两张表格转换为 Markdown 格式:
Monolingual models - 30-50k vocab
| Model | Token count |
|---|---|
| Original transformer | 37000 |
| GPT | 40257 |
| GPT2/3 | 50257 |
| T5/T5v1.1 | 32128 |
| LLaMA | 32000 |
Multilingual / production systems 100-250k
| Model | Token count |
|---|---|
| mT5 | 250000 |
| PaLM | 256000 |
| GPT4 | 100276 |
| Command A | 255000 |
| DeepSeek | 100000 |
| Qwen 15B | 152064 |
| Yi | 64000 |