Training Intervention Experiments on a GPT-2-Scale Model: Learning Rate Is the Biggest Factor, and Dropout Actually Hurts

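Source: Giles Thomas | 2026-04-23 URL: com/2026/04/llm-from-scratch-32m-interventions-conclusion

While training his own GPT-2-scale model (163M parameters, 44 hours of local training), Giles Thomas systematically tested a range of training interventions. Ranked by effect: learning rate adjustment plus scheduling (the largest gain); weight decay (effective); QKV bias (a small help); gradient clipping (limited effect); PyTorch AMP (doubled training speed at the cost of slightly worse loss); weight tying (which actually made …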
