microgpt — A 200-line pure Python GPT
Andrej Karpathy released microgpt: a single file of ~200 lines of pure Python with no dependencies that trains and runs inference on a GPT. The file contains the complete algorithmic essence: dataset, tokenizer, autograd engine, GPT-2-like neural network, Adam optimizer, training loop, and inference loop.
The goal: distill the entire algorithmic content of what an LLM needs down to the bare minimum—no efficiency tricks, no GPU requirements, just the math.
Core Components
Dataset: 32,000 names from a text file. Each name is a "document".
Tokenizer: Simplest possible: assign one integer ID to each unique character (26 lowercase letters + a BOS token, for a vocabulary of 27).
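A character-level tokenizer of this kind fits in a few lines. The sketch below is an illustration rather than the code from microgpt: the three-name `docs` list is a stand-in for the real dataset, the names `stoi`, `itos`, `encode`, and `decode` are hypothetical, and wrapping each name in BOS on both sides is an assumption about how document boundaries are marked.

```python
# Toy stand-in for the names dataset (the real file has ~32,000 names).
docs = ["emma", "olivia", "ava"]

# One integer ID per unique character, sorted so the mapping is deterministic.
chars = sorted(set("".join(docs)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# The BOS token takes the next free ID after the characters.
BOS = len(chars)
vocab_size = len(chars) + 1


def encode(name):
    """Map a name to token IDs, delimited by BOS (assumed convention)."""
    return [BOS] + [stoi[ch] for ch in name] + [BOS]


def decode(ids):
    """Map token IDs back to text, dropping BOS markers."""
    return "".join(itos[i] for i in ids if i != BOS)


print(vocab_size)
print(encode("ava"))
print(decode(encode("emma")))
```

On the full dataset the same construction would give 26 letters plus BOS, matching the vocabulary of 27 described above.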
Autograd: A single Value class wraps a scalar, records the computation graph as operations are applied, and implements backward() by applying the chain rule in reverse topological order. This is the same algorithm as PyTorch's .backward(), just operating on scalars instead of tensors.
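To make the idea concrete, here is a minimal sketch of a scalar autograd class supporting only addition and multiplication. It is not the Value class from microgpt; the attribute names `_children` and `_local_grads` are assumptions chosen for this illustration.

```python
class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs to the op that produced this value
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so each node is processed after all its consumers.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        # Chain rule: accumulate (local gradient) * (upstream gradient) into each child.
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad


a = Value(2.0)
b = Value(-3.0)
y = a * b + a * a   # y = ab + a^2
y.backward()
print(y.data, a.grad, b.grad)  # dy/da = b + 2a, dy/db = a
```

Gradients are accumulated with `+=` because a value used more than once (like `a` here) receives a contribution from each use.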
Architecture: GPT-2-like Transformer with:
- Token embeddings + position embeddings
- Single-head attention (configurable to multi-head)
- RMSNorm instead of LayerNorm; no biases
- ReLU instead of GELU
- Residual connections
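The building blocks in this list can be sketched on plain Python lists. The code below is an illustrative approximation, not microgpt's implementation: the function names are hypothetical, the pre-norm placement of RMSNorm inside the residual block is an assumption, and the RMSNorm shown has no learned gain.

```python
import math


def rmsnorm(x, eps=1e-5):
    """Rescale a vector so its root-mean-square is about 1 (no mean subtraction)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def linear(x, w):
    """Bias-free matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head attention for the current position.

    keys/values hold only positions up to the current one, which is
    what makes the attention causal.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def mlp_block(x, w1, w2):
    """Residual MLP (assumed pre-norm): x + W2 * relu(W1 * rmsnorm(x))."""
    h = linear(relu(linear(rmsnorm(x), w1)), w2)
    return [a + b for a, b in zip(x, h)]


normed = rmsnorm([3.0, 4.0])
rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(rms)
```

Multi-head attention would split the query, key, and value vectors into equal slices, run `attend` on each slice, and concatenate the results.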
Training: 1,000 steps on a tiny model (4,192 parameters). Loss drops from ~3.3, the level expected from uniform random guessing over 27 tokens (ln 27 ≈ 3.30), to ~2.37. The trained model then samples new names:
kamon, ann, karai, jaire, vialan, karia, yeran, anna, areli, kaina, conna, keylen, liole, alerin, earan, lenne, kana, lara, alela, anton
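The numbers quoted for training can be sanity-checked with a little arithmetic. In the sketch below, the loss calculations follow directly from the vocabulary size of 27. The parameter count, however, rests on hyperparameters that this summary does not state: `n_embd`, `block_size`, `n_layer`, a 4x MLP expansion, and an untied output projection are all assumptions, chosen as one configuration that happens to reproduce 4,192.

```python
import math

vocab_size = 27  # 26 lowercase letters + BOS

# Cross-entropy of a uniform guess over the vocabulary: -ln(1/27) = ln(27).
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))

# exp(loss) is the effective number of equally likely choices per character.
effective_choices = math.exp(2.37)
print(round(effective_choices, 1))

# Hypothetical hyperparameters (assumed, not given in the text above).
n_embd, block_size, n_layer = 16, 16, 1

token_emb = vocab_size * n_embd        # token embedding table
pos_emb = block_size * n_embd          # position embedding table
lm_head = vocab_size * n_embd          # output projection (assumed untied)
attn = 4 * n_embd * n_embd             # Q, K, V, and output matrices, no biases
mlp = 2 * n_embd * (4 * n_embd)        # up- and down-projection, assumed 4x width

total_params = token_emb + pos_emb + lm_head + n_layer * (attn + mlp)
print(total_params)
```

Read this way, training narrows the model's uncertainty from 27 equally likely next characters to roughly 10.7.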
Key insight: The core of modern LLMs (ChatGPT etc.) is this same loop (predict the next token, sample, repeat), massively scaled up in data, model size, and compute.
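The predict, sample, repeat loop can be written independently of any particular model. In the sketch below, `generate` is a hypothetical helper and `toy_logits` is a hand-written stand-in for a trained network over a three-token vocabulary; stopping when BOS is sampled is an assumed end-of-name convention.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, bos, max_len=16, temperature=1.0, rng=random):
    """Autoregressive sampling: predict logits, sample a token, append, repeat."""
    tokens = [bos]
    while len(tokens) <= max_len:
        probs = softmax(next_logits(tokens), temperature)
        nxt = rng.choices(range(len(probs)), weights=probs)[0]
        if nxt == bos:  # BOS doubles as the end-of-name marker (assumed)
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the leading BOS


# Toy vocabulary: 0 = "a", 1 = "b", 2 = BOS.
BOS = 2
TABLE = {
    BOS: [2.0, 1.0, -5.0],  # names rarely end immediately
    0: [0.0, 1.5, 0.5],     # after "a", "b" is most likely
    1: [1.0, -1.0, 1.0],    # after "b", either "a" or stop
}


def toy_logits(tokens):
    """Stand-in model: logits depend only on the previous token."""
    return TABLE[tokens[-1]]


rng = random.Random(0)  # seeded for reproducibility
sample = generate(toy_logits, BOS, max_len=8, rng=rng)
print("".join("ab"[t] for t in sample))
```

Swapping `toy_logits` for a function that runs a trained Transformer over `tokens` leaves the loop itself unchanged.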
What makes it different from production LLMs?
| Aspect | microgpt | Production LLMs |
|--------|----------|-----------------|
| Data | 32K names | Trillions of tokens |
| Vocab | 27 characters | ~100K subword tokens |
| Parameters | 4,192 | Hundreds of billions |
| Autograd | Pure Python scalars | GPU tensors + CUDA kernels |
| Training | Single-threaded Python | Distributed GPU clusters |
| Post-training | None | SFT + RLHF |
The core algorithm is identical. Everything else is engineering at scale.
Source: karpathy.github.io | GitHub Gist