Do you still remember Andrej Karpathy's project to reproduce GPT-2 in pure C?
In April this year, Karpathy, a leading figure in AI, released "llm.c", a project that implements GPT-2 training on CPU/fp32 in only about 1,000 lines of code, and it triggered heated discussion in the machine learning community.
llm.c is designed to greatly simplify the training of large models. It uses pure C/CUDA and does not require the 245MB PyTorch or 107MB CPython dependencies. Even with this optimization, reproducing a GPT-2-level model still took 45 minutes of training on 8 H100s.
Unexpectedly, just a few months later, the state of the art has advanced dramatically, which surprised even Karpathy himself:
A new project called "Modded-NanoGPT" has appeared on GitHub. Its techniques have been iterated on substantially, and it now achieves the same result in only 5 minutes. The author, Keller Jordan, previously worked at Hive AI, and his research has long focused on optimizing model training. He said on Wednesday that he had improved his speed record from 7.2 minutes to 5 minutes by using FlexAttention with large sequence lengths.
Now, with FlexAttention and a larger sequence length, documents are split less often, so language modeling becomes easier at both training and validation time. The record run has a slightly lower HellaSwag accuracy of around 29%, compared with around 30% for the previous record and for Andrej Karpathy's original training.
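To give a rough sense of what this means in code: PyTorch's FlexAttention API lets you express a "document-causal" mask, so that each token attends only to earlier tokens within its own document even when many documents are packed into one long sequence. The snippet below is a minimal illustrative sketch, not Modded-NanoGPT's actual training code; the docs tensor, the toy sizes, and the assumption of a CUDA device are made up for the example.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 4, 256, 64                      # toy batch / head / sequence / head-dim sizes
docs = torch.arange(S, device="cuda") // 64     # docs[i] = document id of token i (hypothetical)

def document_causal(b, h, q_idx, kv_idx):
    # Attend only to earlier tokens that belong to the same document.
    return (q_idx >= kv_idx) & (docs[q_idx] == docs[kv_idx])

# Arguments are (mask_mod, B, H, Q_LEN, KV_LEN); None broadcasts over batch and heads.
block_mask = create_block_mask(document_causal, None, None, S, S)

q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

out = flex_attention(q, k, v, block_mask=block_mask)   # shape (B, H, S, D)

In real training, flex_attention is typically wrapped in torch.compile so that the masked attention runs as a fused kernel.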
Let’s see how he does it:
Project link: https://github.com/KellerJordan/modded-nanogpt/tree/master
Modded-NanoGPT
The project, called "Modded-NanoGPT", is an improved variant of the PyTorch GPT-2 trainer from the llm.c repository:
10B training tokens --> 1B training tokens
45 minutes of training on 8xH100 --> 5 minutes of training on 8xH100
Modded-NanoGPT uses the following techniques:
Advanced architecture: rotary embeddings, QK-Norm, and ReLU^2;
New optimizer: Muon;
Head untied from the embedding;
Projection and classification layers initialized to zero (muP-like);
Architectural shortcuts: value residual and embedding shortcut (partially following the paper "Value Residual Learning For Alleviating Attention Concentration In Transformers");
Momentum warmup;
Tanh soft logit capping (following Gemma 2);
FlexAttention.
A brief illustrative sketch of two of these tweaks appears below, after the training instructions. To train, run the following commands:
pip install -r requirements.txt
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124 --upgrade  # install torch 2.6.0
python data/cached_fineweb10B.py 10  # downloads only the first 1.0B training tokens to save time
./run.sh
On an 8xH100 machine with a good network connection, the run should complete within 20 minutes.
The result will be a transformer with 124M active parameters, trained for 1875 steps on 1 billion Fineweb tokens, achieving a validation loss of ~3.278. In comparison, the default llm.c PyTorch trainer achieved a validation loss of >3.28 after training for 19560 steps on 10 billion tokens.
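To make two of the architectural tweaks listed above concrete, here is a minimal illustrative sketch of a ReLU^2 feed-forward block and tanh soft logit capping. The class name, layer sizes, and the cap value of 30 are assumptions chosen for the example, not values taken from the repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU2MLP(nn.Module):
    # Feed-forward block using the ReLU^2 activation, i.e. relu(x) squared.
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.fc_out(F.relu(self.fc_in(x)).square())

def soft_cap_logits(logits, cap=30.0):
    # Tanh soft capping in the style of Gemma 2: squashes logits into (-cap, cap).
    return cap * torch.tanh(logits / cap)

x = torch.randn(2, 8, 768)                                          # (batch, sequence, model dim), made-up sizes
print(ReLU2MLP(768, 3072)(x).shape)                                 # torch.Size([2, 8, 768])
print(soft_cap_logits(torch.randn(2, 8, 50257)).abs().max() < 30)   # tensor(True)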
It is worth mentioning that to run Modded-NanoGPT on fewer GPUs, you only need to modify run.sh to pass a different --nproc_per_node. If you run out of memory, reduce device_batch_size to 16 or 32 in train_gpt2.py.
Here is a startup script for a fresh 8xH100 instance:
sudo apt-get update
sudo apt-get install vim tmux python3-pip python-is-python3 -y
git clone https://github.com/KellerJordan/modded-nanogpt.git
cd modded-nanogpt
tmux
pip install numpy==1.23.5 huggingface-hub tqdm
pip install --upgrade torch &
python data/cached_fineweb10B.py 18
If the CUDA or NCCL versions are incompatible with your current system setup, Docker can be a useful alternative. This approach standardizes the versions of CUDA, NCCL, cuDNN, and Python, reducing dependency issues and simplifying setup. Note: NVIDIA drivers must already be installed on the system.
sudo docker build -t modded-nanogpt .
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt python data/cached_fineweb10B.py 18
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh
One objection is that while it is great that NanoGPT trains quickly, perhaps it does not scale and is simply overfitting the validation loss? Keller Jordan says this is hard to refute, because "at scale" is an infinite category (what if these methods do not work for models larger than 100T?) and therefore cannot be fully proven. He also agrees that some of the methods used in the fast runs are unlikely to scale. But readers who care about the 1.5B scale may be convinced by this result:
Directly scaling the fast run (the 10/18/24 version) up to 1.5B parameters yields a model with GPT-2 (1.5B)-level HellaSwag performance, at 2.5 times lower cost than Karpathy's baseline ($233 vs. $576):
Muon optimizer
In addition to building on the work of his predecessors, Keller Jordan also uses an optimizer of his own design in the new project: Muon, which he says is currently the fastest known optimizer for a variety of training scenarios, including CIFAR-10 and GPT-2-scale language modeling.
Muon is defined as follows:
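In outline (paraphrasing Keller Jordan's description, with G_t the gradient of a weight matrix, B_t the momentum buffer, μ the momentum coefficient, and η the learning rate):

B_t = μ · B_{t-1} + G_t
O_t = NewtonSchulz5(B_t)
θ_t = θ_{t-1} − η · O_t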
where NewtonSchulz5 is the Newton-Schulz iteration shown below, which approximately replaces G with U @ V.T, where U, S, V = G.svd().
@torch.compile
def zeroth_power_via_newtonschulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (i.e., replace it with U @ V.T from its SVD)
    # using a quintic Newton-Schulz iteration run in bfloat16.
    assert len(G.shape) == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16() / (G.norm() + eps)   # normalize so all singular values are at most 1
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
The wall-clock overhead of Muon is less than 2%, and many of its design choices were arrived at experimentally through fast training runs. Notable lessons include: using Nesterov momentum in the update, with orthogonalization applied after the momentum; using a specifically quintic Newton-Schulz iteration as the orthogonalization method; and using non-convergent coefficients for the quintic polynomial in order to maximize the slope at zero and thereby minimize the number of Newton-Schulz iterations needed. It turns out that the variance does not actually matter that much, so the result is a quintic that quickly converges to the range (0.68, 1.13) under repeated application, rather than to 1. The Newton-Schulz iteration is run in bfloat16 (whereas Shampoo implementations typically compute their preconditioners via inverse p-th roots run in fp32 or fp64). Orthogonalization via Newton-Schulz iterations can be traced back to Bernstein & Newhouse (2024), who suggested it as a way of computing Shampoo preconditioners and theoretically explored Shampoo without preconditioner accumulation. Keller Jordan specifically thanks Jeremy Bernstein, one of the paper's authors, for his help.
If SVD were used here instead of Newton-Schulz iterations, this optimizer would be too slow to be usable. Bernstein & Newhouse also pointed out that Shampoo without preconditioner accumulation is equivalent to steepest descent in the spectral norm, so Shampoo can be viewed as a way of smoothing spectral steepest descent. The optimizer proposed here can be seen as a second way of smoothing spectral steepest descent, with different memory and runtime trade-offs compared to Shampoo.
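As a quick sanity check of the claim that the Newton-Schulz iteration approximates the SVD-based orthogonalization U @ V.T, the two can be compared directly on a random matrix. The snippet below is illustrative only; it assumes the zeroth_power_via_newtonschulz5 function shown above and a CUDA device.

import torch

G = torch.randn(256, 128, device="cuda")

U, S, Vh = torch.linalg.svd(G, full_matrices=False)
exact = U @ Vh                                  # the exact orthogonalization U @ V.T

approx = zeroth_power_via_newtonschulz5(G).float()

# As described above, the singular values of the approximation should land
# roughly in the range (0.68, 1.13) rather than being exactly 1.
sv = torch.linalg.svdvals(approx)
print(sv.min().item(), sv.max().item())
print(((approx - exact).norm() / exact.norm()).item())   # modest but nonzero relative error is expected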