Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.
EvoLM consists of over 100 language models with parameter counts ranging from 0.5B to 4B, trained from scratch with open-source training data and training frameworks, enabling controlled experiments that dissect the effects of pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The following table lists all models in EvoLM (BT: billion tokens, FW: FineWeb-Edu, FM: FineMath, ep: epoch); a minimal loading sketch follows the table.
Model Size | Pre-training (BT) | CPT (BT) | SFT (epochs * examples) | RL (epochs * examples) | Link |
---|---|---|---|---|---|
0.5B | 10 | / | / | / | Model |
0.5B | 20 | / | / | / | Model |
0.5B | 40 | / | / | / | Model |
0.5B | 80 | / | / | / | Model |
0.5B | 160 | / | / | / | Model |
0.5B | 320 | / | / | / | Model |
1B | 20 | / | / | / | Model |
1B | 20 | / | ep1 100k | / | Model |
1B | 20 | FW8+FM42 | / | / | Model |
1B | 20 | FW8+FM42 | ep1 100k | / | Model |
1B | 20 | FW8+FM42 | ep1 100k | ep8 100k | Model |
1B | 20 | FM10 | / | / | Model |
1B | 20 | FM30 | / | / | Model |
1B | 20 | FM50 | / | / | Model |
1B | 40 | / | / | / | Model |
1B | 40 | / | ep1 100k | / | Model |
1B | 40 | FW8+FM42 | / | / | Model |
1B | 40 | FW8+FM42 | ep1 100k | / | Model |
1B | 40 | FW8+FM42 | ep1 100k | ep8 100k | Model |
1B | 40 | FM10 | / | / | Model |
1B | 40 | FM30 | / | / | Model |
1B | 40 | FM50 | / | / | Model |
1B | 80 | / | / | / | Model |
1B | 80 | / | ep1 100k | / | Model |
1B | 80 | FW8+FM42 | / | / | Model |
1B | 80 | FW8+FM42 | ep1 100k | / | Model |
1B | 80 | FW8+FM42 | ep1 100k | ep8 100k | Model |
1B | 80 | FM10 | / | / | Model |
1B | 80 | FM30 | / | / | Model |
1B | 80 | FM50 | / | / | Model |
1B | 160 | / | / | / | Model |
1B | 160 | / | ep1 100k | / | Model |
1B | 160 | / | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM2 | / | / | Model |
1B | 160 | FW8+FM2 | ep1 100k | / | Model |
1B | 160 | FW8+FM2 | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM12 | / | / | Model |
1B | 160 | FW8+FM12 | ep1 100k | / | Model |
1B | 160 | FW8+FM12 | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM22 | / | / | Model |
1B | 160 | FW8+FM22 | ep1 100k | / | Model |
1B | 160 | FW8+FM22 | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM32 | / | / | Model |
1B | 160 | FW8+FM32 | ep1 100k | / | Model |
1B | 160 | FW8+FM32 | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | / | / | Model |
1B | 160 | FW8+FM42 | ep1 100k | / | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep1 100k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep2 100k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep4 100k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep16 100k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep8 200k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep8 300k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep8 400k | Model |
1B | 160 | FW8+FM42 | ep1 100k | ep32 100k | Model |
1B | 160 | FW8+FM42 | ep1 200k | / | Model |
1B | 160 | FW8+FM42 | ep1 200k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep1 300k | / | Model |
1B | 160 | FW8+FM42 | ep1 300k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep1 400k | / | Model |
1B | 160 | FW8+FM42 | ep1 400k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep2 100k | / | Model |
1B | 160 | FW8+FM42 | ep2 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep4 100k | / | Model |
1B | 160 | FW8+FM42 | ep4 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep8 100k | / | Model |
1B | 160 | FW8+FM42 | ep8 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep16 100k | / | Model |
1B | 160 | FW8+FM42 | ep16 100k | ep8 100k | Model |
1B | 160 | FW8+FM42 | ep32 100k | / | Model |
1B | 160 | FW8+FM42 | ep32 100k | ep8 100k | Model |
1B | 160 | FW1.6+FM48.4 | / | / | Model |
1B | 160 | FW16+FM34 | / | / | Model |
1B | 160 | FM10 | / | / | Model |
1B | 160 | FM20 | / | / | Model |
1B | 160 | FM30 | / | / | Model |
1B | 160 | FM40 | / | / | Model |
1B | 160 | FM50 | / | / | Model |
1B | 320 | / | / | / | Model |
1B | 320 | / | ep1 100k | / | Model |
1B | 320 | FW8+FM42 | / | / | Model |
1B | 320 | FW8+FM42 | ep1 100k | / | Model |
1B | 320 | FW8+FM42 | ep1 100k | ep8 100k | Model |
2B | 40 | / | / | / | coming soon... |
2B | 80 | / | / | / | coming soon... |
2B | 160 | / | / | / | coming soon... |
2B | 320 | / | / | / | coming soon... |
4B | 80 | / | / | / | Model |
4B | 80 | FW8+FM42 | / | / | Model |
4B | 80 | FW8+FM42 | ep1 100k | / | Model |
4B | 80 | FW8+FM42 | ep1 100k | ep8 100k | Model |
4B | 160 | / | / | / | Model |
4B | 160 | / | ep1 100k | / | Model |
4B | 160 | / | ep1 100k | ep8 100k | Model |
4B | 160 | FW8+FM2 | ep1 100k | ep8 100k | coming soon... |
4B | 160 | FW8+FM12 | ep1 100k | ep8 100k | coming soon... |
4B | 160 | FW8+FM22 | ep1 100k | ep8 100k | coming soon... |
4B | 160 | FW8+FM32 | ep1 100k | ep8 100k | coming soon... |
4B | 160 | FW8+FM42 | / | / | Model |
4B | 160 | FW8+FM42 | ep1 100k | / | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep1 100k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep2 100k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep4 100k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep16 100k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep8 200k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep8 300k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep8 400k | Model |
4B | 160 | FW8+FM42 | ep1 100k | ep32 100k | Model |
4B | 160 | FW8+FM42 | ep1 200k | / | Model |
4B | 160 | FW8+FM42 | ep1 200k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep1 300k | / | Model |
4B | 160 | FW8+FM42 | ep1 300k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep1 400k | / | Model |
4B | 160 | FW8+FM42 | ep1 400k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep2 100k | / | Model |
4B | 160 | FW8+FM42 | ep2 100k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep4 100k | / | Model |
4B | 160 | FW8+FM42 | ep4 100k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep8 100k | / | Model |
4B | 160 | FW8+FM42 | ep8 100k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep16 100k | / | Model |
4B | 160 | FW8+FM42 | ep16 100k | ep8 100k | Model |
4B | 160 | FW8+FM42 | ep32 100k | / | Model |
4B | 160 | FW8+FM42 | ep32 100k | ep8 100k | Model |
4B | 320 | / | / | / | Model |
4B | 320 | FW8+FM42 | / | / | Model |
4B | 320 | FW8+FM42 | ep1 100k | / | Model |
4B | 320 | FW8+FM42 | ep1 100k | ep8 100k | Model |
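Assuming the checkpoints follow the standard Hugging Face causal-LM format (each `Model` entry links to a Hugging Face repository), they can be loaded with the usual `transformers` API. The sketch below uses a placeholder repository id; substitute the actual link from the table.

```python
# Minimal sketch for loading an EvoLM checkpoint with Hugging Face transformers.
# NOTE: "EvoLM/placeholder-1B-160BT" is a hypothetical id; replace it with the
# repository linked in the table above.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "EvoLM/placeholder-1B-160BT"  # placeholder, not a real repository name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

prompt = "Question: What is 12 * 7?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```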
To ensure a systematic and transparent analysis of language model (LM) capabilities, we established a rigorous evaluation protocol that spans both upstream (language modeling) and downstream (problem-solving) tasks. This comprehensive setup enables robust benchmarking across all stages of the EvoLM training pipeline.
We evaluate pre-trained and continued-pretrained models using a suite of cloze-style language modeling benchmarks, which focus on next-token prediction without requiring conversational abilities. The selected datasets are widely used for assessing general reasoning and language understanding:
We report average zero-shot accuracy across these benchmarks, providing a high-level view of each model's raw language modeling strength.
For a practical assessment of problem-solving and reasoning, we test supervised fine-tuned and RL-finetuned models on open-ended, generative tasks. The evaluation covers both in-domain and out-of-domain (OOD) challenges:
All tasks are evaluated in a zero-shot setting, where models generate full solutions without prior exposure to the specific test items.
To thoroughly assess performance, we employ several robust metrics under diverse sampling protocols:
Final answers are automatically extracted and compared to ground truth for precise, objective scoring.
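As a rough illustration of this answer-extraction step (a sketch of the idea, not the released pipeline's actual code), a final numeric answer can be pulled from a generated solution and compared against the reference:

```python
import re

def extract_final_answer(generation: str) -> str | None:
    """Heuristically pull a final answer from a generated solution.

    Illustrative only: first look for a GSM8K-style '#### <answer>' marker,
    then fall back to the last number in the text.
    """
    marker = re.search(r"####\s*(.+)", generation)
    if marker:
        return marker.group(1).strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return numbers[-1] if numbers else None

def exact_match(generations: list[str], references: list[str]) -> float:
    """Fraction of generations whose extracted answer equals the reference."""
    hits = sum(
        extract_final_answer(g) == r.strip()
        for g, r in zip(generations, references)
    )
    return hits / len(references)

# Example: one correct and one incorrect prediction -> accuracy 0.5
print(exact_match(["... so the total is 42. #### 42", "The answer is 7."], ["42", "8"]))
```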
We rigorously benchmark LMs of different sizes, scaling pre-training tokens far beyond conventional recipes (the Chinchilla-optimal budget of roughly 20 tokens per parameter). As visualized in Figure 1, we track accuracy at every token budget: additional tokens initially yield clear gains, but beyond a certain threshold, extra pre-training becomes markedly less cost-effective.
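To make the gap concrete, the 20-tokens-per-parameter rule of thumb can be compared against the budgets in the table above. The back-of-the-envelope calculation below assumes the nominal parameter counts (0.5B, 1B, 4B):

```python
# Compare EvoLM pre-training budgets against the Chinchilla-optimal rule of thumb
# (~20 tokens per parameter), using the nominal model sizes from the table.
CHINCHILLA_TOKENS_PER_PARAM = 20

models = {"0.5B": 0.5e9, "1B": 1e9, "4B": 4e9}            # nominal parameter counts
budgets_bt = {"0.5B": [10, 20, 40, 80, 160, 320],          # pre-training budgets (billion tokens)
              "1B": [20, 40, 80, 160, 320],
              "4B": [80, 160, 320]}

for name, params in models.items():
    optimal_bt = CHINCHILLA_TOKENS_PER_PARAM * params / 1e9
    ratios = [bt / optimal_bt for bt in budgets_bt[name]]
    print(f"{name}: Chinchilla-optimal ≈ {optimal_bt:.0f}BT; "
          f"table budgets span {min(ratios):.1f}x–{max(ratios):.1f}x optimal")
```

For the 1B models, for example, the 320BT budget is 16x the Chinchilla-optimal 20BT.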
Does endlessly scaling pre-training always help with downstream tasks? In Figure 2, we evaluate how different pre-training budgets affect downstream performance, both on tasks similar to the mid-training and post-training data and on novel (OOD) tasks. Remarkably, excessive pre-training does not always improve downstream reasoning, and can even harm it.
A common assumption is that larger models always outperform their smaller counterparts. We test whether this holds under limited pre-training resources by directly comparing 1B and 4B models at fixed compute and data budgets (Table 2) and examining their downstream results. Under tight budgets, a well-tuned small model can be more effective; larger models pull ahead only once a certain data threshold is met.
How do models adapt to new domains without forgetting old knowledge? We investigate continued pre-training (CPT) with and without "replay" of general-domain data, tracking upstream performance in Figure 3 and downstream performance in Table 1. A small fraction of general-domain replay (just 5%) proves critical for balancing new skills with retention of broad knowledge, a simple but powerful practice for domain adaptation.
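The replay itself is just a data-mixing choice. As a minimal sketch (assuming the Hugging Face `datasets` library; the dataset names and configurations are illustrative, not necessarily the exact CPT corpora used here), a 95/5 domain/general mixture can be built with `interleave_datasets`:

```python
# Minimal sketch of building a CPT mixture with 5% general-domain replay.
# Dataset names/configs are illustrative placeholders for the domain-specific
# (FineMath) and general-domain (FineWeb-Edu) corpora.
from datasets import load_dataset, interleave_datasets

domain = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                      split="train", streaming=True).select_columns(["text"])
general = load_dataset("HuggingFaceFW/fineweb-edu",
                       split="train", streaming=True).select_columns(["text"])

# 95% domain-specific tokens, 5% general-domain replay; the replay ratio is the knob studied.
cpt_mixture = interleave_datasets([domain, general], probabilities=[0.95, 0.05], seed=42)

for example in cpt_mixture.take(3):
    print(example["text"][:80])
```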
We test the effect of varying the CPT data volume on downstream performance, with results shown in Figure 4.
Inadequate domain data risks leaving the model poorly adapted, even after SFT or RL. Investing in rich domain datasets, on the other hand, is essential for strong post-training results.
Our study reveals a sustained upward trend in in-domain accuracy as more CPT tokens are used. This improvement justifies investing in larger domain-specific datasets, especially when downstream reasoning or RL enhancement is desired.
We analyze the effect of high-volume CPT on both in-domain and OOD tasks. Well-designed domain adaptation can create more flexible models—not just specialists—by strengthening transferable reasoning abilities to OOD tasks.
We vary both SFT epochs and dataset size, charting downstream metrics in Figures 5 and 6, and show that more SFT is not always better. Overfitting is real and can hurt generalization; fine-tuning should be done with care and strong validation.
By systematically increasing SFT before RL, we measure the headroom left for further RL gains. When a model is already over-specialized from SFT, RL has little left to improve. Keeping SFT at a balanced level leaves more opportunity for RL to make a difference.
We scale RL epochs and data size (see Figure 7), documenting how performance changes across regimes. For both ID and OOD tasks, most of RL's benefit comes early: in our study on 1B models, targeting 4–8 epochs or ~100K examples strikes a practical balance between results and cost.
Does RL make the model reason better, or just sample more confidently? We examine RL's effect on solution diversity and quality and show that, after saturation, RL mainly sharpens the output distribution: it helps the model sample correct answers more often, but it does not fundamentally improve reasoning skill, i.e., it does not enable solving problems the model could not solve before.
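One way to see this distinction empirically is to compare pass@1 against pass@k for large k: distribution sharpening raises pass@1 while leaving pass@k roughly flat, whereas a genuine capability gain lifts both. Below is a minimal sketch using the standard unbiased pass@k estimator (an illustration with toy numbers, not our pipeline's exact code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Codex paper):
    n = samples drawn per problem, c = number of correct samples, k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy correct-sample counts per problem, out of n = 16 samples each.
n = 16
correct_counts = [0, 2, 5, 16, 1, 0, 8, 3]

for k in (1, 4, 16):
    score = float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))
    print(f"pass@{k}: {score:.3f}")
```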
Given a limited downstream budget, should you spend it on SFT or RL? We experiment with different SFT/RL splits (see Figure 8) on 1B and 4B models to quantify the trade-off between in-domain and OOD performance. The allocation should follow the goal: favor SFT for specialists and RL for generalists. This helps tailor the LM to the tasks that matter most.
We assess how well ORM (Outcome Reward Model) scores predict success on downstream tasks. ORM scoring provides a reliable picture of reasoning quality after post-training, helping researchers and engineers monitor and optimize their models, especially when a validation set with ground-truth labels is unavailable or too expensive to collect.
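As a rough sketch of how such verifier-free monitoring can work (assuming a generic sequence-classification reward model with a single-logit head; the model id is a placeholder, not necessarily the ORM used in our experiments):

```python
# Sketch: score generated solutions with an outcome reward model (ORM) and check
# how well the scores track ground-truth correctness on a small labeled slice.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

orm_id = "placeholder/outcome-reward-model"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(orm_id)
orm = AutoModelForSequenceClassification.from_pretrained(orm_id, torch_dtype="auto")
orm.eval()

def orm_score(question: str, solution: str) -> float:
    """Return a scalar ORM score for a (question, solution) pair (single-logit head assumed)."""
    inputs = tokenizer(question, solution, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return orm(**inputs).logits.squeeze().item()

# Toy validation slice: generated solutions plus 0/1 ground-truth correctness labels.
questions = ["What is 3 + 5?", "What is 3 + 5?", "What is 6 * 7?", "What is 6 * 7?"]
solutions = ["3 + 5 = 8, so the answer is 8.", "3 + 5 = 9, so the answer is 9.",
             "6 * 7 = 42, so the answer is 42.", "6 * 7 = 36, so the answer is 36."]
correctness = [1, 0, 1, 0]

scores = [orm_score(q, s) for q, s in zip(questions, solutions)]
rho, _ = spearmanr(scores, correctness)
print(f"Spearman correlation between ORM score and correctness: {rho:.2f}")
```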
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Theme by Nerfies.