
EvoLM: In Search of Lost Language Model Training Dynamics

Harvard, Stanford, EPFL, CMU

Abstract

Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

Model Suite

EvoLM consists of over 100 language models, with parameter counts ranging from 0.5B to 4B, trained from scratch with open-source training data and frameworks, enabling controlled experiments that dissect the effects of pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The following table lists all models in EvoLM (BT: billion tokens, FW: FineWeb-Edu, FM: FineMath, ep: epochs).

Model Size | Pre-training (BT) | CPT (BT) | SFT (epochs * examples) | RL (epochs * examples) | Link
0.5B 10 / / / Model
0.5B 20 / / / Model
0.5B 40 / / / Model
0.5B 80 / / / Model
0.5B 160 / / / Model
0.5B 320 / / / Model
1B 20 / / / Model
1B 20 / ep1 100k / Model
1B 20 FW8+FM42 / / Model
1B 20 FW8+FM42 ep1 100k / Model
1B 20 FW8+FM42 ep1 100k ep8 100k Model
1B 20 FM10 / / Model
1B 20 FM30 / / Model
1B 20 FM50 / / Model
1B 40 / / / Model
1B 40 / ep1 100k / Model
1B 40 FW8+FM42 / / Model
1B 40 FW8+FM42 ep1 100k / Model
1B 40 FW8+FM42 ep1 100k ep8 100k Model
1B 40 FM10 / / Model
1B 40 FM30 / / Model
1B 40 FM50 / / Model
1B 80 / / / Model
1B 80 / ep1 100k / Model
1B 80 FW8+FM42 / / Model
1B 80 FW8+FM42 ep1 100k / Model
1B 80 FW8+FM42 ep1 100k ep8 100k Model
1B 80 FM10 / / Model
1B 80 FM30 / / Model
1B 80 FM50 / / Model
1B 160 / / / Model
1B 160 / ep1 100k / Model
1B 160 / ep1 100k ep8 100k Model
1B 160 FW8+FM2 / / Model
1B 160 FW8+FM2 ep1 100k / Model
1B 160 FW8+FM2 ep1 100k ep8 100k Model
1B 160 FW8+FM12 / / Model
1B 160 FW8+FM12 ep1 100k / Model
1B 160 FW8+FM12 ep1 100k ep8 100k Model
1B 160 FW8+FM22 / / Model
1B 160 FW8+FM22 ep1 100k / Model
1B 160 FW8+FM22 ep1 100k ep8 100k Model
1B 160 FW8+FM32 / / Model
1B 160 FW8+FM32 ep1 100k / Model
1B 160 FW8+FM32 ep1 100k ep8 100k Model
1B 160 FW8+FM42 / / Model
1B 160 FW8+FM42 ep1 100k / Model
1B 160 FW8+FM42 ep1 100k ep1 100k Model
1B 160 FW8+FM42 ep1 100k ep2 100k Model
1B 160 FW8+FM42 ep1 100k ep4 100k Model
1B 160 FW8+FM42 ep1 100k ep16 100k Model
1B 160 FW8+FM42 ep1 100k ep8 100k Model
1B 160 FW8+FM42 ep1 100k ep8 200k Model
1B 160 FW8+FM42 ep1 100k ep8 300k Model
1B 160 FW8+FM42 ep1 100k ep8 400k Model
1B 160 FW8+FM42 ep1 100k ep32 100k Model
1B 160 FW8+FM42 ep1 200k / Model
1B 160 FW8+FM42 ep1 200k ep8 100k Model
1B 160 FW8+FM42 ep1 300k / Model
1B 160 FW8+FM42 ep1 300k ep8 100k Model
1B 160 FW8+FM42 ep1 400k / Model
1B 160 FW8+FM42 ep1 400k ep8 100k Model
1B 160 FW8+FM42 ep2 100k / Model
1B 160 FW8+FM42 ep2 100k ep8 100k Model
1B 160 FW8+FM42 ep4 100k / Model
1B 160 FW8+FM42 ep4 100k ep8 100k Model
1B 160 FW8+FM42 ep8 100k / Model
1B 160 FW8+FM42 ep8 100k ep8 100k Model
1B 160 FW8+FM42 ep16 100k / Model
1B 160 FW8+FM42 ep16 100k ep8 100k Model
1B 160 FW8+FM42 ep32 100k / Model
1B 160 FW8+FM42 ep32 100k ep8 100k Model
1B 160 FW1.6+FM48.4 / / Model
1B 160 FW16+FM34 / / Model
1B 160 FM10 / / Model
1B 160 FM20 / / Model
1B 160 FM30 / / Model
1B 160 FM40 / / Model
1B 160 FM50 / / Model
1B 320 / / / Model
1B 320 / ep1 100k / Model
1B 320 FW8+FM42 / / Model
1B 320 FW8+FM42 ep1 100k / Model
1B 320 FW8+FM42 ep1 100k ep8 100k Model
2B 40 / / / coming soon...
2B 80 / / / coming soon...
2B 160 / / / coming soon...
2B 320 / / / coming soon...
4B 80 / / / Model
4B 80 FW8+FM42 / / Model
4B 80 FW8+FM42 ep1 100k / Model
4B 80 FW8+FM42 ep1 100k ep8 100k Model
4B 160 / / / Model
4B 160 / ep1 100k / Model
4B 160 / ep1 100k ep8 100k Model
4B 160 FW8+FM2 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM12 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM22 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM32 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM42 / / Model
4B 160 FW8+FM42 ep1 100k / Model
4B 160 FW8+FM42 ep1 100k ep1 100k Model
4B 160 FW8+FM42 ep1 100k ep2 100k Model
4B 160 FW8+FM42 ep1 100k ep4 100k Model
4B 160 FW8+FM42 ep1 100k ep16 100k Model
4B 160 FW8+FM42 ep1 100k ep8 100k Model
4B 160 FW8+FM42 ep1 100k ep8 200k Model
4B 160 FW8+FM42 ep1 100k ep8 300k Model
4B 160 FW8+FM42 ep1 100k ep8 400k Model
4B 160 FW8+FM42 ep1 100k ep32 100k Model
4B 160 FW8+FM42 ep1 200k / Model
4B 160 FW8+FM42 ep1 200k ep8 100k Model
4B 160 FW8+FM42 ep1 300k / Model
4B 160 FW8+FM42 ep1 300k ep8 100k Model
4B 160 FW8+FM42 ep1 400k / Model
4B 160 FW8+FM42 ep1 400k ep8 100k Model
4B 160 FW8+FM42 ep2 100k / Model
4B 160 FW8+FM42 ep2 100k ep8 100k Model
4B 160 FW8+FM42 ep4 100k / Model
4B 160 FW8+FM42 ep4 100k ep8 100k Model
4B 160 FW8+FM42 ep8 100k / Model
4B 160 FW8+FM42 ep8 100k ep8 100k Model
4B 160 FW8+FM42 ep16 100k / Model
4B 160 FW8+FM42 ep16 100k ep8 100k Model
4B 160 FW8+FM42 ep32 100k / Model
4B 160 FW8+FM42 ep32 100k ep8 100k Model
4B 320 / / / Model
4B 320 FW8+FM42 / / Model
4B 320 FW8+FM42 ep1 100k / Model
4B 320 FW8+FM42 ep1 100k ep8 100k Model
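
If the released checkpoints follow the standard Hugging Face format (an assumption on our part; check each model page), they can be loaded as ordinary causal LMs. The sketch below uses a placeholder repository id, not an actual EvoLM repo name, so substitute the link of the model you want from the table above.

```python
# Minimal sketch: loading a released EvoLM checkpoint with Hugging Face Transformers.
# NOTE: "evolm/1B-160BT-example" is a placeholder id; use the actual repository
# linked in the table above.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "evolm/1B-160BT-example"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Question: What is 12 * 7? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```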

Evaluation Protocol

To ensure a systematic and transparent analysis of language model (LM) capabilities, we establish a rigorous evaluation protocol that spans both upstream (language modeling) and downstream (problem-solving) tasks. This setup enables consistent benchmarking across all stages of the EvoLM training pipeline.

Upstream Cloze Tasks

We evaluate pre-trained and continually pre-trained models using a suite of cloze-style language modeling benchmarks, which test next-token prediction without requiring conversational abilities. The selected datasets are widely used for assessing general reasoning and language understanding:

  • HellaSwag: Commonsense completion
  • Winogrande: Coreference reasoning
  • PIQA: Physical commonsense reasoning
  • OBQA: Open book question answering
  • ARC-Easy/Challenge: Science and multi-step reasoning

We report average zero-shot accuracy across these benchmarks, providing a high-level view of each model's raw language modeling strength.
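
As one concrete way to reproduce this kind of aggregate number, the sketch below uses EleutherAI's lm-evaluation-harness; this tooling choice and the placeholder model id are assumptions for illustration, and our own evaluation pipeline may differ.

```python
# Sketch: average zero-shot accuracy over the upstream cloze benchmarks,
# assuming EleutherAI's lm-evaluation-harness (the project's pipeline may differ).
import lm_eval

TASKS = ["hellaswag", "winogrande", "piqa", "openbookqa", "arc_easy", "arc_challenge"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=evolm/1B-160BT-example",  # placeholder repo id
    tasks=TASKS,
    num_fewshot=0,                                    # zero-shot
    batch_size=8,
)

def get_acc(task_result):
    # Metric key names differ across harness versions ("acc,none" vs. "acc").
    for key in ("acc,none", "acc"):
        if key in task_result:
            return task_result[key]
    raise KeyError("accuracy metric not found")

accs = [get_acc(results["results"][t]) for t in TASKS]
print("average zero-shot accuracy:", sum(accs) / len(accs))
```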

Downstream Generative Tasks

For a practical assessment of problem-solving and reasoning, we test SFT and RL fine-tuned models on open-ended generative tasks. The evaluation covers both in-domain and out-of-domain (OOD) challenges:

In-Domain Tasks (Mathematical Reasoning)
  • GSM8K-Platinum: High-quality, grade-school math word problems
  • MATH: Competition-level mathematical problem-solving
Out-of-Domain (OOD) Tasks
  • CRUXEval: Code reasoning and program output prediction
  • BGQA: Logical reasoning with contradictions
  • TabMWP: Mathematical reasoning over tables
  • StrategyQA: Multi-hop commonsense and strategic reasoning

All tasks are evaluated in a zero-shot setting, where models generate full solutions without prior exposure to the specific test items.

Metrics and Decoding Schemes

To thoroughly assess performance, we employ several robust metrics under diverse sampling protocols:

  • Accuracy under four decoding schemes:
    • Pass@1: Deterministic, single output (temperature = 0)
    • Maj@16: Majority vote among 16 stochastic samples (temperature = 1)
    • RM@16: Best of 16 samples, selected by the highest Outcome Reward Model (ORM) score
    • Pass@16: Problem considered solved if any one of 16 samples is correct
  • Correct Ratio: Fraction of correct solutions within a batch of generated responses
  • ORM Score: Scalar reward assigned by a large, off-the-shelf reward model, reflecting the overall quality of generated solutions

Final answers are automatically extracted and compared to ground truth for precise, objective scoring.
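
To make these definitions concrete, here is a minimal sketch of how the per-problem metrics could be computed, assuming answers have already been extracted from the generations and scored by the ORM (the data layout is hypothetical):

```python
# Sketch: per-problem metrics given one greedy generation and 16 sampled solutions.
# `greedy_answer` comes from a temperature-0 generation; each entry of `sampled`
# holds an extracted final answer and its ORM score.
from collections import Counter

def evaluate_problem(greedy_answer, sampled, ground_truth):
    answers = [s["answer"] for s in sampled]
    scores = [s["orm_score"] for s in sampled]

    pass_at_1 = greedy_answer == ground_truth
    maj_at_16 = Counter(answers).most_common(1)[0][0] == ground_truth
    rm_at_16 = answers[max(range(len(scores)), key=scores.__getitem__)] == ground_truth
    pass_at_16 = any(a == ground_truth for a in answers)
    correct_ratio = sum(a == ground_truth for a in answers) / len(answers)

    return dict(pass_at_1=pass_at_1, maj_at_16=maj_at_16, rm_at_16=rm_at_16,
                pass_at_16=pass_at_16, correct_ratio=correct_ratio)
```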

Key Findings

Pre-training

Where Do Returns Diminish?

We rigorously benchmark LMs of different sizes while increasing pre-training tokens far beyond traditional recipes (the Chinchilla-optimal ratio of roughly 20 tokens per parameter, i.e., 20x model size). As visualized in Figure 1, we track accuracy improvements at every token budget. While adding more tokens at first yields clear improvements, beyond a certain threshold extra pre-training becomes less cost-effective.

Takeaway 1. Excessive general-domain pre-training improves upstream performance but with diminishing returns (saturation happens around 80x to 160x model size in our study).
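
To put these token budgets in perspective, the ratios below are simple arithmetic over the pre-training budgets in the model-suite table, using the tokens-per-parameter framing of the Chinchilla recipe:

```python
# Sketch: tokens-per-parameter ratios for the 1B pre-training budgets in the suite.
CHINCHILLA_RATIO = 20  # roughly 20 tokens per parameter

params = 1e9  # 1B model
for tokens_bt in [20, 40, 80, 160, 320]:
    ratio = tokens_bt * 1e9 / params
    print(f"{tokens_bt}BT on a 1B model = {ratio:.0f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.0f}x Chinchilla-optimal)")
# Upstream gains saturate around the 80x-160x regime in our experiments.
```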

When More Isn't Better: Downstream Surprises

Does endlessly scaling pre-training always help with downstream tasks? In Figure 2, we evaluate how different pre-training regimes affect real-world downstream performance, both for tasks similar to the mid-training and post-training data and for novel (OOD) tasks. Remarkably, excessive pre-training does not always improve downstream reasoning and can even harm it.

Takeaway 2. Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks (saturation happens around 80x to 160x model size in our study).

Small vs. Large Models: The Budget-Compute Tradeoff

A common assumption is that larger models always outperform their smaller counterparts. We study whether this holds under limited pre-training resources by directly comparing 1B and 4B models at fixed compute and data limits (Table 2) and examining their downstream results. Under limited resources, a well-tuned small model can be more effective; larger models pull ahead only after a certain data threshold is met.

Takeaway 3. Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.
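
For a rough sense of this budget trade-off, the sketch below uses the common C ≈ 6ND approximation for training compute (an illustrative assumption, not a figure reported in our study) to show how much more data a 1B model can see than a 4B model at equal compute:

```python
# Sketch: comparing pre-training budgets with the common C ~= 6 * N * D approximation
# (N = parameters, D = tokens). Illustrative assumption; not a number from the paper.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

budget = train_flops(4e9, 80e9)       # e.g., a 4B model pre-trained on 80B tokens
tokens_for_1b = budget / (6 * 1e9)    # tokens a 1B model gets for the same compute
print(f"The same compute lets a 1B model see {tokens_for_1b / 1e9:.0f}B tokens")  # 320B
```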

Mid-training

Catastrophic Forgetting in Continued Pre-training

How do models adapt to new domains without forgetting old knowledge? We investigate continued pre-training (CPT) with and without "replay" of general-domain data, tracking upstream performance in Figure 3 and downstream performance in Table 1. A small general-domain replay budget (just 5%) proves critical for balancing new skills with retention of broad knowledge, an easy but powerful trick for practical domain adaptation.

Takeaway 4. Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which can harm both upstream and downstream performance, while incorporating a small replay budget (e.g., 5%) effectively mitigates this degradation.
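
A minimal sketch of one way to implement such a replay mixture at the example level is shown below; the 5% ratio follows our study, while the dataset iterables are hypothetical placeholders.

```python
# Sketch: interleaving domain-specific CPT data with a small general-domain replay
# budget (e.g., 5% of examples). The dataset iterables are placeholders.
import random

REPLAY_RATIO = 0.05  # fraction of examples drawn from general-domain data

def cpt_stream(domain_examples, general_examples, seed=0):
    rng = random.Random(seed)
    domain_it, general_it = iter(domain_examples), iter(general_examples)
    while True:
        source = general_it if rng.random() < REPLAY_RATIO else domain_it
        try:
            yield next(source)
        except StopIteration:
            return  # stop when either stream is exhausted

# Usage (placeholders): mix FineMath-style domain data with FineWeb-Edu replay.
# for example in cpt_stream(finemath_docs, fineweb_edu_docs):
#     train_step(example)
```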

Importance of Continued Pre-training for Post-training

We test the effect of varying CPT data volume on downstream performance, with results shown in Figure 4.


Inadequate domain data risks leaving the model poorly adapted, even after SFT or RL. Investing in rich domain datasets, on the other hand, is essential for strong post-training results.

Takeaway 5. Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.

Our study reveals a sustained upward trend in in-domain accuracy as more CPT tokens are used. This improvement justifies investing in larger domain-specific datasets, especially when downstream reasoning or RL enhancement is desired.

Takeaway 6. As domain-specific CPT data increases, in-domain downstream performance steadily improves, and SFT models benefit more from subsequent RL fine-tuning.

We analyze the effect of high-volume CPT on both in-domain and OOD tasks. Well-designed domain adaptation can create flexible models rather than narrow specialists by strengthening reasoning abilities that transfer to OOD tasks.

Takeaway 7. With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.

Post-training

SFT: Diminishing Returns and Overfitting Risks


We vary both the number of SFT epochs and the dataset size, charting downstream metrics in Figures 5 and 6, and show that more SFT is not always better: overfitting is real and can hurt generalization, so fine-tuning should be done with care and strong validation.

Takeaway 8. Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.

By systematically increasing SFT before RL, we measure the headroom left for further RL gains. When a model is already over-specialized from SFT, RL has little left to improve. Keeping SFT at a balanced level leaves more opportunity for RL to make a difference.

Takeaway 9. Excessive SFT, especially training for too many epochs, can limit further RL improvements.

RL: Diminishing Returns and Practical Solutions


We scale RL epochs and data size (see Figure 7), documenting how performance changes across different regimes. For both ID and OOD tasks, most of RL's benefit comes early. Targeting 4–8 epochs or ~100K examples gives a practical balance between results and cost in our study on 1B models.

Takeaway 10. RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks but with diminishing returns (saturation happens at 4-8 epochs or 50-100K examples in our study).

Does RL make the model reason better, or just sample more confidently? We examine RL's effect on solution diversity and quality and show that, after saturation, RL mainly sharpens the output distribution: it helps the model sample correct answers more often but does not fundamentally improve reasoning, i.e., it does not enable solving problems that could not be solved before.

Takeaway 11. Beyond the saturation regime, RL primarily increases the probability of sampling high-quality rollouts but does not necessarily improve models' fundamental reasoning capabilities.
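
One way to probe this distinction is to compare Pass@1 with Pass@16 using the standard unbiased Pass@k estimator (a generic formula, not code from our pipeline): if RL raises Pass@1 while Pass@16 stays flat, it is sharpening the sampling distribution rather than unlocking problems the model could not solve before.

```python
# Sketch: the standard unbiased pass@k estimator from n samples with c correct.
# Comparing pass@1 vs. pass@16 before and after RL separates "sharper sampling"
# from "new problems solved".
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n, with c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=4, k=16))  # 1.0
```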

SFT/RL Allocation Under Data Constraints

Given a limited downstream data budget, should you spend it on SFT or RL? We experiment with different SFT/RL splits (see Figure 8) on 1B and 4B models to quantify the trade-off between in-domain and OOD performance. Choose the allocation based on your goal: favor SFT for specialists and RL for generalists. This helps tailor the LM to the tasks that matter most.

Takeaway 12. Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.

Validation

ORM Score as Validation Metric

We assess ORM (Outcome Reward Model) scores on their ability to predict success on downstream tasks. ORM scoring provides a reliable proxy for reasoning quality after post-training, helping researchers and engineers monitor and optimize their models when a validation set with ground-truth labels is unavailable or too expensive to collect.

Takeaway 13. Compared to validation loss, the ORM score can be a more reliable unsupervised validation metric for predicting downstream task performance during post-training. Notably, ORM scores from an 8B reward model correlate well with the problem-solving accuracies of 1B models on many downstream reasoning tasks.
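
The sketch below shows one way to use an off-the-shelf ORM as such an unsupervised validation signal; the reward-model repository id and its sequence-classification interface are assumptions for illustration, so adapt them to the reward model you actually use.

```python
# Sketch: scoring generated solutions with an off-the-shelf outcome reward model (ORM)
# as an unsupervised validation signal during post-training. The repo id is a
# placeholder and the single-logit sequence-classification head is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_ID = "example-org/orm-8b"  # placeholder; substitute a real reward model
rm_tokenizer = AutoTokenizer.from_pretrained(RM_ID)
rm = AutoModelForSequenceClassification.from_pretrained(RM_ID)

@torch.no_grad()
def orm_score(question: str, solution: str) -> float:
    inputs = rm_tokenizer(question, solution, return_tensors="pt", truncation=True)
    return rm(**inputs).logits.squeeze().item()  # scalar reward for the solution

# Validation proxy: average ORM score over generations for held-out prompts,
# tracked across post-training checkpoints instead of (or alongside) validation loss.
```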

BibTeX