
EvoLM: In Search of Lost Language Model Training Dynamics

Harvard, Stanford, EPFL, CMU

Abstract

Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

Model Suite

EvoLM consists of over 100 language models, with parameter counts ranging from 0.5B to 4B, trained from scratch with open-source training data and frameworks, enabling controlled experiments that dissect the effects of pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The following table lists all models in EvoLM (BT: billion tokens, FW: FineWeb-Edu, FM: FineMath, ep: epochs).

Model Size | Pre-training (BT) | CPT (BT) | SFT (epochs * examples) | RL (epochs * examples) | Link
0.5B 10 / / / Model
0.5B 20 / / / Model
0.5B 40 / / / Model
0.5B 80 / / / Model
0.5B 160 / / / Model
0.5B 320 / / / Model
1B 20 / / / Model
1B 20 / ep1 100k / Model
1B 20 FW8+FM42 / / Model
1B 20 FW8+FM42 ep1 100k / Model
1B 20 FW8+FM42 ep1 100k ep8 100k Model
1B 20 FM10 / / Model
1B 20 FM30 / / Model
1B 20 FM50 / / Model
1B 40 / / / Model
1B 40 / ep1 100k / Model
1B 40 FW8+FM42 / / Model
1B 40 FW8+FM42 ep1 100k / Model
1B 40 FW8+FM42 ep1 100k ep8 100k Model
1B 40 FM10 / / Model
1B 40 FM30 / / Model
1B 40 FM50 / / Model
1B 80 / / / Model
1B 80 / ep1 100k / Model
1B 80 FW8+FM42 / / Model
1B 80 FW8+FM42 ep1 100k / Model
1B 80 FW8+FM42 ep1 100k ep8 100k Model
1B 80 FM10 / / Model
1B 80 FM30 / / Model
1B 80 FM50 / / Model
1B 160 / / / Model
1B 160 / ep1 100k / Model
1B 160 / ep1 100k ep8 100k Model
1B 160 FW8+FM2 / / Model
1B 160 FW8+FM2 ep1 100k / Model
1B 160 FW8+FM2 ep1 100k ep8 100k Model
1B 160 FW8+FM12 / / Model
1B 160 FW8+FM12 ep1 100k / Model
1B 160 FW8+FM12 ep1 100k ep8 100k Model
1B 160 FW8+FM22 / / Model
1B 160 FW8+FM22 ep1 100k / Model
1B 160 FW8+FM22 ep1 100k ep8 100k Model
1B 160 FW8+FM32 / / Model
1B 160 FW8+FM32 ep1 100k / Model
1B 160 FW8+FM32 ep1 100k ep8 100k Model
1B 160 FW8+FM42 / / Model
1B 160 FW8+FM42 ep1 100k / Model
1B 160 FW8+FM42 ep1 100k ep1 100k Model
1B 160 FW8+FM42 ep1 100k ep2 100k Model
1B 160 FW8+FM42 ep1 100k ep4 100k Model
1B 160 FW8+FM42 ep1 100k ep16 100k Model
1B 160 FW8+FM42 ep1 100k ep8 100k Model
1B 160 FW8+FM42 ep1 100k ep8 200k Model
1B 160 FW8+FM42 ep1 100k ep8 300k Model
1B 160 FW8+FM42 ep1 100k ep8 400k Model
1B 160 FW8+FM42 ep1 100k ep32 100k Model
1B 160 FW8+FM42 ep1 200k / Model
1B 160 FW8+FM42 ep1 200k ep8 100k Model
1B 160 FW8+FM42 ep1 300k / Model
1B 160 FW8+FM42 ep1 300k ep8 100k Model
1B 160 FW8+FM42 ep1 400k / Model
1B 160 FW8+FM42 ep1 400k ep8 100k Model
1B 160 FW8+FM42 ep2 100k / Model
1B 160 FW8+FM42 ep2 100k ep8 100k Model
1B 160 FW8+FM42 ep4 100k / Model
1B 160 FW8+FM42 ep4 100k ep8 100k Model
1B 160 FW8+FM42 ep8 100k / Model
1B 160 FW8+FM42 ep8 100k ep8 100k Model
1B 160 FW8+FM42 ep16 100k / Model
1B 160 FW8+FM42 ep16 100k ep8 100k Model
1B 160 FW8+FM42 ep32 100k / Model
1B 160 FW8+FM42 ep32 100k ep8 100k Model
1B 160 FW1.6+FM48.4 / / Model
1B 160 FW16+FM34 / / Model
1B 160 FM10 / / Model
1B 160 FM20 / / Model
1B 160 FM30 / / Model
1B 160 FM40 / / Model
1B 160 FM50 / / Model
1B 320 / / / Model
1B 320 / ep1 100k / Model
1B 320 FW8+FM42 / / Model
1B 320 FW8+FM42 ep1 100k / Model
1B 320 FW8+FM42 ep1 100k ep8 100k Model
2B 40 / / / coming soon...
2B 80 / / / coming soon...
2B 160 / / / coming soon...
2B 320 / / / coming soon...
4B 80 / / / Model
4B 80 FW8+FM42 / / Model
4B 80 FW8+FM42 ep1 100k / Model
4B 80 FW8+FM42 ep1 100k ep8 100k Model
4B 160 / / / Model
4B 160 / ep1 100k / Model
4B 160 / ep1 100k ep8 100k Model
4B 160 FW8+FM2 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM12 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM22 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM32 ep1 100k ep8 100k coming soon...
4B 160 FW8+FM42 / / Model
4B 160 FW8+FM42 ep1 100k / Model
4B 160 FW8+FM42 ep1 100k ep1 100k Model
4B 160 FW8+FM42 ep1 100k ep2 100k Model
4B 160 FW8+FM42 ep1 100k ep4 100k Model
4B 160 FW8+FM42 ep1 100k ep16 100k Model
4B 160 FW8+FM42 ep1 100k ep8 100k Model
4B 160 FW8+FM42 ep1 100k ep8 200k Model
4B 160 FW8+FM42 ep1 100k ep8 300k Model
4B 160 FW8+FM42 ep1 100k ep8 400k Model
4B 160 FW8+FM42 ep1 100k ep32 100k Model
4B 160 FW8+FM42 ep1 200k / Model
4B 160 FW8+FM42 ep1 200k ep8 100k Model
4B 160 FW8+FM42 ep1 300k / Model
4B 160 FW8+FM42 ep1 300k ep8 100k Model
4B 160 FW8+FM42 ep1 400k / Model
4B 160 FW8+FM42 ep1 400k ep8 100k Model
4B 160 FW8+FM42 ep2 100k / Model
4B 160 FW8+FM42 ep2 100k ep8 100k Model
4B 160 FW8+FM42 ep4 100k / Model
4B 160 FW8+FM42 ep4 100k ep8 100k Model
4B 160 FW8+FM42 ep8 100k / Model
4B 160 FW8+FM42 ep8 100k ep8 100k Model
4B 160 FW8+FM42 ep16 100k / Model
4B 160 FW8+FM42 ep16 100k ep8 100k Model
4B 160 FW8+FM42 ep32 100k / Model
4B 160 FW8+FM42 ep32 100k ep8 100k Model
4B 320 / / / Model
4B 320 FW8+FM42 / / Model
4B 320 FW8+FM42 ep1 100k / Model
4B 320 FW8+FM42 ep1 100k ep8 100k Model
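
If the released checkpoints follow the standard Hugging Face format (an assumption on our part; check each model page), they can be loaded as ordinary causal LMs. The sketch below uses a placeholder repository id, not an actual EvoLM repo name, so substitute the link of the model you want from the table above.

```python
# Minimal sketch: loading a released EvoLM checkpoint with Hugging Face Transformers.
# NOTE: "evolm/1B-160BT-example" is a placeholder id; use the actual repository
# linked in the table above.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "evolm/1B-160BT-example"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Question: What is 12 * 7? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```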

Evaluation Protocol

To ensure a systematic and transparent analysis of language model (LM) capabilities, we establish a rigorous evaluation protocol that spans both upstream (language modeling) and downstream (problem-solving) tasks. This setup enables consistent benchmarking across all stages of the EvoLM training pipeline.

Upstream Cloze Tasks

We evaluate pre-trained and continually pre-trained models using a suite of cloze-style language modeling benchmarks, which test next-token prediction without requiring conversational abilities. The selected datasets are widely used for assessing general reasoning and language understanding:

  • HellaSwag: Commonsense completion
  • Winogrande: Coreference reasoning
  • PIQA: Physical commonsense reasoning
  • OBQA: Open book question answering
  • ARC-Easy/Challenge: Science and multi-step reasoning

We report average zero-shot accuracy across these benchmarks, providing a high-level view of each model's raw language modeling strength.
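
As one concrete way to reproduce this kind of aggregate number, the sketch below uses EleutherAI's lm-evaluation-harness; this tooling choice and the placeholder model id are assumptions for illustration, and our own evaluation pipeline may differ.

```python
# Sketch: average zero-shot accuracy over the upstream cloze benchmarks,
# assuming EleutherAI's lm-evaluation-harness (the project's pipeline may differ).
import lm_eval

TASKS = ["hellaswag", "winogrande", "piqa", "openbookqa", "arc_easy", "arc_challenge"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=evolm/1B-160BT-example",  # placeholder repo id
    tasks=TASKS,
    num_fewshot=0,                                    # zero-shot
    batch_size=8,
)

def get_acc(task_result):
    # Metric key names differ across harness versions ("acc,none" vs. "acc").
    for key in ("acc,none", "acc"):
        if key in task_result:
            return task_result[key]
    raise KeyError("accuracy metric not found")

accs = [get_acc(results["results"][t]) for t in TASKS]
print("average zero-shot accuracy:", sum(accs) / len(accs))
```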

Downstream Generative Tasks

For a practical assessment of problem-solving and reasoning, we test SFT and RL fine-tuned models on open-ended generative tasks. The evaluation covers both in-domain and out-of-domain (OOD) challenges:

In-Domain Tasks (Mathematical Reasoning)
  • GSM8K-Platinum: High-quality, grade-school math word problems
  • MATH: Competition-level mathematical problem-solving
Out-of-Domain (OOD) Tasks
  • CRUXEval: Code reasoning and program output prediction
  • BGQA: Logical reasoning with contradictions
  • TabMWP: Mathematical reasoning over tables
  • StrategyQA: Multi-hop commonsense and strategic reasoning

All tasks are evaluated in a zero-shot setting, where models generate full solutions without prior exposure to the specific test items.

Metrics and Decoding Schemes

To thoroughly assess performance, we employ several robust metrics under diverse sampling protocols:

  • Accuracy under four decoding schemes:
    • Pass@1: Deterministic, single output (temperature = 0)
    • Maj@16: Majority vote among 16 stochastic samples (temperature = 1)
    • RM@16: Best of 16 samples, selected by the highest Outcome Reward Model (ORM) score
    • Pass@16: Problem considered solved if any one of 16 samples is correct
  • Correct Ratio: Fraction of correct solutions within a batch of generated responses
  • ORM Score: Scalar reward assigned by a large, off-the-shelf reward model, reflecting the overall quality of generated solutions

Final answers are automatically extracted and compared to ground truth for precise, objective scoring.
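
To make these definitions concrete, here is a minimal sketch of how the per-problem metrics could be computed, assuming answers have already been extracted from the generations and scored by the ORM (the data layout is hypothetical):

```python
# Sketch: per-problem metrics given one greedy generation and 16 sampled solutions.
# `greedy_answer` comes from a temperature-0 generation; each entry of `sampled`
# holds an extracted final answer and its ORM score.
from collections import Counter

def evaluate_problem(greedy_answer, sampled, ground_truth):
    answers = [s["answer"] for s in sampled]
    scores = [s["orm_score"] for s in sampled]

    pass_at_1 = greedy_answer == ground_truth
    maj_at_16 = Counter(answers).most_common(1)[0][0] == ground_truth
    rm_at_16 = answers[max(range(len(scores)), key=scores.__getitem__)] == ground_truth
    pass_at_16 = any(a == ground_truth for a in answers)
    correct_ratio = sum(a == ground_truth for a in answers) / len(answers)

    return dict(pass_at_1=pass_at_1, maj_at_16=maj_at_16, rm_at_16=rm_at_16,
                pass_at_16=pass_at_16, correct_ratio=correct_ratio)
```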

Key Findings

Pre-training

Where Do Returns Diminish?

We rigorously benchmark LMs of different sizes while increasing pre-training tokens far beyond traditional recipes (the Chinchilla-optimal ratio of roughly 20 tokens per parameter, i.e., 20x model size). As visualized in Figure 1, we track accuracy improvements at every token budget. While adding more tokens at first yields clear improvements, beyond a certain threshold extra pre-training becomes less cost-effective.

Takeaway 1. Excessive general-domain pre-training improves upstream performance but with diminishing returns (saturation happens around 80x to 160x model size in our study).
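
To put these token budgets in perspective, the ratios below are simple arithmetic over the pre-training budgets in the model-suite table, using the tokens-per-parameter framing of the Chinchilla recipe:

```python
# Sketch: tokens-per-parameter ratios for the 1B pre-training budgets in the suite.
CHINCHILLA_RATIO = 20  # roughly 20 tokens per parameter

params = 1e9  # 1B model
for tokens_bt in [20, 40, 80, 160, 320]:
    ratio = tokens_bt * 1e9 / params
    print(f"{tokens_bt}BT on a 1B model = {ratio:.0f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.0f}x Chinchilla-optimal)")
# Upstream gains saturate around the 80x-160x regime in our experiments.
```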

When More Isn't Better: Downstream Surprises

Does endlessly scaling pre-training always help with downstream tasks? In Figure 2, we evaluate how different pre-training regimes affect real-world downstream performance, both for tasks similar to the mid-training and post-training data and for novel (OOD) tasks. Remarkably, excessive pre-training does not always improve downstream reasoning and can even harm it.

Takeaway 2. Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks (saturation happens around 80x to 160x model size in our study).

Small vs. Large Models: The Budget-Compute Tradeoff

A common assumption is that larger models always outperform their smaller counterparts. We study whether this holds under limited pre-training resources by directly comparing 1B and 4B models at fixed compute and data limits (Table 2) and examining their downstream results. Under limited resources, a well-tuned small model can be more effective; larger models pull ahead only after a certain data threshold is met.

Takeaway 3. Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.
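
For a rough sense of this budget trade-off, the sketch below uses the common C ≈ 6ND approximation for training compute (an illustrative assumption, not a figure reported in our study) to show how much more data a 1B model can see than a 4B model at equal compute:

```python
# Sketch: comparing pre-training budgets with the common C ~= 6 * N * D approximation
# (N = parameters, D = tokens). Illustrative assumption; not a number from the paper.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

budget = train_flops(4e9, 80e9)       # e.g., a 4B model pre-trained on 80B tokens
tokens_for_1b = budget / (6 * 1e9)    # tokens a 1B model gets for the same compute
print(f"The same compute lets a 1B model see {tokens_for_1b / 1e9:.0f}B tokens")  # 320B
```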

Mid-training

Catastrophic Forgetting in Continued Pre-training

How do models adapt to new domains without forgetting old knowledge? We investigate continued pre-training (CPT) with and without "replay" of general-domain data, tracking upstream performance in Figure 3 and downstream performance in Table 1. A small general-domain replay budget (just 5%) proves critical for balancing new skills with retention of broad knowledge, an easy but powerful trick for practical domain adaptation.

Takeaway 4. Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which can harm both upstream and downstream performance, while incorporating a small replay budget (e.g., 5%) effectively mitigates this degradation.
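
A minimal sketch of one way to implement such a replay mixture at the example level is shown below; the 5% ratio follows our study, while the dataset iterables are hypothetical placeholders.

```python
# Sketch: interleaving domain-specific CPT data with a small general-domain replay
# budget (e.g., 5% of examples). The dataset iterables are placeholders.
import random

REPLAY_RATIO = 0.05  # fraction of examples drawn from general-domain data

def cpt_stream(domain_examples, general_examples, seed=0):
    rng = random.Random(seed)
    domain_it, general_it = iter(domain_examples), iter(general_examples)
    while True:
        source = general_it if rng.random() < REPLAY_RATIO else domain_it
        try:
            yield next(source)
        except StopIteration:
            return  # stop when either stream is exhausted

# Usage (placeholders): mix FineMath-style domain data with FineWeb-Edu replay.
# for example in cpt_stream(finemath_docs, fineweb_edu_docs):
#     train_step(example)
```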

Importance of Continued Pre-training for Post-training

We test the effect of varying CPT data volume on downstream performance, with results shown in Figure 4.


Inadequate domain data risks leaving the model poorly adapted, even after SFT or RL. Investing in rich domain datasets, on the other hand, is essential for strong post-training results.

Takeaway 5. Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.

Our study reveals a sustained upward trend in in-domain accuracy as more CPT tokens are used. This improvement justifies investing in larger domain-specific datasets, especially when downstream reasoning or RL enhancement is desired.

Takeaway 6. As domain-specific CPT data increases, in-domain downstream performance steadily improves, and SFT models benefit more from subsequent RL fine-tuning.

We analyze the effect of high-volume CPT on both in-domain and OOD tasks. Well-designed domain adaptation can create flexible models rather than narrow specialists by strengthening reasoning abilities that transfer to OOD tasks.

Takeaway 7. With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.

Post-training

SFT: Diminishing Returns and Overfitting Risks


We vary both the number of SFT epochs and the dataset size, charting downstream metrics in Figures 5 and 6, and show that more SFT is not always better: overfitting is real and can hurt generalization, so fine-tuning should be done with care and strong validation.

Takeaway 8. Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.

By systematically increasing SFT before RL, we measure the headroom left for further RL gains. When a model is already over-specialized from SFT, RL has little left to improve. Keeping SFT at a balanced level leaves more opportunity for RL to make a difference.

Takeaway 9. Excessive SFT, especially training for too many epochs, can limit further RL improvements.

RL: Diminishing Returns and Practical Solutions


We scale RL epochs and data size (see Figure 7), documenting how performance changes across different regimes. For both ID and OOD tasks, most of RL's benefit comes early. Targeting 4–8 epochs or ~100K examples gives a practical balance between results and cost in our study on 1B models.

Takeaway 10. RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks but with diminishing returns (saturation happens at 4-8 epochs or 50-100K examples in our study).

Does RL make the model reason better, or just sample more confidently? We examine RL's effect on solution diversity and quality and show that, after saturation, RL mainly sharpens the output distribution: it helps the model sample correct answers more often but does not fundamentally improve reasoning, i.e., it does not enable solving problems that could not be solved before.

Takeaway 11. Beyond the saturation regime, RL primarily increases the probability of sampling high-quality rollouts but does not necessarily improve models' fundamental reasoning capabilities.
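
One way to probe this distinction is to compare Pass@1 with Pass@16 using the standard unbiased Pass@k estimator (a generic formula, not code from our pipeline): if RL raises Pass@1 while Pass@16 stays flat, it is sharpening the sampling distribution rather than unlocking problems the model could not solve before.

```python
# Sketch: the standard unbiased pass@k estimator from n samples with c correct.
# Comparing pass@1 vs. pass@16 before and after RL separates "sharper sampling"
# from "new problems solved".
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n, with c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=4, k=16))  # 1.0
```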

SFT/RL Allocation Under Data Constraints

Given a limited downstream data budget, should you spend it on SFT or RL? We experiment with different SFT/RL splits (see Figure 8) on 1B and 4B models to quantify the trade-off between in-domain and OOD performance. Choose the allocation based on your goal: favor SFT for specialists and RL for generalists. This helps tailor the LM to the tasks that matter most.

Takeaway 12. Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.

Validation

ORM Score as Validation Metric

We assess ORM (Outcome Reward Model) scores on their ability to predict success on downstream tasks. ORM scoring provides a reliable proxy for reasoning quality after post-training, helping researchers and engineers monitor and optimize their models when a validation set with ground-truth labels is unavailable or too expensive to collect.

Takeaway 13. Compared to validation loss, the ORM score can be a more reliable unsupervised validation metric for predicting downstream task performance during post-training. Notably, ORM scores from an 8B reward model correlate well with the problem-solving accuracies of 1B models on many downstream reasoning tasks.
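
The sketch below shows one way to use an off-the-shelf ORM as such an unsupervised validation signal; the reward-model repository id and its sequence-classification interface are assumptions for illustration, so adapt them to the reward model you actually use.

```python
# Sketch: scoring generated solutions with an off-the-shelf outcome reward model (ORM)
# as an unsupervised validation signal during post-training. The repo id is a
# placeholder and the single-logit sequence-classification head is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_ID = "example-org/orm-8b"  # placeholder; substitute a real reward model
rm_tokenizer = AutoTokenizer.from_pretrained(RM_ID)
rm = AutoModelForSequenceClassification.from_pretrained(RM_ID)

@torch.no_grad()
def orm_score(question: str, solution: str) -> float:
    inputs = rm_tokenizer(question, solution, return_tensors="pt", truncation=True)
    return rm(**inputs).logits.squeeze().item()  # scalar reward for the solution

# Validation proxy: average ORM score over generations for held-out prompts,
# tracked across post-training checkpoints instead of (or alongside) validation loss.
```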

BibTeX