We trained 14B and 7B reasoning models that surpass DeepSeek R1 models twice their size in math olympiads

Improving DeepSeek R1 in Math

I joined a team and we trained 7B and 14B math reasoning models based on DeepSeek-R1-Distill using SFT and GRPO. Our 14B model achieved 75.8% Maj@32 on AIME’25 (+8.7 points over the base model), and our 7B model reached 65.8% Maj@32 (+7.5 points). Here is what I’ve learned.
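For readers unfamiliar with the metric: Maj@32 samples 32 answers per problem and scores the problem as solved if the majority-vote answer matches the reference. A minimal sketch (the function name and sample data are hypothetical, not from our eval harness):

```python
from collections import Counter

def maj_at_k(samples: list[str], reference: str) -> bool:
    """Maj@k: take the most frequent of k sampled final answers
    and count the problem as solved if it equals the reference."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference

# Hypothetical example: 32 sampled final answers for one AIME problem.
samples = ["204"] * 18 + ["102"] * 10 + ["408"] * 4
print(maj_at_k(samples, "204"))  # majority answer "204" matches -> True
```

Maj@32 is thus more forgiving than greedy accuracy but stricter than Pass@32, since a single lucky sample is not enough.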

Changes introduced by the RL process are low-rank

Are Reasoning Abilities Low-Rank?

Turns out, the RL training process of DeepScaleR-1.5B introduced only low-rank changes to its base model, DeepSeek-R1-Distill-1.5B.
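One way to check this claim yourself is to diff corresponding weight matrices between the base and RL-trained checkpoints and inspect the rank of the update. A toy sketch with NumPy (the matrices here are synthetic stand-ins, not the actual DeepScaleR checkpoints: the "RL update" is constructed as a rank-8 perturbation of a 512x512 base matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one weight matrix before/after RL.
# The update is built as an outer product, so it has rank exactly 8.
d, r = 512, 8
W_base = rng.standard_normal((d, d))
delta = (rng.standard_normal((d, r)) @ rng.standard_normal((r, d))) * 0.01
W_rl = W_base + delta

# Numerical rank of the weight update: tiny relative to d = 512.
rank = int(np.linalg.matrix_rank(W_rl - W_base))
print(rank)  # 8
```

With real checkpoints you would run the same diff per layer (e.g. over attention and MLP projection matrices) and look at the singular value spectrum rather than a hard rank cutoff.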