NOVER (NO-Verifier Reinforcement Learning) enables incentive training on any text-to-text task without external verifiers. It uses the policy model's own reasoning perplexity to estimate the reward.
• Your LLM is secretly a verifier.
• Your LLM no longer reasons only on easy-to-verify tasks.
• Your LLM can reason on ANY task.
• Your LLM can be incentivized to do more than reasoning.
NOVER enables training large reasoning models on any text data and any task.
NO verifiers, reward models, or rules needed: just the ground-truth answer and the policy model itself.
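To make "reasoning perplexity as reward" concrete, below is a minimal sketch (not the official NOVER implementation; the model choice and function name are illustrative). It scores a sampled reasoning trace by the perplexity the policy model assigns to the ground-truth answer when conditioned on the prompt plus that trace: the lower the perplexity, the more the reasoning made the answer derivable, and the higher the reward.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: the policy model doubles as its own "verifier" by
# measuring how predictable the ground-truth answer becomes after a
# sampled reasoning trace. No external verifier or reward model is used.
model_name = "Qwen/Qwen2.5-3B"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def reasoning_perplexity(prompt: str, reasoning: str, answer: str) -> float:
    """Perplexity of the ground-truth `answer` given `prompt + reasoning`."""
    context_ids = tokenizer(prompt + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    logits = model(input_ids).logits
    # Position t predicts token t+1; keep only the answer positions.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    n = answer_ids.shape[1]
    nll = torch.nn.functional.cross_entropy(
        shift_logits[:, -n:, :].reshape(-1, shift_logits.size(-1)),
        shift_labels[:, -n:].reshape(-1),
    )
    return torch.exp(nll).item()  # perplexity = exp(mean NLL over answer tokens)

# A trace that makes the ground truth easier to predict scores lower
# perplexity and therefore earns a higher reward, e.g. r = -log(ppl).
```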
General Reasoning: ⚛️ physics • ⚖️ law • 🏥 medical • 💰 finance
Creative Tasks: 🎨 creative writing
Social Intelligence: 🧠 theory of mind • 😊 emotion detection • 🤝 social reasoning
Natural Language Generation: 🌍 translation • 📚 summarization
[Figure: comparison of training paradigms. SFT memorizes input-output patterns; RLHF trains a reward model to give preference feedback; RLVR applies rule-based rewards for end-to-end outcome RL; NOVER uses reasoning perplexity as the reward, enabling reasoning on any task.]
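Where a rule-based reward would normally plug into the outcome-RL loop, the perplexity signal can drive a group-relative (GRPO-style) update instead. A hypothetical sketch, with the group size and normalization purely illustrative:

```python
import torch

# Hypothetical sketch: sample a group of reasoning traces per prompt,
# reward each by the negative log-perplexity it induces on the
# ground-truth answer, then normalize rewards within the group so the
# best traces get positive advantages and the worst get negative ones.
def group_relative_advantages(perplexities: torch.Tensor) -> torch.Tensor:
    """perplexities: shape [group_size], one value per sampled trace."""
    rewards = -torch.log(perplexities)            # lower ppl -> higher reward
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four traces sampled for one prompt.
ppl = torch.tensor([3.2, 1.8, 7.5, 2.1])
print(group_relative_advantages(ppl))  # the 1.8-ppl trace gets the top advantage
```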
Method | NR | GT | WI | SGN | EB | TB | OPUS |
---|---|---|---|---|---|---|---|
**Qwen2.5-3B** | | | | | | | |
Base | 21.80% | 43.10% | 18.40% | 18.70% | 32.03% | 46.79% | 16.70% |
+ CoT | 24.40% | 48.90% | 24.20% | 14.76% | 28.12% | 51.23% | 1.40% |
+ SFT | 27.00% | 36.20% | 27.30% | 20.08% | 36.72% | 48.66% | 17.30% |
+ NOVER | 28.60% | 60.30% | 28.10% | 41.64% | 38.28% | 57.88% | 20.70% |
**Qwen2.5-7B** | | | | | | | |
Base | 31.80% | 48.50% | 20.70% | 24.21% | 28.91% | 44.22% | 19.30% |
+ CoT | 31.20% | 57.60% | 29.20% | 33.46% | 38.28% | 50.99% | 1.60% |
+ SFT | 27.50% | 45.20% | 33.50% | 37.85% | 47.66% | 57.06% | 23.30% |
+ NOVER | 38.20% | 61.80% | 36.60% | 50.79% | 49.22% | 67.79% | 26.80% |
**Other Baselines** | | | | | | | |
Qwen2.5-3B-Instruct | 27.10% | 50.00% | 31.50% | 21.25% | 40.62% | 58.69% | 19.90% |
Qwen2.5-7B-Instruct | 29.90% | 56.20% | 35.60% | 67.72% | 46.88% | 65.23% | 23.50% |
R1-Distill-Qwen-7B | 41.00% | 60.20% | 38.00% | 40.16% | 35.16% | 54.61% | 8.20% |
Model Type | Model | Method | NR | GT | WI
---|---|---|---|---|---
Base | Qwen2.5-3B | Base | 21.80% | 43.10% | 18.40%
Base | Qwen2.5-3B | + CoT | 24.40% | 48.90% | 24.20%
Base | Qwen2.5-3B | + SFT | 27.00% | 36.20% | 27.30%
Base | Qwen2.5-3B | + NOVER | 28.60% | 60.30% | 28.10%
Base | Qwen2.5-7B | Base | 31.80% | 48.50% | 20.70%
Base | Qwen2.5-7B | + CoT | 31.20% | 57.60% | 29.20%
Base | Qwen2.5-7B | + SFT | 27.50% | 45.20% | 33.50%
Base | Qwen2.5-7B | + NOVER | 38.20% | 61.80% | 36.60%
Instruct | Llama-3.1-8B | Base | 34.20% | 36.70% | 29.90%
Instruct | Llama-3.1-8B | + CoT | 28.10% | 35.10% | 30.00%
Instruct | Llama-3.1-8B | + SFT | 23.60% | 23.40% | 34.50%
Instruct | Llama-3.1-8B | + NOVER | 40.70% | 41.50% | 34.00%
Instruct | Mistral-7B | Base | 33.00% | 17.80% | 27.00%
Instruct | Mistral-7B | + CoT | 29.20% | 18.60% | 27.10%
Instruct | Mistral-7B | + SFT | 22.50% | 20.70% | 27.80%
Instruct | Mistral-7B | + NOVER | 32.20% | 21.90% | 29.30%
@article{liu2025nover,
title={NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning},
author={Liu, Wei and Qi, Siya and Wang, Xinyu and Qian, Chen and Du, Yali and He, Yulan},
journal={arXiv preprint arXiv:2505.16022},
year={2025}
}