NOVER

Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Wei Liu¹ • Siya Qi¹ • Xinyu Wang¹ • Chen Qian² • Yali Du¹·³ • Yulan He¹·³
¹King's College London • ²Shanghai Jiao Tong University • ³The Alan Turing Institute

TL;DR

NOVER (NO-VErifier Reinforcement learning) enables incentive training on any text-to-text task without external verifiers. It uses the policy model's reasoning perplexity to estimate the reward.

• Your LLM is secretly a verifier.
• Your LLM no longer has to reason only on easy-to-verify tasks.
• Your LLM can reason on ANY task.
• Your LLM can be incentivized to do more than reason.

NOVER Framework Overview
NOVER extends RLVR to any text-to-text task, beyond easy-to-verify math and coding problems.

Incentivize Reasoning on Any Task

NOVER enables training large reasoning models on any text data and any task.
No verifiers, reward models, or rules are needed: only the ground-truth answer and the policy model itself.
General Reasoning: ⚛️ physics • ⚖️ law • 🏥 medical • 💰 finance
Creative Tasks: 🎨 creative writing
Social Intelligence: 🧠 theory of mind • 😊 emotion detection • 🤝 social reasoning
Natural Language Generation: 🌍 translation • 📚 summarization


NOVER Methodology


Comparison of training paradigms:

• SFT: Memorize input-output patterns
• RLHF: Train a reward model to give preference feedback
• RLVR: Rule-based reward; end-to-end outcome RL
• NOVER: Reasoning perplexity as reward; reason on any task

Reasoning Perplexity
$$P_r(p, t, g) = \exp\left(-\frac{\sum_{i=1}^{|g|} \log \pi_{p}(g_i \mid p, t, g_{<i})}{|g| \cdot N(|t|)}\right)$$
The perplexity of the policy model on the ground-truth answer, conditioned on the reasoning trajectory, is used as a reward proxy.
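Below is a minimal sketch of how this reward proxy could be computed with a Hugging Face causal LM. The model name and the normalization term N(|t|) (shown as a placeholder constant) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the reasoning-perplexity reward proxy, assuming a Hugging Face causal LM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
proxy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

@torch.no_grad()
def reasoning_perplexity(prompt: str, reasoning: str, ground_truth: str) -> float:
    """P_r(p, t, g): perplexity of the ground truth g given prompt p and reasoning trajectory t."""
    context_ids = tokenizer(prompt + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(ground_truth, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    logits = proxy(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # token i+1 predicted from its prefix
    token_logp = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    answer_len = answer_ids.shape[1]
    answer_logp = token_logp[:, -answer_len:].sum()       # sum_i log pi_p(g_i | p, t, g_<i)

    n_t = 1.0  # placeholder for N(|t|), the reasoning-length normalization defined in the paper
    return torch.exp(-answer_logp / (answer_len * n_t)).item()
```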
Rewards
$$R_{\mathrm{total}} = w_{\mathrm{f}} R_{\mathrm{f}} + \mathbb{I}(R_{\mathrm{f}} = 1) \cdot (w_{\mathrm{r}} R_{\mathrm{r}} + w_{\mathrm{e}} R_{\mathrm{e}})$$
Combined reward incorporating format ($R_{\mathrm{f}}$), reasoning ($R_{\mathrm{r}}$), and efficiency ($R_{\mathrm{e}}$) components; the indicator gates the reasoning and efficiency rewards on a correct format.
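As a minimal sketch, assuming $R_{\mathrm{f}}$ is a binary format reward and the weights are tunable hyperparameters (the defaults below are illustrative, not the paper's settings), the gating could be implemented as:

```python
# Sketch of R_total = w_f * R_f + 1[R_f = 1] * (w_r * R_r + w_e * R_e).
# Weight defaults are illustrative placeholders, not the paper's settings.
def total_reward(r_f: float, r_r: float, r_e: float,
                 w_f: float = 1.0, w_r: float = 1.0, w_e: float = 1.0) -> float:
    gated = (w_r * r_r + w_e * r_e) if r_f == 1.0 else 0.0  # indicator gate on correct format
    return w_f * r_f + gated
```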
Policy-Proxy Synchronization
$$\pi_{\mathrm{p}} \leftarrow \alpha \cdot \pi_{\mathrm{p}} + (1-\alpha) \cdot \pi_{\theta}$$
Smooth synchronization between the policy and the proxy ensures stable training with limited resources.
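A minimal sketch of this step as an exponential moving average over model parameters; the update frequency and the value of alpha are assumptions here:

```python
import torch

@torch.no_grad()
def sync_proxy(proxy: torch.nn.Module, policy: torch.nn.Module, alpha: float) -> None:
    # pi_p <- alpha * pi_p + (1 - alpha) * pi_theta, applied parameter-wise.
    for p_proxy, p_policy in zip(proxy.parameters(), policy.parameters()):
        p_proxy.mul_(alpha).add_(p_policy, alpha=1.0 - alpha)
```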

Experimental Results

Overall Results on the NOVEReason Dataset

| Method | NR | GT | WI | SGN | EB | TB | OPUS |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | | | | | | | |
| Base | 21.80% | 43.10% | 18.40% | 18.70% | 32.03% | 46.79% | 16.70% |
| + CoT | 24.40% | 48.90% | 24.20% | 14.76% | 28.12% | 51.23% | 1.40% |
| + SFT | 27.00% | 36.20% | 27.30% | 20.08% | 36.72% | 48.66% | 17.30% |
| + NOVER | 28.60% | 60.30% | 28.10% | 41.64% | 38.28% | 57.88% | 20.70% |
| Qwen2.5-7B | | | | | | | |
| Base | 31.80% | 48.50% | 20.70% | 24.21% | 28.91% | 44.22% | 19.30% |
| + CoT | 31.20% | 57.60% | 29.20% | 33.46% | 38.28% | 50.99% | 1.60% |
| + SFT | 27.50% | 45.20% | 33.50% | 37.85% | 47.66% | 57.06% | 23.30% |
| + NOVER | 38.20% | 61.80% | 36.60% | 50.79% | 49.22% | 67.79% | 26.80% |
| Other Baselines | | | | | | | |
| Qwen2.5-3B-Instruct | 27.10% | 50.00% | 31.50% | 21.25% | 40.62% | 58.69% | 19.90% |
| Qwen2.5-7B-Instruct | 29.90% | 56.20% | 35.60% | 67.72% | 46.88% | 65.23% | 23.50% |
| R1-Distill-Qwen-7B | 41.00% | 60.20% | 38.00% | 40.16% | 35.16% | 54.61% | 8.20% |
NR: Natural Reasoning, GT: General Thoughts-430k, WI: WebInstruct, SGN: SS-GEN, EB: EmoBench, TB: TomBench, OPUS: OPUS-BOOK-TRANSLATION.

General Reasoning with Different Backends

| Model Type | Model | Method | NR | GT | WI |
|---|---|---|---|---|---|
| Base | Qwen2.5-3B | Base | 21.80% | 43.10% | 18.40% |
| | | + CoT | 24.40% | 48.90% | 24.20% |
| | | + SFT | 27.00% | 36.20% | 27.30% |
| | | + NOVER | 28.60% | 60.30% | 28.10% |
| | Qwen2.5-7B | Base | 31.80% | 48.50% | 20.70% |
| | | + CoT | 31.20% | 57.60% | 29.20% |
| | | + SFT | 27.50% | 45.20% | 33.50% |
| | | + NOVER | 38.20% | 61.80% | 36.60% |
| Instruct | Llama-3.1-8B | Base | 34.20% | 36.70% | 29.90% |
| | | + CoT | 28.10% | 35.10% | 30.00% |
| | | + SFT | 23.60% | 23.40% | 34.50% |
| | | + NOVER | 40.70% | 41.50% | 34.00% |
| | Mistral-7B | Base | 33.00% | 17.80% | 27.00% |
| | | + CoT | 29.20% | 18.60% | 27.10% |
| | | + SFT | 22.50% | 20.70% | 27.80% |
| | | + NOVER | 32.20% | 21.90% | 29.30% |
NR: Natural Reasoning, GT: General Thoughts-430k, WI: WebInstruct.

Key Takeaways

  • NOVER trains successfully on both pretrained and instruct models, with larger gains on stronger base models.
  • Despite the free-form nature of the answers, NOVER still prefers objective solutions over subjective ones.
  • On general reasoning, NOVER inherits the base model's capability boundaries, as also observed in math reasoning; it struggles on false-premise tasks such as FANToM.
  • NOVER's design prevents reward hacking, avoiding issues such as reasoning explosion and collapse.
  • Unlike closed-source or verifier-based rewards, which suffer from cold-start and hacking risks, NOVER remains stable.
  • Its dense reward signals allow greater error tolerance and encourage diverse reasoning patterns.

Inverse Incentive Training

Reward the Outcome, Incentivize Process
Write the Rubrics into the Outcome; the Process Emerges as the Result
Teaching Models "How to Fish" Rather Than Giving Them Fish

Citation

@article{liu2025nover,
  title={NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning},
  author={Liu, Wei and Qi, Siya and Wang, Xinyu and Qian, Chen and Du, Yali and He, Yulan},
  journal={arXiv preprint arXiv:2505.16022},
  year={2025}
}