NOVER

Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Wei Liu¹ • Siya Qi¹ • Xinyu Wang¹ • Chen Qian² • Yali Du¹·³ • Yulan He¹·³
¹King's College London • ²Shanghai Jiao Tong University • ³The Alan Turing Institute

TL;DR

NOVER (NO-VErifier Reinforcement learning) enables incentive training on any text-to-text task without external verifiers. It uses the policy model's reasoning perplexity to estimate the reward.

• Your LLM is secretly a verifier.
• Your LLM no longer has to reason only on easy-to-verify tasks.
• Your LLM can reason on ANY task.
• Your LLM can be incentivized to do more than reason.

NOVER Framework Overview
NOVER extends RLVR to any text-to-text task, beyond easy-to-verify math and coding problems.

Incentivize Reasoning on Any Task

NOVER enables training large reasoning models on any text data and any task.
No verifiers, reward models, or rules are needed: only the ground-truth answer and the policy model itself.
General Reasoning: ⚛️ physics • ⚖️ law • 🏥 medical • 💰 finance
Creative Tasks: 🎨 creative writing
Social Intelligence: 🧠 theory of mind • 😊 emotion detection • 🤝 social reasoning
Natural Language Generation: 🌍 translation • 📚 summarization


NOVER Methodology


Comparison of training paradigms:

• SFT: Memorize input-output patterns
• RLHF: Train a reward model to give preference feedback
• RLVR: Rule-based reward; end-to-end outcome RL
• NOVER: Reasoning perplexity as reward; reason on any task

Reasoning Perplexity
$$P_r(p, t, g) = \exp\left(-\frac{\sum_{i=1}^{|g|} \log \pi_{p}(g_i \mid p, t, g_{<i})}{|g| \cdot N(|t|)}\right)$$
The perplexity of the policy model on the ground-truth answer, conditioned on the reasoning trajectory, is used as a reward proxy.
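Below is a minimal sketch of how this reward proxy could be computed with a Hugging Face causal LM. The model name and the normalization term N(|t|) (shown as a placeholder constant) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the reasoning-perplexity reward proxy, assuming a Hugging Face causal LM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
proxy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

@torch.no_grad()
def reasoning_perplexity(prompt: str, reasoning: str, ground_truth: str) -> float:
    """P_r(p, t, g): perplexity of the ground truth g given prompt p and reasoning trajectory t."""
    context_ids = tokenizer(prompt + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(ground_truth, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    logits = proxy(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # token i+1 predicted from its prefix
    token_logp = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    answer_len = answer_ids.shape[1]
    answer_logp = token_logp[:, -answer_len:].sum()       # sum_i log pi_p(g_i | p, t, g_<i)

    n_t = 1.0  # placeholder for N(|t|), the reasoning-length normalization defined in the paper
    return torch.exp(-answer_logp / (answer_len * n_t)).item()
```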
Rewards
$$R_{\mathrm{total}} = w_{\mathrm{f}} R_{\mathrm{f}} + \mathbb{I}(R_{\mathrm{f}} = 1) \cdot (w_{\mathrm{r}} R_{\mathrm{r}} + w_{\mathrm{e}} R_{\mathrm{e}})$$
Combined reward incorporating format ($R_{\mathrm{f}}$), reasoning ($R_{\mathrm{r}}$), and efficiency ($R_{\mathrm{e}}$) components; the indicator gates the reasoning and efficiency rewards on a correct format.
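As a minimal sketch, assuming $R_{\mathrm{f}}$ is a binary format reward and the weights are tunable hyperparameters (the defaults below are illustrative, not the paper's settings), the gating could be implemented as:

```python
# Sketch of R_total = w_f * R_f + 1[R_f = 1] * (w_r * R_r + w_e * R_e).
# Weight defaults are illustrative placeholders, not the paper's settings.
def total_reward(r_f: float, r_r: float, r_e: float,
                 w_f: float = 1.0, w_r: float = 1.0, w_e: float = 1.0) -> float:
    gated = (w_r * r_r + w_e * r_e) if r_f == 1.0 else 0.0  # indicator gate on correct format
    return w_f * r_f + gated
```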
Policy-Proxy Synchronization
$$\pi_{\mathrm{p}} \leftarrow \alpha \cdot \pi_{\mathrm{p}} + (1-\alpha) \cdot \pi_{\theta}$$
Smooth synchronization between the policy and the proxy ensures stable training with limited resources.
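A minimal sketch of this step as an exponential moving average over model parameters; the update frequency and the value of alpha are assumptions here:

```python
import torch

@torch.no_grad()
def sync_proxy(proxy: torch.nn.Module, policy: torch.nn.Module, alpha: float) -> None:
    # pi_p <- alpha * pi_p + (1 - alpha) * pi_theta, applied parameter-wise.
    for p_proxy, p_policy in zip(proxy.parameters(), policy.parameters()):
        p_proxy.mul_(alpha).add_(p_policy, alpha=1.0 - alpha)
```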

Experimental Results

Overall Results on the NOVEReason Dataset

| Method | NR | GT | WI | SGN | EB | TB | OPUS |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | | | | | | | |
| Base | 21.80% | 43.10% | 18.40% | 18.70% | 32.03% | 46.79% | 16.70% |
| + CoT | 24.40% | 48.90% | 24.20% | 14.76% | 28.12% | 51.23% | 1.40% |
| + SFT | 27.00% | 36.20% | 27.30% | 20.08% | 36.72% | 48.66% | 17.30% |
| + NOVER | 28.60% | 60.30% | 28.10% | 41.64% | 38.28% | 57.88% | 20.70% |
| Qwen2.5-7B | | | | | | | |
| Base | 31.80% | 48.50% | 20.70% | 24.21% | 28.91% | 44.22% | 19.30% |
| + CoT | 31.20% | 57.60% | 29.20% | 33.46% | 38.28% | 50.99% | 1.60% |
| + SFT | 27.50% | 45.20% | 33.50% | 37.85% | 47.66% | 57.06% | 23.30% |
| + NOVER | 38.20% | 61.80% | 36.60% | 50.79% | 49.22% | 67.79% | 26.80% |
| Other Baselines | | | | | | | |
| Qwen2.5-3B-Instruct | 27.10% | 50.00% | 31.50% | 21.25% | 40.62% | 58.69% | 19.90% |
| Qwen2.5-7B-Instruct | 29.90% | 56.20% | 35.60% | 67.72% | 46.88% | 65.23% | 23.50% |
| R1-Distill-Qwen-7B | 41.00% | 60.20% | 38.00% | 40.16% | 35.16% | 54.61% | 8.20% |
NR: Natural Reasoning, GT: General Thoughts-430k, WI: WebInstruct, SGN: SS-GEN, EB: EmoBench, TB: TomBench, OPUS: OPUS-BOOK-TRANSLATION.

General Reasoning with Different Backends

| Model Type | Model | Method | NR | GT | WI |
|---|---|---|---|---|---|
| Base | Qwen2.5-3B | Base | 21.80% | 43.10% | 18.40% |
| | | + CoT | 24.40% | 48.90% | 24.20% |
| | | + SFT | 27.00% | 36.20% | 27.30% |
| | | + NOVER | 28.60% | 60.30% | 28.10% |
| | Qwen2.5-7B | Base | 31.80% | 48.50% | 20.70% |
| | | + CoT | 31.20% | 57.60% | 29.20% |
| | | + SFT | 27.50% | 45.20% | 33.50% |
| | | + NOVER | 38.20% | 61.80% | 36.60% |
| Instruct | Llama-3.1-8B | Base | 34.20% | 36.70% | 29.90% |
| | | + CoT | 28.10% | 35.10% | 30.00% |
| | | + SFT | 23.60% | 23.40% | 34.50% |
| | | + NOVER | 40.70% | 41.50% | 34.00% |
| | Mistral-7B | Base | 33.00% | 17.80% | 27.00% |
| | | + CoT | 29.20% | 18.60% | 27.10% |
| | | + SFT | 22.50% | 20.70% | 27.80% |
| | | + NOVER | 32.20% | 21.90% | 29.30% |
NR: Natural Reasoning, GT: General Thoughts-430k, WI: WebInstruct.

Key Takeaways

  • NOVER trains successfully on both pretrained and instruct models, with larger gains on stronger base models.
  • Despite the free-form nature of the answers, NOVER still prefers objective solutions over subjective ones.
  • On general reasoning, NOVER inherits the base model's capability boundaries, as also observed in math reasoning; it struggles on false-premise tasks such as FANToM.
  • NOVER's design prevents reward hacking, avoiding issues such as reasoning explosion and collapse.
  • Unlike closed-source or verifier-based rewards, which suffer from cold-start and hacking risks, NOVER remains stable.
  • Its dense reward signals allow greater error tolerance and encourage diverse reasoning patterns.

Inverse Incentive Training

Reward the Outcome, Incentivize Process
Write the Rubrics into the Outcome; the Process Emerges as the Result
Teaching Models "How to Fish" Rather Than Giving Them Fish

Citation

@article{liu2025nover,
  title={NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning},
  author={Liu, Wei and Qi, Siya and Wang, Xinyu and Qian, Chen and Du, Yali and He, Yulan},
  journal={arXiv preprint arXiv:2505.16022},
  year={2025}
}