OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LLMs

Introduction to Generalization in Mathematical Reasoning

Large language models with long chain-of-thought (CoT) reasoning, such as DeepSeek-R1, have achieved strong results on Olympiad-level mathematics. However, models trained through supervised fine-tuning (SFT) or reinforcement learning (RL) tend to rely on a narrow repertoire of techniques, such as reapplying familiar algebraic rules or defaulting to coordinate geometry for diagram problems. Because these models follow learned reasoning patterns rather than demonstrating genuine mathematical creativity, they struggle with complex tasks that demand original insights. Current math datasets are also poorly suited for analyzing which reasoning skills RL training actually imparts: large-scale corpora mix questions across many topics and difficulty levels, making it difficult to isolate specific reasoning skills.

Limitations of Current Mathematical Benchmarks

Existing work on out-of-distribution generalization focuses on handling test distributions that differ from the training data, a capability crucial for mathematical reasoning, physical modeling, and financial forecasting. Compositional generalization techniques aim to help models systematically combine learned skills. To benchmark mathematical ability, researchers have built datasets in several ways: hiring humans to write problems (GSM8K, MinervaMath), collecting exam questions (AIME, OlympiadBench), and scraping and filtering exam corpora (NuminaMath, BigMath). However, these approaches either lack sufficient challenge for modern LLMs or fail to provide the granularity needed for fine-grained analysis.

Introducing OMEGA: A Controlled Benchmark for Reasoning Skills

Researchers from the University of California, Ai2, the University of Washington, and dmodel.ai have proposed OMEGA, a benchmark designed to evaluate three dimensions of out-of-distribution generalization, inspired by Boden’s typology of creativity: exploratory, compositional, and transformative. OMEGA constructs matched training and test pairs designed to isolate specific reasoning skills along each dimension. Its train and test problems are built from carefully engineered templates, allowing precise control over diversity, complexity, and the specific reasoning strategies required for solutions. In total, it employs 40 templated problem generators across six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles.
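The actual generators live in the project's GitHub repository; the sketch below is only a hypothetical illustration of the idea, showing how a single arithmetic template (invented here, not taken from OMEGA) can expose an explicit complexity knob and emit matched train/test splits with programmatically verifiable answers.

import random
from functools import reduce
from math import gcd

def gcd_chain_problem(complexity: int, seed: int | None = None) -> dict:
    """Hypothetical templated generator: a nested-gcd arithmetic problem
    whose difficulty scales with the number and size of the operands."""
    rng = random.Random(seed)
    # The complexity knob controls operand count and magnitude.
    values = [rng.randint(10, 10 ** complexity) for _ in range(complexity + 1)]
    question = "Compute gcd(" + ", ".join(map(str, values)) + ")."
    # Every generated instance ships with a verified ground-truth answer.
    answer = reduce(gcd, values)
    return {"domain": "arithmetic", "complexity": complexity,
            "question": question, "answer": str(answer)}

# Matched splits: for example, train on complexity 2-3 and test on 4-5.
train_set = [gcd_chain_problem(c, seed=i) for c in (2, 3) for i in range(500)]
test_set = [gcd_chain_problem(c, seed=i) for c in (4, 5) for i in range(100)]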

Evaluation on Frontier LLMs and Reinforcement Learning Setup

Researchers evaluate four frontier models, DeepSeek-R1, Claude-3.7-Sonnet, OpenAI o3-mini, and OpenAI o4-mini, across different complexity levels. For the RL generalization experiments, the framework applies the GRPO algorithm to 1,000 training problems using the Qwen2.5-7B-Instruct and Qwen2.5-Math-7B models. Exploratory generalization trains on restricted complexity levels and evaluates on higher-complexity problems. Compositional generalization trains models on individual skills in isolation and tests their ability to combine and apply those skills. Transformative generalization trains on conventional solution approaches and evaluates performance on problems that require unconventional strategies.
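The article does not reproduce the training code. As a minimal sketch, assuming Hugging Face TRL's GRPOTrainer and an exact-match reward (the dataset rows, reward function, and hyperparameters below are illustrative assumptions, not the paper's actual configuration), an exploratory-generalization RL run on Qwen2.5-7B-Instruct could look roughly like this:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Roughly 1,000 low-complexity problems produced by the templated
# generators; the two rows below are placeholders for illustration.
dataset = Dataset.from_list([
    {"prompt": "Compute gcd(84, 126, 210).", "answer": "42"},
    {"prompt": "Compute gcd(36, 60, 84, 132).", "answer": "12"},
])

def exact_match_reward(completions, answer, **kwargs):
    # Reward 1.0 when the gold answer string appears in the completion.
    return [1.0 if gold in out else 0.0 for out, gold in zip(completions, answer)]

config = GRPOConfig(
    output_dir="omega-grpo-qwen",
    num_generations=8,            # sampled completions per prompt
    max_completion_length=1024,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=exact_match_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()

Evaluation would then compare accuracy on held-out, higher-complexity (or differently-skilled) problems against the pre-RL baseline to measure each generalization axis.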

Performance Observations and Model Behavior Patterns

Reasoning LLMs tend to perform worse as problem complexity increases; they often find correct solutions early but then spend excess tokens on unnecessary verification. RL applied only to low-complexity problems improves generalization to medium-complexity problems, with larger gains on in-domain examples than on out-of-distribution ones, indicating that RL is most effective at reinforcing familiar patterns. For instance, in the Zebra Logic domain, the base model achieves only 30% accuracy, yet RL training raises performance by 61 points on in-domain examples and 53 points on out-of-distribution examples, without any SFT.

Conclusion: Toward Advancing Transformative Reasoning

In conclusion, researchers introduced OMEGA, a benchmark that isolates and evaluates three axes of out-of-distribution generalization in mathematical reasoning: exploratory, compositional, and transformative. The empirical study reveals three insights: (a) RL fine-tuning significantly improves performance on in-distribution and exploratory generalization tasks, (b) RL’s benefits for compositional tasks are limited, and (c) RL fails to induce genuinely new reasoning patterns. These findings highlight a fundamental limitation: RL can amplify problem-solving breadth and depth, but it falls short of enabling the creative leaps essential for transformative reasoning. Future work should explore curriculum scaffolding and meta-reasoning controllers.


Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
