New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them more suitable for instruction-based applications or precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes: offline approaches that depend on static, pre-generated data, and fully online approaches that continuously update with each new interaction. Each has distinct drawbacks. Offline methods cannot adapt during training, which limits performance, while online methods often demand more computational resources. Moreover, ensuring that models perform well on both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.
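To make the contrast concrete, the sketch below reduces each objective to its core in PyTorch: the DPO loss over a preference pair and the group-relative advantage that GRPO feeds into a PPO-style update. This is an illustrative rendering of the well-known formulations, not the authors' implementation; the function names and the beta default are placeholders.

```python
# Minimal PyTorch sketch of the two objectives discussed above.
# Not the paper's code; names and defaults are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of (chosen, rejected) preference pairs, given
    summed log-probabilities under the policy and a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen completion's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def grpo_advantages(rewards):
    """Group-relative advantages: rewards for a group of completions sampled
    from the same prompt, normalized within that group."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

The contrast in data requirements is visible here: DPO only needs stored log-probabilities over a fixed preference dataset, which is why it pairs naturally with offline data, whereas GRPO needs fresh groups of sampled completions to compute its relative advantages.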

A Balanced Alternative for LLM Alignment

Researchers from Meta and NYU introduced a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model's generation and training components are synchronized, rather than updating at every training step, as in fully online methods, or not at all, as in offline setups. The semi-online method strikes a middle ground by adjusting this synchronization rate. The researchers designed the approach to reduce training time while maintaining high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.
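A minimal sketch of that synchronization schedule is shown below. It assumes a rollout copy of the policy and placeholder callables (generate_fn, reward_fn, loss_fn) for generation, reward scoring, and the DPO or GRPO objective; none of these names come from the paper.

```python
# Illustrative semi-online training loop: rollout weights are refreshed
# from the trained policy only every `sync_interval` steps.
# A sketch of the idea, not the authors' implementation.
import copy
import torch

def semi_online_train(policy, dataloader, optimizer,
                      generate_fn, reward_fn, loss_fn, sync_interval=100):
    rollout_model = copy.deepcopy(policy).eval()  # possibly stale generator
    for step, prompts in enumerate(dataloader):
        # sync_interval = 1 recovers fully online training;
        # a very large interval approaches the offline setting.
        if step % sync_interval == 0:
            rollout_model.load_state_dict(policy.state_dict())
        with torch.no_grad():
            responses = generate_fn(rollout_model, prompts)
        rewards = reward_fn(prompts, responses)
        loss = loss_fn(policy, prompts, responses, rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this framing, the synchronization interval is the single knob that interpolates between the offline and online extremes, which is what lets the same loop host either a DPO or a GRPO loss.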

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and math problem-solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and responses were evaluated with the Athene-RM-8B reward model, which assigns a scalar score to each response. For verifiable tasks, the team used the NuminaMath dataset in conjunction with the Math-Verify toolkit, which checks whether generated answers match the expected outputs. Experiments were run on 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.
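The sketch below shows one way such a mixed setup could route each prompt to the appropriate reward signal. Here, reward_model_score and answer_checker are hypothetical stand-ins for the Athene-RM-8B scorer and the Math-Verify check; they do not reflect those tools' actual APIs.

```python
# Hypothetical reward routing for mixed verifiable / non-verifiable training.
# `reward_model_score` and `answer_checker` are placeholders, not real APIs.

def compute_reward(prompt, response, reference_answer,
                   reward_model_score, answer_checker):
    if reference_answer is not None:
        # Verifiable task (e.g., a NuminaMath problem): binary reward
        # from an automatic answer check.
        return 1.0 if answer_checker(response, reference_answer) else 0.0
    # Non-verifiable task (e.g., a WildChat prompt): scalar score
    # from a learned reward model.
    return reward_model_score(prompt, response)
```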

Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences emerged across setups. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. The same trend appeared on the NuminaMath benchmark, where offline DPO achieved 36.4% and semi-online variants increased this to 39.4% (s = 10). The gains were not limited to math tasks. When non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup produced stronger average scores, indicating that the method generalizes effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU effectively increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.


Check out the Paper. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.

