In the past, I have interned at FAIR Paris (with Remi Munos),
Amazon NYC (with Udaya Ghai
and Dean Foster),
and Microsoft Research NYC (with Akshay Krishnamurthy
and Dylan Foster).
I finished my master's degree in MLD,
advised by Kris Kitani.
I completed my undergraduate degree at UC San Diego, double majoring in CS and Math,
advised by Sicun Gao.
Research
I am interested in the theory, science, and applications of interactive decision-making.
My current research focuses on when and how we can achieve efficient and robust learning, viewed through the two foundations of interactive decision-making: the environment (structure, data) and the policy (nowadays built on foundation models). I am also interested in applying principled decision-making algorithms to large-scale real-world problems, such as generative models and robotics.
Yuda Song, Dhruv Rohatgi, Aarti Singh, J Andrew Bagnell
NeurIPS, 2025
We study the algorithmic trade-off between expert distillation and end-to-end RL in POMDPs, where empirical results on vision-based robotic locomotion tasks corroborate our theoretical findings.
Yuda Song, Hanlin Zhang, Udaya Ghai, Carson Eisenach, Sham M. Kakade, Dean Foster
ICLR, 2025
Through a large-scale scientific study, we find 1) a scaling law of LLM self-improvement, 2) a loss of coverage through LLM post-training, 3) test-time scaling through self-improvement, and many other interesting phenomena.
Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun
NeurIPS, 2024
We prove that offline contrastive-based methods (e.g., DPO)
require a stronger coverage property than online RL-based methods (e.g., RLHF). We propose
Hybrid Preference Optimization to combine the benefits of both offline and online methods.
We consider a practical setting of hybrid RL where the agent only has access to offline observation data without action labels (e.g., videos of human demonstrations), and we show that it is possible to achieve efficient learning in this setting with a practical algorithm.
Yuda Song, Lili Wu, Dylan J. Foster, Akshay Krishnamurthy
ICML, 2024
We introduce a new theoretical framework, RichCLD, in which the agent performs control based on high-dimensional observations, but the environment is governed by low-dimensional latent states and Lipschitz continuous dynamics.
We prove the benefit of representation learning on diverse source environments, which enables efficient learning on the
target environment with the learned representation in the low-rank MDP setting.
An efficient rich-observation RL algorithm that learns to decode from rich observations to latent states
(via adversarial training), while balancing exploration and exploitation.
A simple provably efficient model-based algorithm that achieves competitive performance in both dense reward
continuous control tasks and sparse reward control tasks that require efficient exploration.
We study Sim-to-Real/policy transfer/policy adaptation under a model-based framework,
resulting in an algorithm that enjoys strong theoretical guarantees
and excellent empirical performance.
Talks
Rethinking the Foundation of LLM Post-Training: Signal and Objective
Frontiers in Online Reinforcement Learning Workshop, March 2026.
Stanford, March 2026.
Harvard ML Foundations Group, February 2026.
Harnessing Additional Feedback in LLM Post-Training