CMU researchers are presenting 127 papers at the Forty-Second International Conference on Machine Learning (ICML 2025), held July 13th-19th at the Vancouver Convention Center.

Table of Contents
- Oral Papers
- Spotlight Papers
- Poster Papers
- Accountability, Transparency, And Interpretability
- Active Learning And Interactive Learning
- Applications
- Causality
- Chemistry, Physics, And Earth Sciences
- Computer Vision
- Deep Learning
- Discrete And Combinatorial Optimization
- Domain Adaptation And Transfer Learning
- Evaluation
- Everything Else
- Fairness
- Foundation Models
- Game Theory
- General Machine Learning
- Graph Neural Networks
- Graphical Models
- Health / Medicine
- Language, Speech And Dialog
- Large Language Models
- Learning Theory
- Multi-agent
- Online Learning And Bandits
- Online Learning, Active Learning And Bandits
- Optimization
- Privacy
- Probabilistic Methods
- Reinforcement Learning And Planning
- Representation Learning
- Research Priorities, Methodology, And Evaluation
- Robotics
- Safety
- Security
- Sequential Models, Time Series
- Social Aspects
- Structure Learning
- Supervised Learning
- Theory
- Time Series
Oral Papers
Expected Variational Inequalities
This paper introduces expected variational inequalities (EVIs), a relaxed version of variational inequalities (VIs) where the goal is to find a distribution that satisfies the VI condition in expectation. While VIs are generally hard to solve, the authors show that EVIs can be solved efficiently, even under challenging, non-monotone conditions, by leveraging ideas from game theory. EVIs generalize the concept of correlated equilibria and unify various results across smooth games, constrained games, and settings with non-concave utilities, making them broadly applicable beyond traditional game-theoretic contexts.
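As a rough formal sketch of the relaxation (under one common sign convention; the paper's exact definition and class of deviations may differ), with F denoting the operator and X the feasible set:

```latex
% Classical VI: find a single point x* that resists every deviation.
\[
\text{(VI)}\qquad \text{find } x^{\ast} \in \mathcal{X} \ \text{ such that } \ \langle F(x^{\ast}),\, x - x^{\ast} \rangle \;\ge\; 0 \quad \forall x \in \mathcal{X}.
\]
% Expected VI: only ask for a distribution whose draws satisfy the condition in expectation,
% much as correlated equilibria relax Nash equilibria.
\[
\text{(EVI)}\qquad \text{find a distribution } \mu \text{ over } \mathcal{X} \ \text{ such that } \ \mathbb{E}_{x \sim \mu}\big[\langle F(x),\, x' - x \rangle\big] \;\ge\; 0 \quad \forall x' \in \mathcal{X}.
\]
```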
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
This paper shows that voting-based benchmarks for evaluating LLMs (such as Chatbot Arena) can be vulnerable to adversarial manipulation if proper defenses aren’t in place. The authors show that an attacker can identify which model generated a response and then strategically vote to boost or demote specific models, altering the leaderboard with only around a thousand votes in a simulated environment. They collaborate with Chatbot Arena’s developers to propose and implement security measures such as reCAPTCHA and login requirements that significantly raise the cost of such attacks and enhance the platform’s robustness.
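A minimal simulation of the attack's core idea (hypothetical code, not the authors' implementation): if an attacker can guess which anonymous model produced each response, they can cast votes that systematically favor a target model and shift its rating. Chatbot Arena fits a Bradley-Terry model rather than running Elo updates, but the qualitative effect is similar.

```python
import random

def elo_update(r_winner, r_loser, k=4.0):
    """Simple Elo update used here as a stand-in for a Bradley-Terry rating fit."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    return r_winner + k * (1 - expected), r_loser - k * (1 - expected)

ratings = {"target_model": 1000.0, "other_model": 1000.0}
DETECTOR_ACCURACY = 0.95  # assumed accuracy of the attacker's model-identification step

for _ in range(1000):  # roughly a thousand adversarial votes, as in the simulated setting
    # The attacker guesses which side is the target model and votes for it.
    guessed_correctly = random.random() < DETECTOR_ACCURACY
    winner, loser = (("target_model", "other_model") if guessed_correctly
                     else ("other_model", "target_model"))
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(ratings)  # the target model's rating ends up inflated well above its true strength
```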
High-Dimensional Prediction for Sequential Decision Making
This paper presents a new algorithmic framework for making reliable, multi-dimensional forecasts in adversarial, nonstationary environments. Unlike existing online learning methods, this approach offers simultaneous performance guarantees for many agents, even when they face different objectives, act over large action spaces, or care about specific conditions (e.g. weather or route choice). The algorithm ensures low bias across many conditional events and enables each agent to achieve strong guarantees like diminishing regret. Applications include efficient solutions for online combinatorial optimization and multicalibration.
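As a concrete (and purely illustrative) picture of the "low bias across many conditional events" guarantee: a multicalibration-style check asks that forecasts be nearly unbiased on every event an agent might condition on, not just on average. The events and data below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
forecasts = rng.uniform(size=10_000)            # forecaster's probability predictions
outcomes = rng.binomial(1, forecasts)           # outcomes (calibrated by construction here)
# Conditioning events different agents care about, e.g. weather regimes or route choices.
events = {"high-forecast days": forecasts > 0.7, "low-forecast days": forecasts <= 0.3}

for name, mask in events.items():
    bias = abs(np.mean(outcomes[mask] - forecasts[mask]))
    print(f"bias conditioned on {name}: {bias:.4f}")  # should be small for every event, not just overall
```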
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
This paper introduces LLM-SRBench, a new benchmark designed to rigorously evaluate the ability of LLMs to discover scientific equations (rather than merely recall them from training data). Existing tests often rely on well-known equations, making it hard to tell whether models are truly reasoning or just memorizing. LLM-SRBench addresses this by including 239 challenging problems across four scientific domains, split into two categories: one that disguises familiar physics equations (LSR-Transform) and another that features fully synthetic, reasoning-driven tasks (LSR-Synth). Evaluations show that even the best current models only achieve 31.5% accuracy, highlighting the difficulty of the task and establishing LLM-SRBench as a valuable tool for driving progress in LLM-based scientific discovery.
On Differential Privacy for Adaptively Solving Search Problems via Sketching
This paper explores how to use differential privacy to protect against information leakage in adaptive search queries, a harder problem than traditional private estimation tasks. Unlike prior work that only returns numerical summaries (e.g., cost), the authors design algorithms that return actual solutions, like nearest neighbors or regression vectors, even when the inputs or queries change over time. They show how key problem parameters (like the number of approximate near neighbors or condition number of the data matrix) affect the performance of these private algorithms. This work has practical implications for AI systems that rely on private database searches or real-time regression, enabling them to provide useful results while safeguarding sensitive information from attackers.
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
This paper proposes a set of simple, abstract tasks designed to probe the creative limits of today’s language models in a controlled and measurable way. These tasks mimic real-world open-ended challenges like generating analogies or designing puzzles, where success requires discovering new connections or constructing novel patterns. The authors show that standard next-token prediction tends to be short-sighted and overly reliant on memorization, while alternative approaches like teacherless training and diffusion models produce more diverse, original outputs. They also introduce a technique called seed-conditioning, which adds randomness at the input rather than the output and can improve coherence without sacrificing creativity.
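A toy illustration of where seed-conditioning injects randomness (hypothetical code using Hugging Face transformers with gpt2 as a stand-in; in the paper, models are actually trained with such seeds, whereas this snippet only shows the inference-side mechanics): a random seed string is prepended to the input, and decoding itself stays greedy rather than relying on output-side temperature sampling.

```python
import random, string
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not the models used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def seed_conditioned_generate(prompt, seed_len=8):
    # Randomness enters through a random prefix on the input side...
    seed = "".join(random.choices(string.ascii_lowercase, k=seed_len))
    inputs = tok(f"[seed:{seed}] {prompt}", return_tensors="pt")
    # ...while decoding is deterministic (greedy), not sampled at the output.
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

for _ in range(3):  # different seeds yield different, but individually coherent, continuations
    print(seed_conditioned_generate("Invent an analogy for gradient descent:"))
```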
Training a Generally Curious Agent
This paper introduces Paprika, a fine-tuning method that equips language models with general decision-making and exploration strategies, enabling them to adapt to new tasks through interaction alone (i.e. without further training). Paprika trains models on synthetic environments requiring different exploration behaviors, encouraging them to learn flexible strategies rather than memorizing solutions. To improve efficiency, it uses a curriculum learning-based approach that prioritizes tasks with high learning value, making the most of limited interaction data. Models trained with Paprika show strong transfer to completely new tasks, suggesting a promising direction for building AI agents that can learn to solve unfamiliar, sequential problems with minimal supervision.
Spotlight Papers
GMAIL: Generative Modality Alignment for generated Image Learning
Generative models can create realistic images that could help train machine learning models, but using them as if they were real images can lead to problems because of differences between the two. This paper introduces a method called GMAIL that treats real and generated images as separate types (or modalities) and aligns them in a shared latent space during training, rather than just mixing them at the pixel level. The approach fine-tunes models on generated data using a special loss to bridge the gap, then uses these aligned models to improve training on tasks like image captioning and retrieval. The results show that GMAIL improves performance on several vision-language tasks and scales well as more generated data is added.
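A rough sketch of the modality-alignment idea (hypothetical PyTorch code; GMAIL's actual loss, encoder, and fine-tuning recipe are described in the paper): embed real and generated images with the same encoder, and add a term that pulls paired embeddings together in the shared latent space instead of mixing the two kinds of images at the pixel level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in image encoder; GMAIL fine-tunes a pretrained vision-language model instead."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

encoder = TinyEncoder()
real = torch.randn(8, 3, 64, 64)       # batch of real images
generated = torch.randn(8, 3, 64, 64)  # paired generated images (e.g., same captions)

z_real, z_gen = encoder(real), encoder(generated)

task_loss = torch.tensor(0.0)  # placeholder for the usual captioning/retrieval objective
# Alignment term: treat real vs. generated as two modalities and pull paired embeddings together.
align_loss = 1.0 - F.cosine_similarity(z_real, z_gen, dim=-1).mean()
loss = task_loss + 0.1 * align_loss    # 0.1 is an arbitrary weight chosen for this sketch
loss.backward()
```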
LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D
LOCATE 3D is a model that can find specific objects in 3D scenes based on natural language descriptions (like “the small coffee table between the sofa and the lamp”). It achieves state-of-the-art performance on standard benchmarks and works well in real-world settings, like on robots or AR devices, by using RGB-D sensor data. A key component is 3D-JEPA, a new self-supervised learning method that uses features from 2D vision models (like CLIP or DINO) to understand 3D point clouds through masked prediction tasks. The model is trained on a newly introduced large dataset (130K+ examples), helping it generalize better across different environments.
Masked Autoencoders Are Effective Tokenizers for Diffusion Models
This paper introduces MAETok, a masked autoencoder designed to create a high-quality, semantically meaningful latent space for diffusion models. The authors show that having a well-structured latent space, meaning fewer Gaussian modes and more discriminative features, leads to better image generation without needing complex variational autoencoders. MAETok outperforms existing methods on ImageNet using just 128 tokens, and it’s also much faster: 76× quicker to train and 31× faster during inference. The key takeaway is that the structure of the latent space, not variational constraints, is what truly matters for high-quality diffusion-based generation.
This paper highlights the lack of robust systems for identifying and reporting flaws in general-purpose AI (GPAI), especially compared to mature fields like software security. The authors propose three key solutions: (1) standardized reporting formats and engagement rules to streamline flaw reporting and triaging, (2) formal disclosure programs with legal protections for researchers (similar to bug bounties), and (3) better infrastructure for distributing flaw reports to relevant stakeholders. These steps aim to address growing risks like jailbreaks and cross-system vulnerabilities, ultimately improving the safety and accountability of GPAI systems.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
This paper explores how to best scale test-time compute for large language models (LLMs), comparing two strategies: (1) distilling search traces (verifier-free, or VF) and (2) using verifiers or rewards to guide learning (verifier-based, or VB). The authors show—both theoretically and through experiments—that VB methods significantly outperform VF ones when working with limited compute or data. They explain that this performance gap grows as models and tasks get more complex, especially when solution paths vary in style or quality. Ultimately, the paper argues that verification is essential for effectively scaling LLM performance, especially for reasoning tasks.
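A schematic of the two strategies being compared (hypothetical helper functions, not the paper's code): a verifier-free approach spends its budget on a single generation pass imitating distilled search traces, while a verifier-based approach spends it on many candidates and lets a reward model pick the best one.

```python
import random

def generate_candidates(prompt, n):
    """Stand-in for sampling n candidate solutions from an LLM."""
    return [f"solution {i} to: {prompt}" for i in range(n)]

def verifier_score(candidate):
    """Stand-in for a learned verifier / reward model."""
    return random.random()

def verifier_free(prompt, budget):
    # VF: one (possibly distilled) generation; the extra budget cannot be used for selection.
    return generate_candidates(prompt, 1)[0]

def verifier_based(prompt, budget):
    # VB: best-of-n selection, where the verifier chooses among many candidates.
    candidates = generate_candidates(prompt, budget)
    return max(candidates, key=verifier_score)

print(verifier_free("prove the identity", budget=8))
print(verifier_based("prove the identity", budget=8))
```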
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
As long-context LLMs become more common, their growing memory demands during inference slow down performance, especially due to the expanding key-value (KV) cache. This paper introduces ShadowKV, a system that significantly improves throughput by compressing the key cache using low-rank representations and offloading the value cache without major latency costs. It reconstructs only the necessary KV pairs during decoding to maintain speed and accuracy. Experiments show ShadowKV supports much larger batch sizes (up to 6×) and improves throughput by over 3× on standard hardware, all while preserving model quality across several LLMs and benchmarks.
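A toy sketch of the key-cache compression idea (hypothetical NumPy code; the actual ShadowKV system also handles value-cache offloading, KV selection, and GPU kernels): store the key cache as a low-rank factorization, and reconstruct full keys only for the positions needed at decode time.

```python
import numpy as np

seq_len, head_dim, rank = 4096, 128, 32  # illustrative sizes, not the paper's settings

# Random stand-in for a per-head key cache; real pre-RoPE key caches are far closer to
# low-rank than this random matrix, so the reconstruction error here overstates the loss.
K = np.random.randn(seq_len, head_dim).astype(np.float32)

# Compress: keep a rank-r factorization instead of the full key cache.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :rank] * S[:rank]      # (seq_len, rank)
B = Vt[:rank, :]                # (rank, head_dim)
print("compression ratio:", K.size / (A.size + B.size))

# Decode time: reconstruct only the keys for positions the current query actually attends to.
needed = np.array([5, 900, 2048, 4000])
K_needed = A[needed] @ B
print("relative error:", np.linalg.norm(K_needed - K[needed]) / np.linalg.norm(K[needed]))
```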
Poster Papers
Accountability, Transparency, And Interpretability
Active Learning And Interactive Learning
Applications
Causality
Chemistry, Physics, And Earth Sciences
Computer Vision
Deep Learning
Discrete And Combinatorial Optimization
Domain Adaptation And Transfer Learning
Evaluation
Everything Else
Fairness
Foundation Models
Game Theory
General Machine Learning
Graph Neural Networks
Graphical Models
Health / Medicine
Language, Speech And Dialog
Large Language Models
Learning Theory
Multi-agent
Online Learning And Bandits
Online Learning, Active Learning And Bandits
Optimization
Privacy
Probabilistic Methods
Reinforcement Learning And Planning
Representation Learning
Research Priorities, Methodology, And Evaluation
Robotics
Safety
Security
Sequential Models, Time Series
Social Aspects
Structure Learning
Supervised Learning
Theory
Time Series