CMU researchers are presenting 143 papers at the Thirteenth International Conference on Learning Representations (ICLR 2025), held from April 24 – 28 at the Singapore EXPO. Here is a quick overview of the areas our researchers are working on:
And here are our most frequent collaborator institutions:
Table of Contents
- Oral Papers
- Spotlight Papers
- Poster Papers
- Alignment, Fairness, Safety, Privacy, And Societal Considerations
- Applications to Computer Vision, Audio, Language, And Other Modalities
- Applications to Neuroscience & Cognitive Science
- Applications to Physical Sciences (Physics, Chemistry, Biology, Etc.)
- Applications to Robotics, Autonomy, Planning
- Causal Reasoning
- Datasets and Benchmarks
- Foundation or Frontier Models, Including LLMs
- Generative Models
- Infrastructure, Software Libraries, Hardware, Systems, etc.
- Interpretability and Explainable AI
- Learning on Graphs and Other Geometries & Topologies
- Learning Theory
- Neurosymbolic & Hybrid AI Systems (Physics-Informed, Logic & Formal Reasoning, etc.)
- Optimization
- Other Topics in Machine Learning (i.e., none of the above)
- Probabilistic Methods (Bayesian Methods, Variational Inference, Sampling, Uncertainty Quantification, etc.)
- Reinforcement Learning
- Transfer Learning, Meta Learning, and Lifelong Learning
- Unsupervised, Self-supervised, Semi-supervised, and Supervised Representation Learning
Oral Papers
Backtracking Improves Generation Safety
This paper introduces backtracking, a new technique that allows language models to recover from unsafe text generation by using a special [RESET] token to “undo” problematic outputs. Unlike traditional safety methods that aim to prevent harmful responses outright, backtracking trains the model to self-correct mid-generation. The authors demonstrate that backtracking significantly improves safety without sacrificing helpfulness, and it also provides robustness against several adversarial attacks.
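As a rough illustration of the decoding-side behavior described above, the sketch below drops everything generated before a reset marker and keeps only what follows. The function name `apply_backtracking` and the token-list representation are assumptions made for this example, not the authors' implementation.

```python
RESET = "[RESET]"  # special token named in the summary above


def apply_backtracking(tokens):
    """Return the tokens that survive once every [RESET] has been honored.

    Hypothetical helper: each time the model emits [RESET], the partial
    generation so far is discarded and the response starts over.
    """
    kept = []
    for tok in tokens:
        if tok == RESET:
            kept.clear()  # "undo" the problematic prefix
        else:
            kept.append(tok)
    return kept


generated = ["Sure", ",", "here", "is", RESET, "I", "can't", "help", "with", "that"]
final_response = apply_backtracking(generated)
```

Only the text after the last reset reaches the user, which is why the model can begin an unsafe continuation and still end with a safe response.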
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Recent advances in LLMs have enabled task automation through Python code, but existing benchmarks mainly focus on simple, self-contained tasks. To assess LLMs’ ability to handle more practical challenges requiring diverse and compositional function use, the authors introduce BigCodeBench—a benchmark covering 1,140 tasks across 139 libraries and 7 domains. Each task includes rigorous testing with high branch coverage, and a variant, BigCodeBench-Instruct, reformulates instructions for natural language evaluation. Results from testing 60 LLMs reveal significant performance gaps, highlighting that current models struggle to follow complex instructions and compose function calls accurately compared to human performance.
Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance
LLMs are expected to follow user-provided context, especially when it contains new or conflicting information. While instruction finetuning should improve this ability, the authors uncover a surprising failure mode called context-parametric inversion: models initially rely more on input context, but this reliance decreases as finetuning continues, even as benchmark performance improves. Through controlled experiments and theoretical analysis, the authors trace the cause to training examples where context aligns with pretraining knowledge, reinforcing parametric reliance. They suggest mitigation strategies and highlight this as a key challenge in instruction tuning.
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Embodied tasks demand fine-grained 3D perception, which is difficult to achieve due to limited high-quality 3D data. To address this, the authors propose a method that leverages the Segment Anything Model (SAM) for online 3D instance segmentation by transforming 2D masks into 3D-aware queries. Their approach enables real-time object matching across video frames and efficient inference using a similarity matrix. Experiments across multiple datasets show that the method outperforms offline alternatives and generalizes well to new settings with minimal data.
LLM-SR: Scientific Equation Discovery via Programming with Large Language Models
Mathematical equations are remarkably effective at describing natural phenomena, but discovering them from data is challenging due to vast combinatorial search spaces. Existing symbolic regression methods often overlook domain knowledge and rely on limited representations. To address this, the authors propose LLM-SR, a novel approach that uses Large Language Models to generate equation hypotheses informed by scientific priors and refines them through evolutionary search. Evaluated across multiple scientific domains, LLM-SR outperforms existing methods, particularly in generalization, by efficiently exploring the equation space and producing accurate, interpretable models.
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Self-improvement in Large Language Models involves the model verifying its outputs, filtering data accordingly, and using the refined data for further learning. While effective in practice, there has been little theoretical grounding for this technique. This work presents a comprehensive study of LLM self-improvement, introducing a formal framework centered on the generation-verification gap—a key quantity that governs self-improvement. Experiments reveal that this gap scales consistently with pretraining FLOPs across tasks and model families. The authors also explore when and how iterative self-improvement works and offer insights and strategies to enhance it.
On the Benefits of Memory for Modeling Time-Dependent PDEs
Data-driven methods offer an efficient alternative to traditional numerical solvers for PDEs, but most existing approaches assume Markovian dynamics, limiting their effectiveness when input signals are distorted. Inspired by the Mori-Zwanzig theory, the authors propose MemNO, a Memory Neural Operator that explicitly incorporates past states using structured state-space models and the Fourier Neural Operator. MemNO demonstrates strong performance on various PDE families, especially on low-resolution inputs, achieving over six times lower error than memoryless baselines.
On the Identification of Temporal Causal Representation with Instantaneous Dependence
This work introduces IDOL (Identification framework for Instantaneous Latent dynamics), a method designed to identify latent causal processes in time series data, even when instantaneous relationships are present. Unlike existing methods that require interventions or grouping of observations, IDOL imposes a sparse influence constraint, allowing both time-delayed and instantaneous causal relations to be captured. Through a temporally variational inference architecture and gradient-based sparsity regularization, IDOL effectively estimates latent variables. Experimental results show that IDOL can identify latent causal processes in simulations and real-world human motion forecasting tasks, demonstrating its practical applicability.
Progressive distillation induces an implicit curriculum
This work explores the concept of progressive distillation, where a student model learns from intermediate checkpoints of a teacher model, rather than just the final model. The authors identify an “implicit curriculum” that emerges through these intermediate checkpoints, which accelerates the student’s learning and provides a sample complexity benefit. Using sparse parity as a sandbox, they demonstrate that this curriculum imparts valuable learning steps that are unavailable from the final teacher model. The study extends this idea to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world datasets, showing that the teacher progressively teaches the student to capture longer contexts. Both theoretical and empirical results highlight the effectiveness of progressive distillation across different tasks.
Scaling Laws for Precision
This work introduces precision-aware scaling laws that extend traditional scaling frameworks to account for the effects of low-precision training and inference in language models. The authors show that lower precision effectively reduces a model’s usable parameter count, enabling predictions of performance degradation due to quantization. For inference, they find that post-training quantization causes increasing degradation with more pretraining data, potentially making additional training counterproductive. Their unified framework predicts loss across varying precisions and suggests that training larger models in lower precision may be more compute-efficient. These predictions are validated on over 465 pretraining runs, including models up to 1.7B parameters.
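One way to write down the idea of a reduced "usable parameter count" is sketched below. It assumes a Chinchilla-style loss in parameters $N$ and data $D$, plus an exponential saturation in the training precision $P$ (in bits). The exact functional form and the fitted constants $A, B, E, \alpha, \beta, \gamma$ are assumptions of this sketch and are not given in the summary above.

```latex
% Illustrative form only: the functional shape is an assumption of this sketch.
\begin{align}
  N_{\mathrm{eff}}(N, P) &= N \left(1 - e^{-P/\gamma}\right), \\
  L(N, D, P) &= A \, N_{\mathrm{eff}}(N, P)^{-\alpha} + B \, D^{-\beta} + E .
\end{align}
```

Under this form, $N_{\mathrm{eff}} \to N$ as $P$ grows, so high-precision training recovers the usual scaling law, while low $P$ behaves like training a smaller model.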
Self-Improvement in Language Models: The Sharpening Mechanism
This paper presents a theoretical framework for understanding how LLMs can self-improve by using themselves as verifiers to refine their own outputs, a process the authors call “sharpening.” The key insight is that LLMs are often better at judging response quality than generating high-quality responses outright, so sharpening helps concentrate probability mass on better sequences. The paper analyzes two families of self-improvement algorithms: one based on supervised fine-tuning (SFT) and one on reinforcement learning (RLHF). The authors show that while the SFT-based approach is optimal under certain conditions, the RLHF-based approach can outperform it by actively exploring beyond the model’s existing knowledge.
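The SFT-style variant can be pictured as best-of-N selection followed by fine-tuning on the selected responses. The sketch below covers only the selection step. The names `best_of_n`, `build_sft_dataset`, `toy_sample`, and `toy_self_reward` are hypothetical, and the toy sampler and scores stand in for a real model and its self-assessed reward.

```python
import itertools


def best_of_n(prompt, sample, self_reward, n):
    """Draw n responses and keep the one the model itself scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: self_reward(prompt, response))


def build_sft_dataset(prompts, sample, self_reward, n=4):
    """Pair each prompt with its best-of-n response (the fine-tuning targets)."""
    return [(p, best_of_n(p, sample, self_reward, n)) for p in prompts]


# Toy stand-ins (assumptions): a sampler that cycles through fixed responses
# and a lookup table playing the role of the model's own quality score.
_responses = itertools.cycle(["4", "5", "22", "four-ish"])
_toy_scores = {"4": -0.1, "5": -2.0, "22": -3.0, "four-ish": -1.5}


def toy_sample(prompt):
    return next(_responses)


def toy_self_reward(prompt, response):
    return _toy_scores[response]


dataset = build_sft_dataset(["What is 2 + 2?"], toy_sample, toy_self_reward, n=4)
```

Fine-tuning on `dataset` would then shift probability mass toward responses the model already rates highly, which is the sharpening effect the summary describes.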
When Selection meets Intervention: Additional Complexities in Causal Discovery
This work tackles the often-overlooked issue of selection bias in interventional studies, where participants are selectively included based on specific criteria. Existing causal discovery methods typically ignore this bias, leading to inaccurate conclusions. To address this, the authors introduce a novel graphical model that distinguishes between the observed world with interventions and the counterfactual world where selection occurs. They develop a sound algorithm that identifies both causal relationships and selection mechanisms, demonstrating its effectiveness through experiments on both synthetic and real-world data.
miniCTX: Neural Theorem Proving with (Long-)Contexts
Real-world formal theorem proving relies heavily on rich contextual information, which is often absent from traditional benchmarks. To address this, the authors introduce miniCTX, a benchmark designed to test models’ ability to prove theorems using previously unseen, extensive context from real Lean projects and textbooks. Unlike prior benchmarks, miniCTX includes large repositories with relevant definitions, lemmas, and structures. Baseline experiments show that models conditioned on this broader context significantly outperform those relying solely on the local state. The authors also provide a toolkit to facilitate the expansion of the benchmark.
Spotlight Papers
ADIFF: Explaining audio difference using natural language
This paper tackles the novel task of explaining differences between audio recordings, which is important for applications like audio forensics, quality assessment, and generative audio systems. The authors introduce two new datasets and propose a three-tiered explanation framework—ranging from concise event descriptions to rich, emotionally grounded narratives—generated using large language models. They present ADIFF, a new method that improves on baselines by incorporating audio cross-projection, position-aware captioning, and multi-stage training, and show that it significantly outperforms existing audio-language models both quantitatively and via human evaluation.
Better Instruction-Following Through Minimum Bayes Risk
This paper explores how LLMs can be used as judges to evaluate and improve other LLMs. The authors show that using a method called Minimum Bayes Risk (MBR) decoding—where an LLM judge selects the best output from a set—can significantly improve model performance compared to standard decoding methods. They also find that training models on these high-quality outputs can lead to strong gains even without relying on MBR at test time, making the models faster and more efficient while maintaining or exceeding previous performance.
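A minimal sketch of MBR selection follows: each candidate is scored by its average utility against the other candidates, and the highest-scoring one is returned. In the paper the utility comes from an LLM judge. Here a simple token-overlap F1 (`token_f1`, a hypothetical stand-in) is used so the example runs on its own.

```python
def token_f1(hypothesis, reference):
    """Toy utility (assumption): F1 overlap between whitespace-token sets."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    return 2 * overlap / (len(hyp) + len(ref))


def mbr_select(candidates, utility):
    """Return the candidate with the highest mean utility against the others."""
    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best


candidates = ["the cat sat", "the cat", "cat sat", "dogs bark"]
chosen = mbr_select(candidates, token_f1)
```

The selected output is the one most consistent with the rest of the sample set, so outliers such as "dogs bark" are unlikely to be chosen.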
DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
This paper introduces DeFT, a new algorithm that speeds up how large language models handle tasks involving tree-like structures with shared text prefixes, such as multi-step reasoning or few-shot prompting. Existing methods waste time and memory by repeatedly accessing the same data and poorly distributing the workload across the GPU. DeFT solves this by smartly grouping and splitting memory usage to avoid redundant operations and better balance the work, leading to up to 3.6x faster performance on key tasks compared to current approaches.
Holistically Evaluating the Environmental Impact of Creating Language Models
This paper estimates the full environmental impact of developing large language models, including not just the final training runs but also model development and hardware manufacturing—areas typically underreported. The authors found that training a series of models released 493 metric tons of carbon emissions and used 2.769 million liters of water, even in a highly efficient data center. Notably, around half of the carbon emissions came from the development phase alone, and power usage during training varied significantly, raising concerns for energy grid planning as AI systems grow.
Language Model Alignment in Multilingual Trolley Problems
This paper evaluates how well LLMs align with human moral preferences across languages using multilingual trolley problems. The authors introduce MultiTP, a new dataset of moral dilemmas in over 100 languages based on the Moral Machine experiment, enabling cross-lingual analysis of LLM decision-making. By assessing 19 models across six moral dimensions and examining demographic correlations and prompt consistency, they uncover significant variation in moral alignment across languages—highlighting ethical biases and the need for more inclusive, multilingual approaches to responsible AI development.
Lean-STaR: Learning to Interleave Thinking and Proving
This paper introduces Lean-STaR, a framework that improves language model-based theorem proving by incorporating informal “thoughts” before each proof step. Unlike traditional approaches that rely solely on formal proof data, Lean-STaR generates synthetic thought processes using retrospective proof tactics during training. At inference time, the model generates these thoughts to guide its next action, and expert iteration further refines its performance using the Lean theorem prover. This approach boosts proof success rates and offers new insights into how structured reasoning improves formal mathematical problem solving.
MagicPIG: LSH Sampling for Efficient LLM Generation
This paper introduces MagicPIG, a new system that speeds up LLM inference by approximating attention more efficiently. While many methods assume attention is sparse and use TopK approximations, the authors show this isn’t always accurate and can hurt performance. Instead, MagicPIG uses a sampling method backed by theoretical guarantees and accelerates it using Locality Sensitive Hashing, offloading computations to the CPU to support longer inputs and larger batches without sacrificing accuracy.
Multi-Robot Motion Planning with Diffusion Models
This paper introduces a method for planning coordinated, collision-free movements for many robots using only data from individual robots. The authors combine learned diffusion models with classical planning algorithms to generate realistic, safe multi-robot trajectories. Their approach, called Multi-robot Multi-model planning Diffusion, also scales to large environments by stitching together multiple diffusion models, showing strong results in simulated logistics scenarios.
Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics
This paper explores how reinforcement learning can be used to develop drug dosing strategies for controlling cell populations that adapt over time, such as cancer cells switching between resistant and susceptible states. Traditional methods struggle when the system’s dynamics are unknown or involve memory of past environments, making optimal control difficult. The authors show that deep RL can successfully learn effective strategies even in complex, memory-based systems, offering a promising approach for real-world biomedical applications.
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
This paper explores how to improve large language models’ reasoning by giving feedback at each step of their thinking process, rather than only at the final answer. The authors introduce a method where feedback—called a process reward—is based on whether a step helps make a correct final answer more likely, as judged by a separate model (a “prover”) that can recognize progress better than the model being trained. They show both theoretically and experimentally that this strategy makes learning more efficient, leading to significantly better and faster results than traditional outcome-based feedback methods.
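The notion of rewarding progress can be sketched as the change in a prover's estimated success probability before and after each step. The function name `process_rewards` and the probability values are illustrative assumptions. In practice the probabilities would come from rollouts or a learned model.

```python
def process_rewards(success_probs):
    """Per-step reward = change in estimated probability of a correct final answer.

    success_probs[0] is the prover's estimate before any step, and
    success_probs[i] is its estimate after step i (hypothetical inputs).
    """
    return [after - before for before, after in zip(success_probs, success_probs[1:])]


# Three reasoning steps: the second one makes a correct answer less likely.
rewards = process_rewards([0.2, 0.5, 0.4, 0.9])
```

A step that lowers the chance of success receives a negative reward even if the final answer ends up correct, which is the signal outcome-only feedback cannot provide.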
SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models
This paper introduces SVDQuant, a method for significantly speeding up diffusion models by quantizing both weights and activations to 4 bits. Since such aggressive quantization can hurt image quality, the authors use a clever technique: they shift problematic “outlier” values into a separate low-rank component handled with higher precision, while the rest is processed with efficient low-bit operations. To avoid slowing things down due to extra computation, they also design a custom inference engine called Nunchaku, which merges the processing steps to minimize memory access. Together, these techniques reduce memory usage and deliver over 3x speedups without sacrificing image quality.
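The weight-side intuition can be sketched in a few lines: pull the dominant outlier structure into a low-rank term kept at full precision, then quantize only the residual to 4 bits. This is a simplified illustration under stated assumptions, not the authors' method: it uses a rank-1 term found by power iteration in place of a truncated SVD, per-tensor symmetric quantization, and a toy matrix with one outlier column. It also omits the activation handling and the fused Nunchaku kernels.

```python
import random


def quantize_4bit(matrix):
    """Symmetric per-tensor quantization to integer levels in [-7, 7]."""
    max_abs = max(abs(x) for row in matrix for x in row) or 1.0
    scale = max_abs / 7.0
    return [[round(x / scale) * scale for x in row] for row in matrix]


def rank1_component(matrix, iters=50):
    """Dominant rank-1 term via power iteration (stand-in for a truncated SVD)."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def frobenius_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) ** 0.5


random.seed(0)
# Toy weights (assumption): small values everywhere plus one outlier column.
W = [[random.uniform(-0.1, 0.1) + (5.0 if j == 3 else 0.0) for j in range(8)]
     for _ in range(8)]

# Baseline: quantize the whole matrix; outliers force a coarse scale.
err_direct = frobenius_error(W, quantize_4bit(W))

# Low-rank branch absorbs the outliers; only the residual is quantized.
low_rank = rank1_component(W)
residual = [[w - l for w, l in zip(rw, rl)] for rw, rl in zip(W, low_rank)]
q_residual = quantize_4bit(residual)
combined = [[l + q for l, q in zip(rl, rq)] for rl, rq in zip(low_rank, q_residual)]
err_combined = frobenius_error(W, combined)
```

On this toy matrix the residual has a much smaller range than `W`, so the 4-bit grid is finer and `err_combined` comes out below `err_direct`.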
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
This paper tackles the challenge of applying reinforcement learning (RL) to soft-body robotics, where simulations are usually too slow for data-hungry RL algorithms. The authors introduce SAPO, a new model-based RL algorithm that efficiently learns from differentiable simulations using analytic gradients. The authors also present Rewarped, a fast, parallel simulation platform that supports both rigid and deformable materials, demonstrating that their approach outperforms existing methods on complex manipulation and locomotion tasks.
Streaming Algorithms For $\ell_p$ Flows and $\ell_p$ Regression
This paper investigates how to solve underdetermined linear regression problems in a streaming setting, where the data arrives one column at a time and storing the full dataset is impractical. The authors develop algorithms that approximate the regression cost or output a near-optimal solution using much less memory than storing the entire dataset—particularly relevant for applications like computing flows on large graphs. They also establish space lower bounds, showing the limitations of what’s possible, and provide the first algorithms that achieve nontrivial approximations using sublinear space in various settings.
Poster Papers
Alignment, Fairness, Safety, Privacy, And Societal Considerations
Applications To Computer Vision, Audio, Language, And Other Modalities
Applications To Neuroscience & Cognitive Science
Applications To Physical Sciences (Physics, Chemistry, Biology, Etc.)
Applications To Robotics, Autonomy, Planning
Causal Reasoning
Datasets And Benchmarks
Foundation Or Frontier Models, Including LLMs
Generative Models
Infrastructure, Software Libraries, Hardware, Systems, Etc.
Interpretability And Explainable AI
Learning On Graphs And Other Geometries & Topologies
Learning Theory
Neurosymbolic & Hybrid AI Systems (Physics-informed, Logic & Formal Reasoning, Etc.)
Optimization
Other Topics In Machine Learning (I.e., None Of The Above)
Probabilistic Methods (Bayesian Methods, Variational Inference, Sampling, Uncertainty Quantification, Etc.)
Reinforcement Learning
Transfer Learning, Meta Learning, And Lifelong Learning
Unsupervised, Self-supervised, Semi-supervised, And Supervised Representation Learning