ETH AI Digest: #1
Self-correcting language models, robots learning from human videos, and benchmarking LLM-generated code
In this week’s digest:
Language Models That Fix Themselves — GIDD introduces a diffusion framework enabling language models to identify and correct their own mistakes without being explicitly trained to do so
Robots Learning from Videos — VidBot transforms everyday human videos into robotic manipulation skills, bridging the gap between human demonstration and robotic execution
The Security Gap in AI-Generated Code — BaxBench reveals critical vulnerabilities in backend code produced by even the most advanced LLMs
Selected Papers of the Week
1. Generalized Interpolating Discrete Diffusion
Beyond masked diffusion: this framework enables language models to identify and fix their own mistakes.

✍️ Authors: Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
🏛️ Lab: Data Analytics Lab
⚡ Summary
Autoregressive language models cannot revise text they have already generated, so early mistakes persist in the final output.
This paper introduces Generalized Interpolating Discrete Diffusion (GIDD), a framework that extends masked diffusion by allowing flexible combinations of masking and uniform noise during training.
Models trained with GIDD achieve state-of-the-art performance among diffusion language models and, most importantly, develop the ability to identify and correct their own mistakes without explicit supervision.
This self-correction capability addresses a fundamental limitation of autoregressive models and opens new possibilities for more reliable AI-generated content.
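To make the idea concrete, here is a minimal sketch of a GIDD-style forward corruption step, assuming a simplified noise schedule (the function and parameter names are illustrative, not taken from the paper's code): at noise level t, each token is kept, masked, or replaced by a random vocabulary token, with the mask/uniform mix controlled by a single parameter.

```python
import torch

def gidd_forward_noise(tokens, t, mask_id, vocab_size, p_uniform=0.2):
    """Simplified GIDD-style corruption (an illustrative sketch, not the paper's exact schedule).

    At noise level t in [0, 1], each token survives with probability 1 - t;
    corrupted positions become [MASK] or a uniformly random vocabulary token,
    mixed by p_uniform. Setting p_uniform = 0 recovers pure masked diffusion.
    """
    corrupt = torch.rand(tokens.shape) < t                      # positions to corrupt
    use_uniform = torch.rand(tokens.shape) < p_uniform          # uniform noise vs. masking
    random_tokens = torch.randint_like(tokens, vocab_size)      # uniform replacements
    noised = torch.where(corrupt & use_uniform, random_tokens, tokens)
    noised = torch.where(corrupt & ~use_uniform, torch.full_like(tokens, mask_id), noised)
    return noised
```

Intuitively, the uniform-noise component is what makes self-correction plausible: the model is trained to see plausible-but-wrong tokens rather than only masks, so it learns to detect and replace them.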
2. VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
Learning robot skills from everyday human videos.

✍️ Authors: Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, Stefan Leutenegger
🏛️ Lab: Computer Vision and Geometry Lab
⚡ Summary
VidBot addresses the challenge of teaching robots manipulation skills without expensive physical demonstrations by learning from everyday human videos.
The framework extracts 3D hand trajectories from monocular videos using depth models and structure-from-motion, then employs a coarse-to-fine affordance model to generate interaction trajectories.
Test-time cost guidance adapts trajectories to new environments and robot embodiments, enabling zero-shot transfer.
Experiments show VidBot outperforms existing methods by 20% across 13 manipulation tasks and successfully transfers to real robots without additional training.
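The test-time cost guidance is the piece that adapts a generic trajectory to a specific scene. Below is a minimal sketch of one way such guidance can work, assuming differentiable cost terms; the names and the optimization loop are illustrative assumptions, not VidBot's actual implementation.

```python
import torch

def cost_guided_refinement(trajectory, cost_fns, steps=50, lr=1e-2):
    """Refine a proposed end-effector trajectory by descending a weighted
    sum of differentiable costs (e.g., goal distance, path smoothness).
    An illustrative sketch, not VidBot's exact procedure."""
    traj = trajectory.clone().requires_grad_(True)         # (T, 3) waypoints
    opt = torch.optim.Adam([traj], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(w * fn(traj) for w, fn in cost_fns)     # weighted cost terms
        loss.backward()
        opt.step()
    return traj.detach()

# Hypothetical usage: pull the final waypoint toward a goal while keeping the path smooth.
goal = torch.tensor([0.3, 0.1, 0.8])
cost_fns = [(1.0, lambda t: (t[-1] - goal).pow(2).sum()),
            (0.1, lambda t: (t[1:] - t[:-1]).pow(2).sum())]
refined = cost_guided_refinement(torch.zeros(20, 3), cost_fns)
```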
3. BaxBench: Can LLMs Generate Correct and Secure Backends?
Even top LLMs fail to create secure, deployment-ready backend applications.

✍️ Authors: Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
🏛️ Lab: Secure, Reliable, and Intelligent Systems (SRI) Lab
⚡ Summary
Current LLMs fall short when tasked with generating production-ready backend code, as revealed by BaxBench, a new benchmark testing both functionality and security across 392 tasks.
Even flagship models like OpenAI's o3-mini achieve only 35% correct and secure code generation, with roughly half of the functionally correct solutions containing exploitable vulnerabilities.
Performance varies significantly across programming languages and frameworks, with models performing better on popular languages like Python and JavaScript.
The findings indicate that LLMs require substantial improvement before they can be reliably used for autonomous backend development in real-world applications.
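The benchmark's headline metric counts a solution only if it passes the functional tests and withstands the security exploits. A minimal sketch of that joint scoring follows; the data layout is an assumption for illustration, not BaxBench's actual harness.

```python
from dataclasses import dataclass

@dataclass
class Result:
    functional: bool   # passed all functional end-to-end tests
    exploited: bool    # at least one security exploit succeeded

def correct_and_secure_rate(results: list[Result]) -> float:
    """Fraction of generated backends that are both functionally correct and
    unexploited; a simplified version of a joint correctness-and-security score."""
    if not results:
        return 0.0
    return sum(r.functional and not r.exploited for r in results) / len(results)

# Hypothetical numbers echoing the paper's finding: 6/10 solutions pass the tests,
# but half of those are exploitable, leaving a 30% correct-and-secure rate.
results = [Result(True, False)] * 3 + [Result(True, True)] * 3 + [Result(False, False)] * 4
print(correct_and_secure_rate(results))  # 0.3
```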
Other noteworthy articles
Fixing the RANSAC Stopping Criteria: A simple mathematical correction to RANSAC's stopping criterion dramatically improves model estimation in challenging computer vision scenarios (the classical rule is recalled after this list)
Mixtera: A Data Plane for Foundation Model Training: A system for flexible data mixing that boosts model performance without bottlenecking training
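For context, the classical stopping rule that such corrections revisit: RANSAC is run for enough iterations N that, with confidence p, at least one minimal sample of s points is drawn purely from inliers, given inlier ratio w. The article's specific fix to how this criterion is applied is detailed in the paper itself.

```latex
N \;\ge\; \frac{\log(1 - p)}{\log\!\left(1 - w^{s}\right)}
```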