ETH AI Digest: #12
Multimodal Vision Models, Real-Time Hallucination Detection, and Memory-Efficient LLM Training
In this week's digest:
Unified Egocentric Intelligence — Vision model combines video, depth, gaze, and motion data to achieve 30x faster perception while maintaining accuracy.
Efficient Hallucination Detection — New RAUQ system spots LLM inaccuracies by tracking attention patterns with minimal computational overhead.
Smarter Model Optimization — Researchers prove that gradient-free optimization naturally finds better solutions for large language models by converging to flat minima.
Selected Papers of the Week
1. EgoM2P: Egocentric Multimodal Multitask Pretraining
First multitask foundation model integrating four modalities for faster, more accurate egocentric perception.

✍️ Authors: Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
🏛️ Lab: Computer Vision and Geometry Lab, Computer Vision and Learning Group
⚡ Summary
EgoM2P addresses the challenge of building unified models for egocentric vision by integrating RGB video, depth, gaze, and camera trajectories through temporal tokenizers and masked modeling.
The model performs multiple tasks, including camera tracking, gaze prediction, depth estimation, and video synthesis, matching or outperforming specialist models while being up to 30x faster.
It effectively handles missing modalities in heterogeneous datasets and demonstrates strong generalization to unseen data, establishing a foundation for applications in augmented reality and robotics.
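To give a rough feel for the masked-modeling recipe, here is a minimal sketch in PyTorch. It is not the authors' architecture: the tokenizers, dimensions, modality IDs, and masking ratio below are placeholders. The core idea it illustrates is that every modality is reduced to discrete tokens in one shared sequence, a random subset is masked and reconstructed, and absent modalities can simply be left out of the sequence.

```python
import torch
import torch.nn as nn

class MaskedMultimodalModel(nn.Module):
    """Toy masked multimodal pretraining sketch (illustrative, not EgoM2P)."""
    def __init__(self, vocab_size=1024, dim=256, n_modalities=4):
        super().__init__()
        self.mask_id = vocab_size                       # extra id reserved for [MASK]
        self.token_emb = nn.Embedding(vocab_size + 1, dim)
        self.mod_emb = nn.Embedding(n_modalities, dim)  # tags each token with its modality
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, modality_ids, mask):
        x = tokens.masked_fill(mask, self.mask_id)      # hide the tokens to be predicted
        h = self.encoder(self.token_emb(x) + self.mod_emb(modality_ids))
        return self.head(h)

model = MaskedMultimodalModel()
tokens = torch.randint(0, 1024, (2, 64))       # stand-in for per-modality tokenizer outputs
modality_ids = torch.randint(0, 4, (2, 64))    # e.g. 0=RGB, 1=depth, 2=gaze, 3=camera pose
mask = torch.rand(2, 64) < 0.5                 # placeholder masking ratio
logits = model(tokens, modality_ids, mask)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss on masked positions only
```

Because the loss is computed only at masked positions, datasets missing a modality contribute whatever token streams they do have, which is how heterogeneous training data can be absorbed by one model.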
2. Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs
RAUQ: Detecting LLM hallucinations by analyzing attention drops in uncertainty-aware heads.

✍️ Authors: Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, Artem Shelmanov
🏛️ Lab: Language, Reasoning and Education Lab
⚡ Summary
Large language models often produce fluent but factually incorrect outputs known as hallucinations, creating a need for efficient detection methods.
RAUQ identifies specific "uncertainty-aware" attention heads where drops in attention to preceding tokens signal hallucinations, and uses these patterns to compute uncertainty scores.
This unsupervised approach outperforms existing methods across 12 tasks and 4 LLMs while adding less than 1% computational overhead.
Unlike sampling-based or supervised alternatives, RAUQ requires no task-specific labels and works with a single forward pass, making it practical for real-time applications.
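To make the mechanism concrete, here is a minimal sketch of the idea. It is a simplification, not the paper's exact formula: the head-selection rule and the `alpha` mixing weight are illustrative. Per layer, it picks the head that attends most strongly to the immediately preceding token, and treats drops in that attention, combined with low token probability, as a signal of uncertainty.

```python
import torch

def rauq_style_uncertainty(attentions, token_logprobs, alpha=0.5):
    """Illustrative attention-based uncertainty score in the spirit of RAUQ.

    attentions: per-layer tensors [num_heads, seq_len, seq_len] from a
        single forward pass of the decoder.
    token_logprobs: [seq_len] log-probabilities of the generated tokens.
    alpha: illustrative mixing weight (not the paper's parameterization).
    """
    seq_len = token_logprobs.shape[0]
    rows, cols = torch.arange(1, seq_len), torch.arange(seq_len - 1)
    layer_scores = []
    for attn in attentions:
        prev_attn = attn[:, rows, cols]             # attention of token t to token t-1
        head = prev_attn.mean(dim=1).argmax()       # pick the "uncertainty-aware" head
        layer_scores.append(1.0 - prev_attn[head])  # attention drop -> higher uncertainty
    attn_term = torch.stack(layer_scores).mean(dim=0)
    prob_term = 1.0 - token_logprobs[1:].exp()      # low token confidence -> higher uncertainty
    return (alpha * attn_term + (1 - alpha) * prob_term).mean()

# toy usage with random tensors standing in for real model outputs
attns = [torch.softmax(torch.randn(8, 16, 16), dim=-1) for _ in range(4)]
logp = torch.log_softmax(torch.randn(16, 100), dim=-1).max(dim=-1).values
print(rauq_style_uncertainty(attns, logp))
```

Everything here comes from tensors the model already produces in one forward pass, which is the source of the sub-1% overhead the paper reports.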
3. Zeroth-Order Optimization Finds Flat Minima
Gradient-free methods implicitly minimize Hessian trace, explaining their success in memory-constrained LLM fine-tuning.

✍️ Authors: Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, Michael Muehlebach, Niao He
🏛️ Lab: Optimization & Decision Intelligence Group
⚡ Summary
This paper reveals that zeroth-order optimization methods, which query only function values rather than gradients, naturally converge to flat minima.
The authors prove that the two-point gradient estimator implicitly regularizes the trace of the Hessian, and they provide convergence guarantees for convex and smooth functions.
Experiments on binary classification and language-model fine-tuning confirm that zeroth-order methods consistently reduce the Hessian trace compared to gradient descent.
This finding helps explain why zeroth-order optimization works well for fine-tuning large language models despite its dimension-dependent theoretical complexity.
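For intuition, here is a minimal sketch of a two-point estimator of the kind the analysis centers on, in one common forward-difference form; the smoothing parameter `mu` and the step size are illustrative choices, not values from the paper.

```python
import torch

def two_point_grad_estimate(loss_fn, x, mu=1e-3):
    """Two-point zeroth-order gradient estimator:
    g = (f(x + mu*u) - f(x)) / mu * u, with u ~ N(0, I).
    Uses only two function evaluations; no backpropagation required.
    """
    u = torch.randn_like(x)
    return (loss_fn(x + mu * u) - loss_fn(x)) / mu * u

# minimize a toy strongly convex quadratic with ZO-SGD
loss = lambda p: (p - 1.0).pow(2).sum()
x = torch.zeros(10)
for _ in range(2000):
    x = x - 0.01 * two_point_grad_estimate(loss, x)
print(loss(x))  # should end up close to 0
```

The memory appeal for LLM fine-tuning is visible in the sketch: only forward passes are needed, so no activations or gradients have to be stored, and the paper's result says the random-direction perturbation in this estimator is also what implicitly biases the iterates toward flat (low Hessian trace) minima.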
Other noteworthy articles
Syntactic Control of Language Models by Posterior Inference: Precise syntactic control of AI text generation through posterior inference and sequential Monte Carlo
MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem: Revolutionizing genome analysis by processing raw DNA signals directly inside storage systems
Educators' Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting: Educators rate AI tutors higher on empathy, scaffolding, and conciseness in blind comparison study
From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning: Training LLMs to guide students through problems rather than giving away answers