Gaze-LLE: A New AI Model for Gaze Target Estimation Built on Top of a Frozen Visual Foundation Model

Accurately predicting where a person is looking in a scene, a task known as gaze target estimation, is a significant challenge in AI research. Inferring gaze direction requires integrating complex cues such as head orientation and scene context. Traditional methods use multi-branch architectures that process scene and head features separately before integrating them with auxiliary inputs such as depth and pose. However, these methods are computationally intensive, hard to train, and often fail to generalize well across datasets. Overcoming these issues is necessary for progress in applications such as understanding human behavior, robotics, and assistive technologies.

Existing gaze estimation methods depend heavily on multi-branch pipelines, in which separate encoders handle the scene and head features and fusion modules then combine these inputs. To improve results, many of these models use additional signals, such as pose, depth, and other auxiliary features, each obtained from a dedicated module. These approaches have several limitations. First, their high computational cost makes real-time deployment impractical. Second, they generally require large amounts of labeled training data, which is labor-intensive to produce and difficult to scale. Third, their reliance on specialized encoders and supplementary inputs limits how well they transfer to new environments and datasets.

To address these issues, researchers from the Georgia Institute of Technology and the University of Illinois Urbana-Champaign introduced Gaze-LLE, a streamlined and efficient framework for gaze target estimation. Gaze-LLE replaces complex multi-branch architectures with a frozen DINOv2 visual encoder and a minimalist decoder module. The framework uses a single backbone for feature extraction and adds a head positional prompting mechanism that makes the gaze estimate specific to a chosen individual in the scene. One of its primary contributions is a sharp reduction in trainable parameters, which translates into 95% fewer computations compared with traditional methods. Gaze-LLE also shows that large-scale transformer-based encoders can be adapted successfully to this task. It estimates gaze accurately without complex auxiliary models and, through a simple and scalable architecture, maintains strong performance across a range of datasets and tasks with minimal adjustments.

The architecture of Gaze-LLE comprises two main components. First, a frozen DINOv2 visual encoder extracts robust features from the input image, which are then projected into a lower-dimensional space via a linear layer for efficient processing. Second, a lightweight gaze decoder integrates these scene features with a head position embedding that encodes the location of the individual being observed. This mechanism allows the model to focus on the specific source of gaze. The gaze decoder consists of three transformer layers that refine the features. It produces a gaze heatmap indicating likely gaze targets, as well as an in-frame classification that determines whether the gaze falls within the observable frame. The model is trained with a straightforward objective, a pixel-wise binary cross-entropy loss, so it can be tuned without complex multi-task objectives. The model was evaluated on three benchmark datasets: GazeFollow, VideoAttentionTarget, and ChildPlay.
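To make the head prompting and the training objective more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes the head position is supplied as a normalized bounding box, that prompting means adding one embedding vector to the scene tokens covered by that box, and that the heatmap already holds probabilities in [0, 1]. The function names, grid size, and embedding values are hypothetical and chosen only for illustration.

```python
import numpy as np


def head_mask_on_token_grid(bbox, grid_h, grid_w):
    """Binary mask over the token grid marking cells overlapped by the head box.

    bbox is (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates
    (an assumed input format, not taken from the article).
    """
    x_min, y_min, x_max, y_max = bbox
    # Map normalized coordinates to token-cell indices (start inclusive, end exclusive).
    c0, c1 = int(np.floor(x_min * grid_w)), int(np.ceil(x_max * grid_w))
    r0, r1 = int(np.floor(y_min * grid_h)), int(np.ceil(y_max * grid_h))
    mask = np.zeros((grid_h, grid_w), dtype=np.float32)
    mask[r0:r1, c0:c1] = 1.0
    return mask


def add_head_prompt(scene_tokens, mask, head_embedding):
    """Add one embedding vector to every scene token that lies inside the head mask.

    scene_tokens: (grid_h, grid_w, dim), mask: (grid_h, grid_w), head_embedding: (dim,)
    """
    return scene_tokens + mask[..., None] * head_embedding


def pixelwise_bce(pred_heatmap, target_heatmap, eps=1e-7):
    """Mean binary cross-entropy over all heatmap pixels."""
    p = np.clip(pred_heatmap, eps, 1.0 - eps)  # avoid log(0)
    loss = -(target_heatmap * np.log(p) + (1.0 - target_heatmap) * np.log(1.0 - p))
    return float(loss.mean())


# Toy stand-ins: random "frozen encoder" features on a 4x4 token grid.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 8
scene_tokens = rng.normal(size=(grid_h, grid_w, dim)).astype(np.float32)
head_embedding = np.ones(dim, dtype=np.float32)  # would be learned in practice

mask = head_mask_on_token_grid((0.25, 0.0, 0.75, 0.5), grid_h, grid_w)
prompted = add_head_prompt(scene_tokens, mask, head_embedding)

# Loss for an uninformative prediction (0.5 everywhere) against a one-hot target.
target = np.zeros((8, 8), dtype=np.float32)
target[2, 5] = 1.0
uniform_loss = pixelwise_bce(np.full((8, 8), 0.5, dtype=np.float32), target)
```

In this sketch only `head_embedding` (and, in a full model, the decoder layers) would be trainable, while the encoder features stay fixed. That is the property that keeps the number of learned parameters small.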

Gaze-LLE achieves state-of-the-art performance across multiple benchmarks with significantly fewer parameters and faster training times. On the GazeFollow dataset, it achieves an AUC of 0.958 and an average L2 error of 0.099, besting prior methods in both precision and computational efficiency. Training is remarkably efficient: the model converges in less than 1.5 GPU hours, far faster than traditional multi-branch architectures. Gaze-LLE also generalizes well, retaining high performance on other datasets, such as ChildPlay and GOO-Real, even without fine-tuning. These results show that a frozen foundation model inside an optimized architecture can deliver accurate and flexible gaze estimation.
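As a rough illustration of what an L2 error on a gaze heatmap measures, the sketch below takes the peak of a predicted heatmap as the predicted gaze point and computes its Euclidean distance to a ground-truth point. It assumes both points are expressed in normalized [0, 1] image coordinates. The benchmark's exact evaluation protocol (for example, how multiple annotations per image are handled) is not described in the article and is not reproduced here, and the helper names are hypothetical.

```python
import numpy as np


def heatmap_argmax_xy(heatmap):
    """Return the (x, y) location of the heatmap peak, normalized to [0, 1]."""
    h, w = heatmap.shape
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Use cell centers so the result does not depend on heatmap resolution.
    return (col + 0.5) / w, (row + 0.5) / h


def l2_error(pred_xy, true_xy):
    """Euclidean distance between two points in normalized image coordinates."""
    return float(np.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1]))


# Toy heatmap with a single peak at row 16, column 48 of a 64x64 grid.
heatmap = np.zeros((64, 64), dtype=np.float32)
heatmap[16, 48] = 1.0

pred_xy = heatmap_argmax_xy(heatmap)
err = l2_error(pred_xy, (0.75, 0.25))
```

Under this convention, an error of 0.099 would correspond to a predicted point landing roughly a tenth of the image size away from the annotated target.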

In summary, Gaze-LLE redefines gaze target estimation with a streamlined and effective framework that combines a frozen foundation visual encoder with a head positional prompting mechanism. Free from the intricacies of multi-branch architectures, it achieves higher accuracy, better efficiency, and greater scalability. Its ability to generalize across datasets also makes it promising for further research on human behavior and related fields, and it sets a new benchmark for gaze estimation research.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Gaze-LLE: A New AI Model for Gaze Target Estimation Built on Top of a Frozen Visual Foundation Model appeared first on MarkTechPost.
