AI models struggle with “lost in the middle” issue when processing large image sets

Summary

New research from UC Berkeley shows that current AI systems have trouble extracting relevant information from large collections of images. The study highlights weaknesses in existing large multimodal models (LMMs) when dealing with visual data.

A team at Berkeley Artificial Intelligence Research (BAIR) has created “Visual Haystacks” (VHs), a benchmark that tests AI models’ ability to process large volumes of images. It includes about 1,000 binary question-answer pairs, with each question’s image set containing between 1 and 10,000 images.

The benchmark consists of two tasks: In the “single-needle” task, only one relevant “needle” image is hidden in the “haystack” of images. In the “multi-needle” task, there are two to five relevant images. Questions ask if a specific object appears in one, all, or any of the relevant images.
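
To make the task format concrete, here is a minimal sketch of what a VHs item might look like as data. The field names, file names, and questions are illustrative assumptions for this article, not the benchmark’s actual schema.

```python
# Hypothetical sketch of Visual Haystacks items. All field names and
# values are illustrative assumptions, not the benchmark's real format.

single_needle_example = {
    "images": [f"img_{i:05d}.jpg" for i in range(1000)],  # the "haystack"
    "needle_indices": [412],           # one relevant image hidden inside
    "question": "For the image with a dog, is there a frisbee?",
    "answer": "yes",                   # binary ground truth
}

multi_needle_example = {
    "images": [f"img_{i:05d}.jpg" for i in range(1000)],
    "needle_indices": [87, 412, 903],  # two to five relevant images
    "question": "For all images with a dog, is there a frisbee?",
    "answer": "no",
}
```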

The researchers tested various models, including open-source and proprietary ones like LLaVA-v1.5, GPT-4o, Claude 3 Opus, and Gemini-v1.5-pro. They also tested a two-stage baseline that first generates a caption for each image with LLaVA and then answers the question from those captions using Llama 3.
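
That two-stage baseline can be sketched in a few lines. The `caption_image()` and `ask_llm()` functions below are hypothetical stand-ins for calls to LLaVA and Llama 3; the researchers’ actual prompts and interfaces may differ.

```python
# Minimal sketch of a caption-then-answer baseline, assuming placeholder
# wrappers for the captioning model (LLaVA) and the text LLM (Llama 3).

def caption_image(image_path: str) -> str:
    """Placeholder: run a captioning model to describe one image."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder: query a text-only LLM."""
    raise NotImplementedError

def two_stage_answer(image_paths: list[str], question: str) -> str:
    # Stage 1: turn every image into text, so no visual tokens reach the LLM.
    captions = [f"Image {i}: {caption_image(p)}"
                for i, p in enumerate(image_paths)]
    # Stage 2: answer the binary question from the captions alone.
    prompt = ("Based on these image descriptions, answer yes or no.\n"
              + "\n".join(captions)
              + f"\nQuestion: {question}")
    return ask_llm(prompt)
```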


Results show that models struggle to filter out irrelevant visual information. Their performance on the single-needle task drops significantly as the number of images increases.

Single-needle precision. | Image: Tsung-Han et al.

Interestingly, simple two-stage approaches (generating captions first, then evaluating them with a language model) outperform all tested LMMs on the multi-needle task. This suggests that LMMs have difficulty integrating information across multiple images, which once again calls into question the practical benefit of large context windows.

Multi-needle precision. | Image: Tsung-Han et al.

The models are also very sensitive to image position in the sequence. If the relevant image is in the middle, performance is much worse than if it’s at the beginning or end.

Existing LMMs show a drop in performance of up to 41% if the image to be found is not ideally positioned. Gray boxes: exceeding the context length. | Image: Tsung-Han et al.

This mirrors the “lost in the middle” phenomenon seen in language processing, where models focus on the start and end of a document while more or less ignoring the middle. LLMs also have a problem drawing meaningful conclusions from large amounts of text, according to a recent study.
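
A position-sensitivity experiment of this kind can be probed with a simple sweep: place the needle image at each position in the haystack and measure accuracy. The loop below is a sketch under assumed interfaces (the `model` callable and trial count are placeholders), not the paper’s exact protocol.

```python
# Sketch of a "lost in the middle" probe: slide the needle through the
# haystack and record accuracy per position. `model` is a hypothetical
# callable mapping (image list, question) -> "yes"/"no".
import random

def position_sweep(model, needle, distractors, question, answer, n_trials=50):
    accuracy_by_position = {}
    for pos in range(len(distractors) + 1):
        correct = 0
        for _ in range(n_trials):
            haystack = list(distractors)
            random.shuffle(haystack)       # randomize the distractor order
            haystack.insert(pos, needle)   # place the needle at `pos`
            if model(haystack, question) == answer:
                correct += 1
        accuracy_by_position[pos] = correct / n_trials
    return accuracy_by_position  # the paper suggests a dip in the middle
```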

To address these issues, the BAIR team developed MIRAGE (Multi-Image Retrieval Augmented Generation), a RAG system optimized for image processing. MIRAGE compresses visual tokens so that more images fit into the same context length, uses a retriever trained alongside the LLM fine-tuning to filter out irrelevant images, and is trained on multi-image reasoning data. This approach achieves better results on both VHs and more complex visual question-answering tasks.
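
In outline, a MIRAGE-style pipeline combines those three pieces as follows. This is a loose sketch under assumed interfaces (`encoder`, `retriever`, and `lmm` are hypothetical callables, and the threshold is an arbitrary choice), not the authors’ implementation.

```python
# Generic multi-image RAG sketch loosely following the recipe described
# above: compress each image to a few visual tokens, score relevance with
# a retriever, and pass only the surviving images to the multimodal model.
# All names are hypothetical; this is not the authors' code.

def answer_with_image_rag(images, question, encoder, retriever, lmm,
                          keep_threshold=0.5):
    # Compress each image into a short token sequence so that many more
    # images fit into the same context length.
    compressed = [encoder(img) for img in images]
    # The retriever scores each image's relevance to the question (in the
    # paper it is trained jointly with LLM fine-tuning); drop low scorers.
    relevant = [tokens for tokens in compressed
                if retriever(tokens, question) >= keep_threshold]
    # Only the retained images reach the multimodal model.
    return lmm(relevant, question)
```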


The MIRAGE image RAG system developed by the researchers can outperform LMMs without image RAG on image tasks. | Image: Tsung-Han et al.

The researchers recommend that future LMM projects use the Visual Haystacks framework to identify and fix potential weaknesses before deployment, adding that multi-image question answering is an important step toward artificial general intelligence (AGI). The benchmark is available on GitHub.

