This AI Paper from Stanford University Evaluates the Performance of Multimodal Foundation Models Scaling from Few-Shot to Many-Shot In-Context Learning (ICL)
Incorporating demonstration examples into the prompt, a technique known as in-context learning (ICL), significantly enhances large language models (LLMs) and large multimodal models (LMMs) without requiring parameter updates. Recent studies confirm the efficacy of few-shot multimodal ICL, particularly for improving LMM performance on out-of-domain tasks. With the longer context windows of advanced models such as GPT-4o and Gemini 1.5 Pro, researchers can now investigate the impact of scaling up the number of demonstration examples, a factor previously constrained by context window limitations.
Earlier work observed that LLM performance improves as the number of in-context examples grows, albeit within the limits of the context window. Recent studies extended this exploration, demonstrating gains with more than 1,000 examples, though only on text-only benchmarks. Multimodal ICL research is still emerging, with studies showing benefits for models such as GPT-4V and Gemini on out-of-domain tasks. Batch querying strategies offer efficiency gains at inference time, and recent variants have been proposed to optimize performance by exploiting the larger context windows of newer models.
To examine the potential of advanced multimodal foundation models for many-shot ICL, researchers from Stanford conducted an extensive set of experiments assessing model performance across 10 datasets covering various domains and image classification tasks, scaling the number of demonstration examples far beyond the few-shot regime.
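To make the setup concrete, here is a minimal sketch of how a many-shot multimodal ICL request can be assembled with OpenAI's Python SDK: (image, label) demonstration pairs are interleaved in a single prompt, followed by the query image. The helper names, file paths, and label set are illustrative placeholders, and the prompt wording is an assumption rather than the paper's exact template.

```python
# A hedged sketch of a many-shot multimodal ICL request via the OpenAI Python SDK.
# File paths, label names, and prompt wording are placeholders, not the paper's template.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def image_part(path: str) -> dict:
    """Encode an image file as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def build_messages(demos: list[tuple[str, str]], query_path: str, classes: list[str]) -> list[dict]:
    """Interleave (image, label) demonstration pairs, then append the query image."""
    content = [{"type": "text",
                "text": f"Classify each image into one of: {', '.join(classes)}."}]
    for path, label in demos:                      # many-shot demonstration examples
        content.append(image_part(path))
        content.append({"type": "text", "text": f"Answer: {label}"})
    content.append(image_part(query_path))         # the test query comes last
    content.append({"type": "text", "text": "Answer:"})
    return [{"role": "user", "content": content}]


class_names = ["benign", "malignant"]                               # placeholder label set
demo_set = [("demo_0.jpg", "benign"), ("demo_1.jpg", "malignant")]  # placeholder demos

response = client.chat.completions.create(
    model="gpt-4o",                                # or another long-context multimodal model
    messages=build_messages(demo_set, "query.jpg", class_names),
    temperature=0,                                 # deterministic decoding, as in the study
    seed=0,                                        # fixed seed for reproducible responses
)
print(response.choices[0].message.content)
```

Scaling from few-shot to many-shot ICL then amounts to growing the demos list toward hundreds or roughly a thousand examples, bounded only by the model's context window.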
The key findings of this study include:
1. Increasing the number of demonstration examples substantially improves model performance, with Gemini 1.5 Pro showing consistent log-linear improvements, whereas GPT-4o's gains are less stable.
2. Gemini 1.5 Pro demonstrates higher ICL data efficiency than GPT-4o across most datasets.
3. Combining multiple queries into a single request can match or exceed the performance of individual queries in the many-shot setting, while significantly reducing per-example latency and making inference more cost-effective (see the sketch after this list).
4. Batched querying notably enhances performance even in zero-shot scenarios, an effect attributed to domain and class calibration as well as to self-generated demonstration examples produced through autoregressive decoding.
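As a rough illustration of the batching idea in finding 3, the sketch below packs several numbered query images into one request that shares a single set of demonstration examples, asking the model to answer each query in turn. It reuses the hypothetical image_part helper from the earlier sketch; the prompt wording and answer format are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of batching several test queries into a single many-shot request.
# image_part(...) is the base64 data-URL helper from the earlier sketch; the prompt
# wording and answer format below are assumptions, not the paper's exact protocol.
def build_batched_messages(demos: list[tuple[str, str]],
                           query_paths: list[str],
                           classes: list[str]) -> list[dict]:
    content = [{"type": "text",
                "text": f"Classify each numbered image into one of: {', '.join(classes)}."}]
    for path, label in demos:                          # demonstrations shared by all queries
        content.append(image_part(path))
        content.append({"type": "text", "text": f"Answer: {label}"})
    for i, path in enumerate(query_paths, start=1):    # several test queries in one request
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append(image_part(path))
    content.append({"type": "text",
                    "text": "Answer every query, one line each, as 'Image <i>: <label>'."})
    return [{"role": "user", "content": content}]
```

Because the demonstration examples are sent once for the whole batch, the tokens they occupy are amortized over all queries in the request, which is where the per-example latency and cost savings come from.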
Three advanced multimodal foundation models are employed: GPT-4o, GPT-4(V)-Turbo, and Gemini 1.5 Pro, with the analysis emphasizing GPT-4o and Gemini 1.5 Pro due to their superior performance. Claude 3 Opus is excluded from the experiments because of its 20-image limit per request. Each model is accessed through its own endpoint: OpenAI's API service for GPT-4o and GPT-4(V)-Turbo, and Google Cloud's Vertex AI for Gemini 1.5 Pro. The temperature is set to zero for all models, and a fixed random seed is used to make responses deterministic. Sampling strategies ensure class balance in the demonstration and test sets across the 10 datasets spanning various domains and classification tasks, and the demonstration set is scaled up while class balance is maintained for evaluation.
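The class-balanced sampling can be pictured as drawing an equal number of examples per class with a fixed seed, roughly as in the sketch below. The data structure and per-class count are hypothetical; the study's actual datasets and splits are not reproduced here.

```python
# Minimal sketch of class-balanced sampling with a fixed seed. The pool of labeled
# examples below is a synthetic placeholder; the study's datasets and splits are not shown.
import random


def balanced_sample(examples: list[tuple[str, str]], per_class: int, seed: int = 0) -> list[tuple[str, str]]:
    """Draw up to `per_class` (image_path, label) pairs from every class."""
    rng = random.Random(seed)
    by_class: dict[str, list[tuple[str, str]]] = {}
    for path, label in examples:
        by_class.setdefault(label, []).append((path, label))
    sample: list[tuple[str, str]] = []
    for items in by_class.values():
        sample.extend(rng.sample(items, min(per_class, len(items))))
    rng.shuffle(sample)                      # avoid grouping demonstrations by class
    return sample


# placeholder pool of labeled image paths
pool = [(f"img_{i:03d}.jpg", "cat" if i % 2 else "dog") for i in range(200)]
demo_set = balanced_sample(pool, per_class=50)   # increase per_class to scale up the demonstration set
```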
Gemini 1.5 Pro shows significant performance gains on most datasets as the number of demonstration examples increases, with DrugOOD Assay as the exception. Particularly large improvements are observed on HAM10000 (+23% accuracy over zero-shot), FIVES (+29%), and EuroSAT (+38%). On 5 of the 10 datasets (FIVES, UCMerced, EuroSAT, Oxford Pets, and DTD), Gemini 1.5 Pro's performance continues to improve up to the highest number of demonstration examples considered (~1,000). GPT-4o, by contrast, improves on most datasets but less consistently, exhibiting V-shaped scaling curves on many of them. GPT-4o's performance on DrugOOD Assay also displays high variance, similar to Gemini 1.5 Pro, with peak performance at 50 demonstration examples.
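For intuition, "log-linear improvement" means accuracy grows roughly linearly in the logarithm of the number of demonstration examples, i.e. accuracy ≈ a + b·log(N). The tiny sketch below fits such a curve to made-up placeholder accuracies, not the paper's reported numbers.

```python
# Toy illustration of a log-linear scaling fit: accuracy ≈ a + b * log(N), where N is the
# number of demonstration examples. The accuracy values are made-up placeholders.
import numpy as np

num_demos = np.array([1, 10, 50, 100, 500, 1000])
accuracy = np.array([0.42, 0.51, 0.58, 0.61, 0.66, 0.69])   # placeholder, not the paper's numbers

b, a = np.polyfit(np.log(num_demos), accuracy, deg=1)        # slope b, intercept a
print(f"accuracy ~ {a:.2f} + {b:.2f} * log(N)")
```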
To recapitulate, this study assesses many-shot ICL with state-of-the-art multimodal foundation models across 10 datasets, revealing consistent performance improvements. Batching queries with many-shot ICL significantly reduces per-example latency and inference cost without sacrificing performance. These findings point to the potential of using large numbers of demonstration examples to adapt models quickly to new tasks and domains, circumventing the need for traditional fine-tuning. Future research should compare the effectiveness and data efficiency of traditional fine-tuning against many-shot ICL. Examining issues such as hallucination and bias in the context of many-shot ICL and batched queries is also crucial for refining models and mitigating biases across diverse sub-groups.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.