The AI company Galileo has announced its latest Hallucination Index, a framework that evaluates 22 leading generative AI models.
Models are tested using a metric called context adherence, which measures “closed-domain hallucinations: cases where your model said things that were not provided in the context.”
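Galileo has not published the implementation behind its metric, which relies on LLM-based evaluation. Purely as an illustration of the underlying idea (every claim in a response should be supported by the supplied context), here is a minimal Python sketch using a crude lexical-overlap heuristic; the function name context_adherence and the threshold parameter are invented for this example and are not Galileo's API.

# Illustrative sketch only: a lexical-overlap stand-in for a
# context-adherence check. Galileo's real metric is LLM-based;
# this heuristic just shows the intuition that response sentences
# should be grounded in the provided context.
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens for crude overlap comparison."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def context_adherence(context: str, response: str, threshold: float = 0.6) -> float:
    """Fraction of response sentences whose words mostly appear in the
    context; 1.0 means fully grounded, lower values suggest hallucination."""
    ctx = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 1.0
    supported = 0
    for sent in sentences:
        words = _tokens(sent)
        if words and len(words & ctx) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)

if __name__ == "__main__":
    ctx = "Claude 3.5 Sonnet scored highest on Galileo's RAG benchmark."
    good = "Claude 3.5 Sonnet scored highest on the RAG benchmark."
    bad = "The model was trained on 15 trillion tokens."  # not in the context
    print(context_adherence(ctx, good))  # high score: grounded
    print(context_adherence(ctx, bad))   # low score: likely hallucinated

A production system would replace the overlap test with a judge model that classifies each claim as supported or unsupported by the retrieved context.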
The best-performing model overall for RAG, according to the ranking, is Claude 3.5 Sonnet from Anthropic. Galileo said this model and Anthropic’s other model, Claude 3 Opus, had near-perfect scores, beating out OpenAI’s models, which won last year.
From a cost perspective, the best-performing model was Google’s Gemini 1.5 Flash, and Alibaba’s Qwen2-72B-Instruct was the best-performing open-source model overall, though in short-context RAG tests, Meta’s Llama-3-70b-instruct came out on top.
Broken down by context length, Claude 3.5 Sonnet was the best closed-source model in both short-context and large-context RAG, while Google’s Gemini-1.5-flash-001 led in medium-context RAG, with cost serving as the tiebreaker among the models that also achieved a perfect score.
“In today’s rapidly evolving AI landscape, developers and enterprises face a critical challenge: how to harness the power of generative AI while balancing cost, accuracy, and reliability. Current benchmarks are often based on academic use-cases, rather than real-world applications. Our new Index seeks to address this by testing models in real-world use cases that require the LLMs to retrieve data, a common practice in enterprise AI implementations,” says Vikram Chatterji, CEO and co-founder of Galileo. “As hallucinations continue to be a major hurdle, our goal wasn’t to just rank models, but rather give AI teams and leaders the real-world data they need to adopt the right model, for the right task, at the right price.”