
    Innodata’s Comprehensive Benchmarking of Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Hallucination Propensity

    July 9, 2024

In a recent study by Innodata, large language models (LLMs) such as Llama2, Mistral, Gemma, and GPT were benchmarked on factuality, toxicity, bias, and propensity for hallucination. The research introduced fourteen novel datasets designed to evaluate the safety of these models, focusing on their ability to produce factual, unbiased, and appropriate content. OpenAI's GPT served as the point of comparison, as it delivered the strongest performance across all safety metrics.

    The evaluation methodology revolved around assessing the models’ performance in four key areas:

    Factuality: This refers to the LLMs’ ability to provide accurate information. Llama2 showed strong performance in factuality tests, excelling in tasks that required grounding answers in verifiable facts. The datasets used for this evaluation included a mix of summarization tasks and factual consistency checks, such as the Correctness of Generated Summaries and the Factual Consistency of Abstractive Summaries.
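The study's factual-consistency datasets score whether a generated summary stays grounded in its source. The paper does not publish its scoring code, but the idea can be illustrated with a deliberately crude sketch: measure what fraction of a summary's content words are actually supported by the source text. The function name `support_score` and the token-overlap heuristic are assumptions for illustration, not Innodata's actual metric (which would typically use an entailment model rather than word overlap).

```python
def support_score(summary: str, source: str) -> float:
    """Fraction of the summary's content words that appear in the source.

    A crude lexical proxy for factual grounding: 1.0 means every content
    word in the summary is attested in the source, 0.0 means none are.
    """
    stop = {"the", "a", "an", "of", "to", "in", "and", "is", "are"}
    summary_words = [w for w in summary.lower().split() if w not in stop]
    if not summary_words:
        return 0.0
    source_words = set(source.lower().split())
    supported = sum(1 for w in summary_words if w in source_words)
    return supported / len(summary_words)
```

A real factual-consistency benchmark would replace the word-overlap check with a natural-language-inference model, but the benchmark loop around it, score each generated summary against its source and aggregate, has the same shape.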

Toxicity: Toxicity assessment involved testing the models’ ability to avoid producing offensive or inappropriate content. This was measured using prompts designed to elicit potentially toxic responses, such as paraphrasing, translation, and error correction tasks. Llama2 demonstrated a robust ability to handle toxic content, properly censoring inappropriate language when instructed. However, it was less reliable at maintaining this safety in multi-turn conversations, where user interactions extend over several exchanges.
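The multi-turn finding above suggests why per-turn measurement matters: a model can censor correctly on the first exchange yet slip later in the conversation. A minimal harness for that measurement might look like the sketch below. The names `multi_turn_safety_rate` and `model_fn`, and the banned-word check, are hypothetical simplifications; a production evaluation would use a toxicity classifier rather than a word list.

```python
def multi_turn_safety_rate(conversations, model_fn, banned):
    """Fraction of model turns, across all conversations, that contain
    no banned term.

    conversations: list of conversations, each a list of user messages.
    model_fn: callable taking the full message history (users' and
              model's turns interleaved) and returning the next reply.
    banned: set of lowercase strings treated as unsafe.
    """
    total = safe = 0
    for user_turns in conversations:
        history = []
        for user_msg in user_turns:
            history.append(user_msg)
            reply = model_fn(history)
            history.append(reply)
            total += 1
            # A reply is "safe" if none of its tokens is a banned term.
            if set(reply.lower().split()).isdisjoint(banned):
                safe += 1
    return safe / total if total else 1.0
```

Scoring per turn rather than per conversation is what exposes the degradation the study reports: a model that is safe on turn one but unsafe on turn five still drags the rate down.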

    Bias: The bias evaluation focused on detecting the generation of content with religious, political, gender, or racial prejudice. This was tested using a variety of prompts across different domains, including finance, healthcare, and general topics. The results indicated that all models, including GPT, had difficulty identifying and avoiding biased content. Gemma showed some promise by often refusing to answer biased prompts, but overall, the task proved challenging for all models tested.

    Propensity for Hallucinations: Hallucinations in LLMs are instances where the models generate factually incorrect or nonsensical information. The evaluation involved using datasets like the General AI Assistants Benchmark, which includes difficult questions that LLMs without access to external resources should be unable to answer. Mistral performed notably well in this area, showing a strong ability to avoid generating hallucinatory content. This was particularly evident in tasks involving reasoning and multi-turn prompts, where Mistral maintained high safety standards.
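Because the benchmark's hardest questions are unanswerable without external resources, the correct behavior is to abstain, and hallucination propensity can be read off as the rate at which a model answers anyway. The sketch below illustrates that framing; the function name, the `model_fn` interface, and the substring-based abstention detection are assumptions for illustration, not the study's actual implementation.

```python
def hallucination_rate(model_fn, unanswerable_questions,
                       abstain_markers=("i don't know", "cannot answer")):
    """Fraction of unanswerable questions the model answers anyway.

    For questions that cannot be answered without external resources,
    any confident answer is treated as a hallucination; a reply
    containing an abstention marker counts as safe.
    """
    hallucinated = 0
    for question in unanswerable_questions:
        reply = model_fn(question).lower()
        if not any(marker in reply for marker in abstain_markers):
            hallucinated += 1
    return hallucinated / len(unanswerable_questions)
```

On this metric, Mistral's reported strength would show up as a low rate: it abstains (or answers correctly, where verifiable) rather than fabricating content.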

    The study highlighted several key findings:

    Meta’s Llama2: This model performed exceptionally well in factuality and handling toxic content, making it a strong contender for applications requiring reliable and safe responses. However, its high propensity for hallucinations in out-of-scope tasks and its reduced safety in multi-turn interactions are areas that need improvement.

    Mistral: This model avoided hallucinations and performed well in multi-turn conversations. However, it struggled with toxicity detection and failed to manage toxic content effectively, limiting its application in environments where safety from offensive content is critical.

    Gemma: A newer model based on Google’s Gemini, Gemma displayed balanced performance across various tasks but lagged behind Llama2 and Mistral in overall effectiveness. Its tendency to refuse to answer potentially biased prompts helped it avoid generating unsafe content but limited its usability in certain contexts.

OpenAI GPT: Unsurprisingly, GPT models, particularly GPT-4, outperformed the smaller open-source models across all safety vectors. GPT-4 also significantly reduced “laziness,” the tendency to avoid completing tasks, while maintaining high safety standards. This underscores the advanced engineering and larger parameter counts of OpenAI's models, placing them in a different league from the open-source alternatives.

    The research emphasized the importance of comprehensive safety evaluations for LLMs, especially as these models are increasingly deployed in enterprise environments. The novel datasets and benchmarking tools introduced by Innodata offer a valuable resource for ongoing and future research, aiming to improve the safety and reliability of LLMs in diverse applications.

    In conclusion, while Llama2, Mistral, and Gemma show promise in different areas, significant room remains for improvement. OpenAI’s GPT models set a high benchmark for safety and performance, highlighting the potential benefits of continued advancements and refinements in LLM technology. As the field progresses, comprehensive benchmarking and rigorous safety evaluations will be essential to ensure that LLMs can be safely and effectively integrated into various enterprise and consumer applications.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

    The post Innodata’s Comprehensive Benchmarking of Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Hallucination Propensity appeared first on MarkTechPost.
