
    Sarvam AI Releases Samvaad-Hi-v1 Dataset and Sarvam-2B: A 2 Billion Parameter Language Model with 4 Trillion Tokens Focused on 10 Indic Languages for Enhanced NLP

    August 14, 2024

Sarvam AI has recently unveiled its cutting-edge language model, Sarvam-2B. This powerful model, boasting 2 billion parameters, represents a significant stride in Indic language processing. With a focus on inclusivity and cultural representation, Sarvam-2B is pre-trained from scratch on a massive dataset of 4 trillion high-quality tokens, with an impressive 50% dedicated to Indic languages. This development is particularly significant because it extends the ability to understand and generate text to languages that are historically underrepresented in AI research.

They have also introduced the Samvaad-Hi-v1 dataset, a meticulously curated collection of 100,000 high-quality English, Hindi, and Hinglish conversations. This dataset is uniquely designed with an Indic context, making it an invaluable resource for researchers and developers working on multilingual and culturally relevant AI models. Samvaad-Hi-v1 is poised to enhance the training of conversational AI systems that can understand and engage with users more naturally, and in contextually appropriate ways, across the different languages and dialects prevalent in India.
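A common first step with a conversational dataset like Samvaad-Hi-v1 is flattening each multi-turn conversation into a single chat-formatted training string. The sketch below illustrates that step; note that the `role`/`content` schema and the `<|...|>` turn markers are assumptions for illustration, since the article does not specify the dataset's actual fields or chat template.

```python
# Sketch: flattening a multi-turn conversation into one training string.
# The "role"/"content" schema and turn markers are assumptions; the
# article does not document Samvaad-Hi-v1's exact format.

def format_conversation(turns):
    """Render a list of {"role", "content"} turns as one training string."""
    parts = []
    for turn in turns:
        tag = turn["role"].upper()
        parts.append(f"<|{tag}|> {turn['content']}")
    return "\n".join(parts)

# Example Hinglish exchange of the kind the dataset is described as containing.
sample = [
    {"role": "user", "content": "Mujhe ek achhi Hindi book suggest karo."},
    {"role": "assistant", "content": "Aap 'Godan' by Premchand padh sakte hain."},
]
print(format_conversation(sample))
```

In practice the marker tokens would be chosen to match the tokenizer's special tokens for whichever model is being fine-tuned.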

    The Vision Behind Sarvam-2B

Sarvam AI’s vision with Sarvam-2B is clear: to create a robust and versatile language model that not only excels in English but also champions Indic languages. This is especially important in a country like India, where linguistic diversity is vast and the need for AI models that can effectively process and generate text in multiple languages is paramount.

    The model supports 10 Indic languages, including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. This broad language support ensures the model is accessible to many users across different linguistic backgrounds. The model’s architecture and training process have been meticulously designed to ensure it performs well across all supported languages, making it a versatile tool for developers and researchers.
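The ten supported languages map onto standard ISO 639-1 codes, which is how multilingual pipelines typically identify them. A minimal support-check helper might look like this (the code-to-language mapping is standard ISO 639-1, not taken from the article):

```python
# ISO 639-1 codes for the ten Indic languages Sarvam-2B supports.
# The mapping is standard ISO 639-1; only the language list comes
# from the announcement.
SUPPORTED_INDIC = {
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "kn": "Kannada",
    "ml": "Malayalam",
    "mr": "Marathi",
    "or": "Oriya",
    "pa": "Punjabi",
    "ta": "Tamil",
    "te": "Telugu",
}

def is_supported(lang_code: str) -> bool:
    """True if the language (or English) is covered by Sarvam-2B."""
    return lang_code in SUPPORTED_INDIC or lang_code == "en"

print(is_supported("hi"), is_supported("fr"))
```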

    Technical Excellence and Implementation

    Sarvam-2B has been trained on a balanced mix of English and Indic language data, each contributing 2 trillion tokens to the training process. This careful balance ensures that the model is equally proficient in English and the supported Indic languages. The training process involved sophisticated techniques to enhance the model’s understanding and generation capabilities, making it one of the most advanced models in its category.
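The 50/50 English/Indic split can be pictured as a token-budget sampler that draws from two document streams until each has contributed half of the total budget. The sketch below is purely illustrative of that balancing idea; it is not Sarvam's actual data pipeline, and the whitespace token count is a deliberate simplification.

```python
import itertools

# Illustrative sketch of a 50/50 token-budget mix (not Sarvam's actual
# pipeline): alternate between an English stream and an Indic stream
# until each side has spent half of the total token budget.

def mix_streams(english, indic, total_tokens):
    """Yield (source, doc) pairs, splitting the budget evenly."""
    budget = {"en": total_tokens // 2, "indic": total_tokens // 2}
    streams = {"en": iter(english), "indic": iter(indic)}
    for source in itertools.cycle(["en", "indic"]):
        if budget["en"] <= 0 and budget["indic"] <= 0:
            return
        if budget[source] <= 0:
            continue  # this side has spent its half; keep draining the other
        doc = next(streams[source])
        budget[source] -= len(doc.split())  # crude whitespace token count
        yield source, doc
```

With an 8-token budget and 2-token documents on both sides, the sampler yields two English and two Indic documents, alternating.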

    Expanding the Horizon: Complementary Models

    In addition to Sarvam-2B, Sarvam AI has also introduced three other remarkable models that complement its capabilities:

    Bulbul 1.0: A Text-to-Speech (TTS) model that supports combinations of 10 languages and six voices. This model generates natural-sounding speech, making it a valuable tool for applications requiring multilingual voice output.

    Saaras 1.0: A Speech-to-Text (STT) model that supports the same ten languages and includes automatic language identification. This model is particularly useful for transcribing spoken language into text, with the added advantage of detecting the language automatically.

    Mayura 1.0: A translation API designed to handle the complexities of translating between Indian languages and English. This model is tailored to address the nuances and unique challenges associated with Indian languages, providing more accurate and culturally relevant translations.
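Taken together, the three models suggest a natural speech-to-speech translation loop: Saaras transcribes and identifies the language, Mayura translates, and Bulbul speaks the result. The sketch below wires up that flow with stub functions; every function name, signature, and return value here is a placeholder, since the article does not document Sarvam's actual APIs.

```python
# Hypothetical speech-to-speech pipeline combining the three models.
# All names, signatures, and stub return values are placeholders for
# illustration; the article does not document Sarvam's real API.

def transcribe(audio_bytes: bytes) -> tuple[str, str]:
    """Stand-in for Saaras 1.0: returns (text, detected_language)."""
    return "नमस्ते", "hi"  # stub result for illustration

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stand-in for Mayura 1.0: Indic <-> English translation."""
    return "Hello" if target_lang == "en" else text  # stub

def synthesize(text: str, lang: str, voice: str = "default") -> bytes:
    """Stand-in for Bulbul 1.0: text-to-speech."""
    return text.encode("utf-8")  # stub: a real API would return audio

def speech_to_english_speech(audio: bytes) -> bytes:
    text, lang = transcribe(audio)         # Saaras: STT + language ID
    english = translate(text, lang, "en")  # Mayura: translate to English
    return synthesize(english, "en")       # Bulbul: speak the result
```

The value of Saaras's automatic language identification shows up here: the pipeline never needs to be told which of the ten languages the speaker used.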

    Conclusion

Sarvam AI’s launch of Sarvam-2B marks a notable milestone for language models designed for Indic languages. By dedicating half of its training data to these languages, Sarvam-2B stands out as a model that actively promotes linguistic diversity. The model’s versatility, combined with the complementary capabilities of Bulbul 1.0, Saaras 1.0, and Mayura 1.0, positions Sarvam AI as a leader in developing inclusive, innovative, and forward-thinking AI technologies.

Check out the Model Card and Dataset. All credit for this research goes to the researchers of this project.


    The post Sarvam AI Releases Samvaad-Hi-v1 Dataset and Sarvam-2B: A 2 Billion Parameter Language Model with 4 Trillion Tokens Focused on 10 Indic Languages for Enhanced NLP appeared first on MarkTechPost.
