
    aiXplain Researchers Develop Innovative Approaches for Arabic Prompt Instruction Following with LLMs

    August 17, 2024

Large language models require large datasets of prompts, each pairing a user request with a correct response, to learn to understand and generate human-like text in answer to a wide range of questions. While immense effort has gone into building such datasets for English, other languages, Arabic in particular, have received far less attention. This imbalance in data availability severely restricts the applicability of LLMs in non-English-speaking regions and marks a critical gap in the NLP domain.

The challenge this research addresses is the lack of high-quality Arabic prompt datasets for training LLMs to perform well in Arabic. Without such data, models cannot effectively understand or generate Arabic text and are of limited use to Arabic-speaking users. This matters because Arabic is among the most widely spoken languages in the world, yet it remains under-resourced, leaving current AI technologies underserving a large share of its speakers. The language’s rich morphology and many dialects also make it labor-intensive to develop prompt templates that represent it appropriately. Building a large, high-quality Arabic dataset is therefore essential to extend the usefulness of LLMs to a wider audience.

Current approaches to prompt dataset generation are mostly oriented toward English and involve either manual prompt writing or tools that derive prompts from existing datasets. For example, PromptSource and Super-NaturalInstructions have made millions of prompts available for English-language LLMs. These methods have yet to be adapted at scale to other languages, so the resources for training LLMs in languages like Arabic remain scarce. That scarcity has hampered the ability of LLMs to excel in these languages and underlines the need for more focused dataset-creation efforts.

Researchers from aiXplain Inc. have introduced two methods for creating large-scale Arabic prompt datasets to address this issue. The first translates existing English prompt datasets into Arabic with an automatic translation system and then applies a rigorous quality assessment. It relies on state-of-the-art machine translation and quality estimation tools to ensure that the translated prompts remain accurate; after filtering, roughly 20% of the translated prompts were retained, yielding a dataset of around 20 million high-quality Arabic prompts. The second method creates new prompts directly from existing Arabic NLP datasets, using a prompt sourcing tool to generate prompts for 78 publicly available Arabic datasets covering tasks such as question answering, summarization, and hate speech detection. Over 67.4 million prompts were created this way, significantly expanding the resources available for training Arabic LLMs.
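
The paper does not reproduce its prompt templates, so the sketch below only illustrates the general mechanism of PromptSource-style prompt sourcing: expanding one labeled record of an Arabic dataset into several instruction/response pairs. The field names and the Arabic template wording here are hypothetical placeholders, not the templates used by the authors.

```python
# Minimal sketch (not the authors' code) of template-based prompt sourcing:
# one labeled summarization record is expanded into one prompt per template.
# Field names ("text", "summary") and template wording are hypothetical.

from typing import Dict, List

SUMMARIZATION_TEMPLATES = [
    "لخص النص التالي:\n{text}",                 # "Summarize the following text:"
    "اكتب ملخصا قصيرا لهذا المقال:\n{text}",     # "Write a short summary of this article:"
]

def record_to_prompts(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Expand one dataset record into one (prompt, response) pair per template."""
    return [
        {"prompt": tpl.format(text=record["text"]), "response": record["summary"]}
        for tpl in SUMMARIZATION_TEMPLATES
    ]

example = {"text": "نص مقال طويل ...", "summary": "ملخص قصير ..."}
for pair in record_to_prompts(example):
    print(pair["prompt"].splitlines()[0], "->", pair["response"])
```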

The translation-based approach follows an end-to-end data-processing pipeline: English prompts are first split into sentences, which are then translated into Arabic by a neural machine translation model. A reference-free machine translation quality estimation model assigns each translated sentence a quality score, and a prompt is retained only if it meets a set quality threshold, which keeps the final dataset highly accurate. Manual verification on a random sample of prompts further raises the dataset’s quality. The direct-generation approach instead uses PromptSource to create multiple templates for every task in the Arabic datasets, producing diverse, contextually relevant prompts well suited to training effective language models.
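
Neither the translation model nor the quality estimation model is described here in enough detail to reproduce, so the following is a minimal sketch of the translate-then-filter step under stated assumptions: translate_to_arabic and estimate_quality stand in for a neural MT system and a reference-free QE scorer, and the threshold value is illustrative rather than the one used in the paper.

```python
# Sketch of the translate-and-filter pipeline described above. The two model
# callables and the threshold are placeholders, not the paper's components.

from typing import Callable, Iterable, List, Tuple

def translate_and_filter(
    english_prompts: Iterable[str],
    translate_to_arabic: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],  # (source, translation) -> score in [0, 1]
    quality_threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Translate each English prompt and keep only translations above the quality threshold."""
    kept: List[Tuple[str, str, float]] = []
    for src in english_prompts:
        hyp = translate_to_arabic(src)
        score = estimate_quality(src, hyp)
        if score >= quality_threshold:   # drop low-confidence translations
            kept.append((src, hyp, score))
    return kept

# Toy run with dummy models, just to show the control flow.
dummy_translate = lambda s: "ترجمة تجريبية: " + s
dummy_quality = lambda src, hyp: 0.9 if len(src.split()) > 3 else 0.5
print(translate_and_filter(["Summarize the article below in one sentence.", "Hi"],
                           dummy_translate, dummy_quality))
```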

The researchers then used the newly created prompts to fine-tune an open 7-billion-parameter LLM, the Qwen2 7B model. Evaluated on several benchmarks, the fine-tuned model handled Arabic prompts significantly better, outperforming a state-of-the-art 70-billion-parameter instruction-tuned model, Llama3 70B. Specifically, Qwen2 7B fine-tuned on just 800,000 prompts achieved a ROUGE-L score of 0.184, while the version fine-tuned on 8 million prompts reached 0.224. These results highlight the effectiveness of the new prompt datasets and show that fine-tuning on larger datasets yields better model performance.
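
For context on the reported numbers, ROUGE-L measures longest-common-subsequence overlap between a model’s output and a reference answer. The self-contained sketch below computes a sentence-level ROUGE-L F-measure over whitespace tokens; the paper’s exact tokenization and aggregation may differ.

```python
# Generic sentence-level ROUGE-L (F-measure of LCS precision and recall) over
# whitespace tokens; the paper's tokenizer and aggregation may differ.

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l("القاهرة عاصمة مصر", "عاصمة مصر هي القاهرة"), 3))  # ~0.571
```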

In a nutshell, this research tackles a serious gap: the scarcity of Arabic prompt datasets for training large language models. By introducing two new ways to create such datasets, it substantially expands the resources available for training Arabic LLMs. Fine-tuning Qwen2 7B on the newly generated prompts produces a model that outperforms a much larger existing model and sets a strong benchmark for Arabic LLMs. More broadly, the work underscores the need for robust, scalable methods of dataset creation in languages other than English.

Check out the Paper. All credit for this research goes to the researchers of this project.


    The post aiXplain Researchers Develop Innovative Approaches for Arabic Prompt Instruction Following with LLMs appeared first on MarkTechPost.
