
    aiXplain Researchers Develop Innovative Approaches for Arabic Prompt Instruction Following with LLMs

    August 17, 2024

Large language models require large datasets of prompts, each pairing a user request with a correct response, to learn to understand and generate human-like text in answer to a wide range of questions. While immense effort has gone into building such datasets for English, other languages, Arabic in particular, have received far less attention. This imbalance in data availability severely restricts the applicability of LLMs in non-English-speaking regions and marks a critical gap in the NLP domain.

The challenge this research addresses is the lack of high-quality Arabic prompt datasets for training LLMs to perform well in Arabic. Without such data, models cannot effectively understand or generate Arabic text and are of limited use to Arabic-speaking users. This matters because Arabic is among the most widely spoken languages in the world, yet it remains under-resourced, leaving current AI technologies underserving a large share of its speakers. The language’s rich morphology and many dialects also make it labor-intensive to develop prompt templates that represent it appropriately. Building a large, high-quality Arabic dataset is therefore essential to extend the usefulness of LLMs to a wider audience.

Current approaches to prompt dataset generation are mostly oriented toward English and involve either manual prompt writing or tools that derive prompts from existing datasets. For example, PromptSource and Super-NaturalInstructions have made millions of prompts available for English-language LLMs. These methods have yet to be adapted at scale to other languages, so the resources for training LLMs in languages like Arabic remain scarce. That scarcity has hampered the ability of LLMs to excel in these languages and underlines the need for more focused dataset-creation efforts.

Researchers from aiXplain Inc. have introduced two methods for creating large-scale Arabic prompt datasets to address this issue. The first translates existing English prompt datasets into Arabic with an automatic translation system and then applies a rigorous quality assessment. It relies on state-of-the-art machine translation and quality estimation tools to ensure that the translated prompts remain accurate; after filtering, roughly 20% of the translated prompts were retained, yielding a dataset of around 20 million high-quality Arabic prompts. The second method creates new prompts directly from existing Arabic NLP datasets, using a prompt sourcing tool to generate prompts for 78 publicly available Arabic datasets covering tasks such as question answering, summarization, and hate speech detection. Over 67.4 million prompts were created this way, significantly expanding the resources available for training Arabic LLMs.
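
The paper does not reproduce its prompt templates, so the sketch below only illustrates the general mechanism of PromptSource-style prompt sourcing: expanding one labeled record of an Arabic dataset into several instruction/response pairs. The field names and the Arabic template wording here are hypothetical placeholders, not the templates used by the authors.

```python
# Minimal sketch (not the authors' code) of template-based prompt sourcing:
# one labeled summarization record is expanded into one prompt per template.
# Field names ("text", "summary") and template wording are hypothetical.

from typing import Dict, List

SUMMARIZATION_TEMPLATES = [
    "لخص النص التالي:\n{text}",                 # "Summarize the following text:"
    "اكتب ملخصا قصيرا لهذا المقال:\n{text}",     # "Write a short summary of this article:"
]

def record_to_prompts(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Expand one dataset record into one (prompt, response) pair per template."""
    return [
        {"prompt": tpl.format(text=record["text"]), "response": record["summary"]}
        for tpl in SUMMARIZATION_TEMPLATES
    ]

example = {"text": "نص مقال طويل ...", "summary": "ملخص قصير ..."}
for pair in record_to_prompts(example):
    print(pair["prompt"].splitlines()[0], "->", pair["response"])
```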

The translation-based approach follows an end-to-end data-processing pipeline: English prompts are first split into sentences, which are then translated into Arabic by a neural machine translation model. A reference-free machine translation quality estimation model assigns each translated sentence a quality score, and a prompt is retained only if it meets a set quality threshold, which keeps the final dataset highly accurate. Manual verification on a random sample of prompts further raises the dataset’s quality. The direct-generation approach instead uses PromptSource to create multiple templates for every task in the Arabic datasets, producing diverse, contextually relevant prompts well suited to training effective language models.
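
Neither the translation model nor the quality estimation model is described here in enough detail to reproduce, so the following is a minimal sketch of the translate-then-filter step under stated assumptions: translate_to_arabic and estimate_quality stand in for a neural MT system and a reference-free QE scorer, and the threshold value is illustrative rather than the one used in the paper.

```python
# Sketch of the translate-and-filter pipeline described above. The two model
# callables and the threshold are placeholders, not the paper's components.

from typing import Callable, Iterable, List, Tuple

def translate_and_filter(
    english_prompts: Iterable[str],
    translate_to_arabic: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],  # (source, translation) -> score in [0, 1]
    quality_threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Translate each English prompt and keep only translations above the quality threshold."""
    kept: List[Tuple[str, str, float]] = []
    for src in english_prompts:
        hyp = translate_to_arabic(src)
        score = estimate_quality(src, hyp)
        if score >= quality_threshold:   # drop low-confidence translations
            kept.append((src, hyp, score))
    return kept

# Toy run with dummy models, just to show the control flow.
dummy_translate = lambda s: "ترجمة تجريبية: " + s
dummy_quality = lambda src, hyp: 0.9 if len(src.split()) > 3 else 0.5
print(translate_and_filter(["Summarize the article below in one sentence.", "Hi"],
                           dummy_translate, dummy_quality))
```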

The researchers then used the newly created prompts to fine-tune an open 7-billion-parameter LLM, the Qwen2 7B model. Evaluated on several benchmarks, the fine-tuned model handled Arabic prompts significantly better, outperforming a state-of-the-art 70-billion-parameter instruction-tuned model, Llama3 70B. Specifically, Qwen2 7B fine-tuned on just 800,000 prompts achieved a ROUGE-L score of 0.184, while the version fine-tuned on 8 million prompts reached 0.224. These results highlight the effectiveness of the new prompt datasets and show that fine-tuning on larger datasets yields better model performance.
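
For context on the reported numbers, ROUGE-L measures longest-common-subsequence overlap between a model’s output and a reference answer. The self-contained sketch below computes a sentence-level ROUGE-L F-measure over whitespace tokens; the paper’s exact tokenization and aggregation may differ.

```python
# Generic sentence-level ROUGE-L (F-measure of LCS precision and recall) over
# whitespace tokens; the paper's tokenizer and aggregation may differ.

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l("القاهرة عاصمة مصر", "عاصمة مصر هي القاهرة"), 3))  # ~0.571
```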

In a nutshell, this research tackles a serious gap: the scarcity of Arabic prompt datasets for training large language models. By introducing two new ways to create such datasets, it substantially expands the resources available for training Arabic LLMs. Fine-tuning Qwen2 7B on the newly generated prompts produces a model that outperforms a much larger existing model and sets a strong benchmark for Arabic LLMs. More broadly, the work underscores the need for robust, scalable methods of dataset creation in languages other than English.

Check out the Paper. All credit for this research goes to the researchers of this project.


    The post aiXplain Researchers Develop Innovative Approaches for Arabic Prompt Instruction Following with LLMs appeared first on MarkTechPost.
