Large language models (LLMs) require large datasets of prompts, that is, user requests paired with correct responses, to learn to understand and generate human-like text in answer to various questions. Immense effort has gone into developing such datasets for English, but other languages, notably Arabic, have received far less attention. This imbalance in data availability severely restricts the applicability of LLMs in non-English-speaking regions and marks a critical gap in the NLP domain.
The research challenge this paper addresses is the lack of high-quality Arabic prompt datasets for training LLMs to perform well in Arabic. Without such resources, LLMs cannot effectively understand and generate Arabic text, making them far less useful to Arabic-speaking users. This matters because Arabic is among the most widely spoken languages in the world, yet it remains under-resourced, meaning present AI technologies underserve a huge fraction of humanity. The language's rich morphology and many dialects add to the difficulty: building prompts that portray the language appropriately takes considerable work. Creating a large, high-quality Arabic prompt dataset is therefore important for extending the usefulness of LLMs to a wider audience.
Current approaches to prompt dataset generation are mostly oriented towards English and involve either manual prompt creation or tools that generate prompts from existing datasets. For example, PromptSource and Super-NaturalInstructions have made millions of prompts available for English-language LLMs. However, these methods have not been adapted at scale to other languages, so the resources for training LLMs in non-English languages remain considerably lacking. The scarcity of prompt datasets in languages like Arabic has hampered the ability of LLMs to excel in them, underlining the need for more focused dataset-creation efforts.
Researchers from aiXplain Inc. have introduced two methods for creating large-scale Arabic prompt datasets to address this issue. The first method translates existing English prompt datasets into Arabic with an automatic translation system and then applies a rigorous quality assessment process, relying on state-of-the-art machine translation and quality estimation tools to ensure the translated prompts maintain high accuracy. Applying these filters, the researchers retained approximately 20% of the translated prompts, yielding a dataset of around 20 million high-quality Arabic prompts. The second method creates new prompts directly from existing Arabic NLP datasets, using a prompt sourcing tool to generate prompts for 78 publicly available Arabic datasets covering tasks such as question answering, summarization, and hate-speech detection. Over 67.4 million prompts were created through this process, significantly expanding the resources available for training Arabic LLMs.
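PromptSource templates are Jinja-based, so a template of the kind the second method produces can be sketched in a few lines. The following is a hypothetical example for a hate-speech detection task; the field names and Arabic wording are invented for illustration and do not come from the paper's datasets.

```python
# Hypothetical PromptSource-style template for a hate-speech detection task.
# The field names ("text", "label") and the Arabic wording are invented.
from jinja2 import Template  # pip install jinja2

template = Template(
    "هل يحتوي النص التالي على خطاب كراهية؟ أجب بنعم أو لا.\n"
    "النص: {{ text }}\n"
    "الإجابة: {{ 'نعم' if label == 1 else 'لا' }}"
)

# One row from a hypothetical labeled dataset (0 = not hate speech).
example = {"text": "مثال على نص عربي.", "label": 0}
print(template.render(**example))
```

Rendering a handful of such template variants over every row of the 78 source datasets is how this method scales to tens of millions of prompts.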
On the direct-generation side, PromptSource is used to create multiple templates for every task in the Arabic datasets, much as in the sketch above, allowing the creation of diverse, contextually relevant prompts that are well suited to training effective language models. The translation-based approach, meanwhile, follows an end-to-end data-processing pipeline: English prompts are tokenized into sentences, which are translated into Arabic by a neural machine translation model. A referenceless machine translation quality estimation model then assigns each translated sentence a quality score, and a prompt is retained only if it meets the set quality threshold, keeping the final dataset highly accurate. Manual verification of a random sample of prompts raises the dataset's quality further.
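A minimal Python sketch of that split-translate-score-filter loop follows. The opus-mt-en-ar model, the `qe_score` placeholder, and the 0.8 threshold are assumptions for illustration; the paper's actual translation system, quality estimator, and threshold are not reproduced here.

```python
# Sketch of the split-translate-score-filter loop. The model name, the
# qe_score placeholder, and the threshold are illustrative assumptions.
from nltk import sent_tokenize     # pip install nltk; requires the "punkt" data
from transformers import pipeline  # pip install transformers

# Any English-to-Arabic NMT model works here; opus-mt-en-ar is a common public choice.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")

def translate_prompt(prompt_en):
    """Split an English prompt into sentences and translate each one."""
    sentences = sent_tokenize(prompt_en)
    outputs = translator(sentences, max_length=512)
    return list(zip(sentences, [o["translation_text"] for o in outputs]))

def keep_prompt(sentence_pairs, qe_score, threshold=0.8):
    """Retain a prompt only if every sentence clears the quality threshold.

    qe_score(src, mt) stands in for a referenceless quality estimation
    model (e.g., a COMET-style estimator) returning a per-sentence score.
    """
    return all(qe_score(src, mt) >= threshold for src, mt in sentence_pairs)
```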
The researchers then used the newly created prompts to fine-tune an open 7-billion-parameter LLM, the Qwen2 7B model. The fine-tuned model was tested against several benchmarks and significantly improved at handling Arabic prompts, outperforming a state-of-the-art 70-billion-parameter instruction-tuned model, Llama3 70B. Specifically, the Qwen2 7B model fine-tuned on just 800,000 prompts achieved a ROUGE-L score of 0.184, while the model fine-tuned on 8 million prompts achieved 0.224. These results highlight the effectiveness of the new prompt datasets and show that fine-tuning with larger datasets leads to better model performance.
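For context on these numbers, ROUGE-L measures the longest-common-subsequence overlap between a model's output and a reference answer. A minimal computation with the open-source rouge-score package might look like this sketch; the example strings are invented, not drawn from the paper's benchmarks.

```python
# Minimal illustration of ROUGE-L using the rouge-score package
# (pip install rouge-score). The example strings are invented.
from rouge_score import rouge_scorer

# Stemming is disabled because the default Porter stemmer targets English.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

reference = "the capital of egypt is cairo"   # gold answer
prediction = "cairo is the capital of egypt"  # model output
print(scorer.score(reference, prediction)["rougeL"].fmeasure)

# Note: the package's default tokenizer keeps only [a-z0-9] tokens, so
# scoring Arabic text would need a tokenizer that preserves Arabic
# characters, passed via RougeScorer(..., tokenizer=...).
```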
In a nutshell, this research tackles a serious gap: the lack of Arabic prompt datasets for training large language models. By introducing two new ways to create such datasets, it substantially expands the resources available for training Arabic LLMs. Fine-tuning the Qwen2 7B model on the newly generated prompts produces a model that outperforms existing models and sets a strong benchmark for Arabic LLMs. The work also points to the broader need for robust, scalable methods for creating datasets in languages other than English.
Check out the Paper. All credit for this research goes to the researchers of this project.