Large Language Models (LLMs) have demonstrated impressive performance on natural language understanding and text generation tasks. However, they still struggle in more demanding settings: tasks that require using tools, handling structured data, or carrying out complex multi-step reasoning. For instance, although LLMs are adept at comprehending unstructured text, they have trouble interpreting and manipulating structured data such as spreadsheets, tables, and databases. They also frequently perform poorly on tasks like multi-hop question answering (MHQA), which requires combining information from several sources, and on tool-dependent tasks such as answering questions over tables with SQL.
To address these limitations, researchers from Meta, Oxford University, and University College London have introduced a new technique called Source2Synth. Its primary benefit is the ability to teach LLMs new skills without expensive and time-consuming human annotation. Conventional approaches to improving LLM performance often require extensive manual labeling, which is costly and hard to scale, particularly for complex tasks. Source2Synth removes this requirement by generating synthetic data that imitates real-world scenarios and reasoning processes.
To create synthetic examples with intermediate reasoning steps, Source2Synth starts from a real data source, such as tables from the web or related articles. Because the examples are grounded in actual data, the synthetic data remains diverse, realistic, and factually correct. The method's core step is to select a seed topic, which might be an entity or a factual statement, and then develop it into a complete example consisting of the task instruction, the reasoning steps needed to solve it, and the final answer. Through this procedure, Source2Synth generates realistic data points that mirror how an LLM should handle structured data or carry out multi-step tasks.
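As a rough illustration, the generation loop might look something like the Python sketch below. The `call_llm` helper, the prompt wording, and the `SyntheticExample` fields are hypothetical placeholders rather than Source2Synth's actual implementation; they are only meant to make the seed-to-example flow concrete.

```python
# A minimal sketch of the seed-to-example loop, under the assumptions above.
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    instruction: str       # the task posed to the model
    reasoning_steps: str   # the intermediate reasoning chain
    answer: str            # the final answer, grounded in the source

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM completion API you use."""
    raise NotImplementedError

def build_example(source_text: str) -> SyntheticExample:
    # 1. Extract a seed topic (an entity or fact) from the real data source.
    seed = call_llm(f"Extract one salient entity or fact from:\n{source_text}")
    # 2. Grow the seed into a full task instruction grounded in the source.
    instruction = call_llm(
        f"Using this source:\n{source_text}\n"
        f"Write a question about '{seed}' that requires multi-step reasoning."
    )
    # 3. Generate the intermediate reasoning chain and the final answer.
    reasoning = call_llm(f"Answer step by step, citing the source:\n{instruction}")
    answer = call_llm(f"Given this reasoning:\n{reasoning}\nState the final answer only.")
    return SyntheticExample(instruction, reasoning, answer)
```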
A key component of Source2Synth is how it curates the generated dataset. Not all generated data points are equally valuable, and low-quality examples can degrade model performance. To address this, Source2Synth filters examples by how answerable they are: if the model fails to produce the correct answer to a generated example within a set number of trials, the example is discarded. This quality-control step ensures that only high-quality examples, those that actually help the LLM acquire the target skill, are kept for the final fine-tuning stage.
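Continuing the sketch above (and reusing its hypothetical `SyntheticExample` and `call_llm`), the answerability filter could be approximated as follows. The exact-match comparison and the default of three trials are assumptions for illustration, not the paper's exact criteria.

```python
def is_answerable(example: SyntheticExample, max_trials: int = 3) -> bool:
    """Keep an example only if the model recovers the known answer within max_trials."""
    for _ in range(max_trials):
        prediction = call_llm(example.instruction)
        if prediction.strip().lower() == example.answer.strip().lower():
            return True
    return False  # never produced the right answer: discard the example

# Only examples that pass the filter reach the fine-tuning set.
candidate_examples: list[SyntheticExample] = []  # generated examples would go here
curated = [ex for ex in candidate_examples if is_answerable(ex)]
```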
The technique has been evaluated in two distinct and demanding domains:
Multi-Hop Question Answering (MHQA): In this domain, the LLM must analyze and synthesize information from several sources to answer a single question. Evaluated on HotPotQA, a dataset built for multi-hop reasoning, Source2Synth outperformed baseline models fine-tuned with conventional techniques by 22.57%.
Tabular Question Answering (TQA): This domain involves answering questions over structured data, which frequently requires issuing SQL queries against tables. Tested on WikiSQL, a dataset focused on answering questions about tables with SQL, Source2Synth achieved a 25.51% improvement over baseline models.
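To make the TQA setting concrete, here is a small, self-contained Python example of answering a question over a table by executing SQL. The table contents are invented, and the query is hard-coded here; in a WikiSQL-style task, the model would generate that query from the natural-language question.

```python
import sqlite3

# Build a tiny in-memory table standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Alice", "Red", 31), ("Bob", "Blue", 24), ("Cara", "Red", 18)],
)

# Question: "Which player on the Red team scored the most points?"
sql = "SELECT name FROM players WHERE team = 'Red' ORDER BY points DESC LIMIT 1"
print(conn.execute(sql).fetchone()[0])  # -> Alice
```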
These results demonstrate that Source2Synth can improve LLM performance on challenging tasks without large amounts of human-annotated data. By producing grounded, realistic examples and rigorously filtering the dataset for quality, Source2Synth offers a scalable way to train LLMs in domains that require sophisticated reasoning and tool use.
In conclusion, Source2Synth is a novel method for teaching LLMs new skills, particularly in situations where human annotation is not feasible. By grounding synthetic data generation in real-world sources and keeping only high-quality examples for fine-tuning, it addresses current LLM limitations on complex tasks such as multi-step reasoning and structured data manipulation.
Check out the Paper. All credit for this research goes to the researchers of this project.