
    3 Questions: The pros and cons of synthetic data in AI

    September 3, 2025

    Synthetic data are artificially generated by algorithms to mimic the statistical properties of actual data, without containing any information from real-world sources. While concrete numbers are hard to pin down, some estimates suggest that more than 60 percent of data used for AI applications in 2024 was synthetic, and this figure is expected to grow across industries.

    Because synthetic data don’t contain real-world information, they hold the promise of safeguarding privacy while lowering the cost and increasing the speed of developing new AI models. But using synthetic data requires careful evaluation, planning, and checks and balances to prevent loss of performance when AI models are deployed.

    To unpack some pros and cons of using synthetic data, MIT News spoke with Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems and co-founder of DataCebo whose open-core platform, the Synthetic Data Vault, helps users generate and test synthetic data.

    Q: How are synthetic data created?

    A: Synthetic data are algorithmically generated but do not come from a real situation. Their value lies in their statistical similarity to real data. If we’re talking about language, for instance, synthetic data look very much as if a human had written those sentences. While researchers have created synthetic data for a long time, what has changed in the past few years is our ability to build generative models out of data and use them to create realistic synthetic data. We can take a little bit of real data and build a generative model from that, which we can use to create as much synthetic data as we want. Plus, the model creates synthetic data in a way that captures all the underlying rules and infinite patterns that exist in the real data.
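The idea of fitting a generative model to a small real sample and then drawing as many synthetic values as needed can be sketched in a few lines of Python. This is a deliberately minimal illustration, not the Synthetic Data Vault's method: it assumes a single numeric column that is roughly Gaussian, and the `real_amounts` values are made up for the example. A single Gaussian captures only a mean and a spread, far less structure than the generative models described in the interview.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of one numeric column."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic values from the fitted Gaussian."""
    mean, stdev = model
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" transaction amounts (invented for this sketch).
real_amounts = [12.5, 40.0, 33.2, 27.9, 51.3, 19.8, 45.1, 30.4]

model = fit_gaussian(real_amounts)
synthetic_amounts = sample_synthetic(model, n=10_000)

# Eight real values in, ten thousand synthetic values out.
print(len(real_amounts), len(synthetic_amounts))
```
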

    There are essentially four different data modalities: language, video or images, audio, and tabular data. All four of them have slightly different ways of building the generative models to create synthetic data. An LLM, for instance, is nothing but a generative model from which you are sampling synthetic data when you ask it a question.      

    A lot of language and image data are publicly available on the internet. But tabular data, which is the data collected when we interact with physical and social systems, is often locked up behind enterprise firewalls. Much of it is sensitive or private, such as customer transactions stored by a bank. For this type of data, platforms like the Synthetic Data Vault provide software that can be used to build generative models. Those models then create synthetic data that preserve customer privacy and can be shared more widely.      

    One powerful thing about this generative modeling approach for synthesizing data is that enterprises can now build a customized, local model for their own data. Generative AI automates what used to be a manual process.

    Q: What are some benefits of using synthetic data, and which use-cases and applications are they particularly well-suited for?

    A: One fundamental application which has grown tremendously over the past decade is using synthetic data to test software applications. There is data-driven logic behind many software applications, so you need data to test that software and its functionality. In the past, people have resorted to manually generating data, but now we can use generative models to create as much data as we need.

    Users can also create specific data for application testing. Say I work for an e-commerce company. I can generate synthetic data that mimics real customers who live in Ohio and made transactions pertaining to one particular product in February or March.
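One simple way to get targeted test data of this kind is to sample from a generator and keep only the records that satisfy the condition. The sketch below assumes a toy, hand-written generator with hypothetical states, products, and fields; a real platform would sample from a model fitted to actual data and may condition more efficiently than this rejection-style loop.

```python
import random

# Hypothetical category values, invented for this sketch.
STATES = ["Ohio", "Texas", "Oregon"]
PRODUCTS = ["kettle", "lamp", "desk"]


def generate_record(rng):
    """Sample one synthetic transaction from a toy generator."""
    return {
        "state": rng.choice(STATES),
        "product": rng.choice(PRODUCTS),
        "month": rng.randint(1, 12),
        "amount": round(rng.uniform(5.0, 200.0), 2),
    }


def generate_matching(n, predicate, seed=0):
    """Keep sampling until n records satisfy the predicate."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        record = generate_record(rng)
        if predicate(record):
            rows.append(record)
    return rows


def ohio_kettle_feb_mar(record):
    """Customers in Ohio who bought one product in February or March."""
    return (
        record["state"] == "Ohio"
        and record["product"] == "kettle"
        and record["month"] in (2, 3)
    )


test_rows = generate_matching(100, ohio_kettle_feb_mar)
print(len(test_rows), test_rows[0])
```
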

    Because synthetic data aren’t drawn from real situations, they are also privacy-preserving. One of the biggest problems in software testing has been getting access to sensitive real data for testing software in non-production environments, due to privacy concerns. Another immediate benefit is in performance testing. You can create a billion transactions from a generative model and test how fast your system can process them.

    Another application where synthetic data hold a lot of promise is in training machine-learning models. Sometimes, we want an AI model to help us predict an event that is less frequent. A bank may want to use an AI model to predict fraudulent transactions, but there may be too few real examples to train a model that can identify fraud accurately. Synthetic data provide data augmentation — additional data examples that are similar to the real data. These can significantly improve the accuracy of AI models.
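As a rough illustration of augmenting a rare class, the sketch below creates extra fraud examples by interpolating between pairs of real ones, in the spirit of SMOTE-style oversampling. This technique is swapped in for simplicity and is not the generative-model approach described in the interview; the feature layout `(amount, hour_of_day)` and the values are hypothetical.

```python
import random


def augment_minority(examples, n_new, seed=0):
    """Create n_new points by interpolating between random pairs of real examples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(examples, 2)  # two distinct real examples
        t = rng.random()                # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical fraud examples: (amount, hour_of_day).
fraud = [(950.0, 3.0), (1200.0, 2.0), (870.0, 4.0), (1500.0, 1.0)]

extra_fraud = augment_minority(fraud, n_new=200)
training_fraud = fraud + extra_fraud
print(len(training_fraud))
```

Every synthetic point lies between two real fraud examples, so the augmented set stays close to the observed cases. Whether that actually improves a model still has to be checked on held-out real data.
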

    Also, sometimes users don’t have time or the financial resources to collect all the data. For instance, collecting data about customer intent would require conducting many surveys. If you end up with limited data and then try to train a model, it won’t perform well. You can augment by adding synthetic data to train those models better.

    Q: What are some of the risks or potential pitfalls of using synthetic data, and are there steps users can take to prevent or mitigate those problems?

    A: One of the biggest questions people often have in their mind is, if the data are synthetically created, why should I trust them? Determining whether you can trust the data often comes down to evaluating the overall system where you are using them.

    There are a lot of aspects of synthetic data we have been able to evaluate for a long time. For instance, there are existing methods to measure how close synthetic data are to real data, and we can measure their quality and whether they preserve privacy. But there are other important considerations if you are using those synthetic data to train a machine-learning model for a new use case. How would you know the data are going to lead to models that still make valid conclusions?
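One widely used way to measure how close a synthetic column is to a real one is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distributions, where 0 means identical distributions and 1 means no overlap. The sketch below is a plain-Python version chosen for illustration; it is one possible closeness measure, not a description of any particular library's implementation.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


print(ks_statistic([1, 2, 3], [1, 2, 3]))        # identical samples
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # no overlap
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partial overlap
```

A low value says only that one column looks statistically similar. It says nothing about whether a model trained on the synthetic data will reach valid conclusions for a given task.
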

    New efficacy metrics are emerging, and the emphasis is now on efficacy for a particular task. You must really dig into your workflow to ensure the synthetic data you add to the system still allow you to draw valid conclusions. That is something that must be done carefully on an application-by-application basis.

    Bias can also be an issue. Since synthetic data are created from a small amount of real data, the same bias that exists in the real data can carry over into the synthetic data. Just like with real data, you would need to purposefully make sure the bias is removed through different sampling techniques, which can create balanced datasets. It takes some careful planning, but you can calibrate the data generation to prevent the proliferation of bias.
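A minimal sketch of one such sampling technique: draw the same number of rows from every group, sampling with replacement so that small groups are not dropped. The `group` field and the 90/10 split are hypothetical, and duplicating rows from a small group is only a crude stand-in for calibrating a generative model to produce balanced output.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, key, per_group, seed=0):
    """Draw per_group rows from every group, with replacement."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    balanced = []
    for name in sorted(groups):  # fixed order keeps the result reproducible
        balanced.extend(rng.choices(groups[name], k=per_group))
    return balanced


# Hypothetical skewed dataset: 90 rows from group A, 10 from group B.
rows = [{"group": "A", "id": i} for i in range(90)]
rows += [{"group": "B", "id": i} for i in range(90, 100)]

balanced = balanced_sample(rows, key="group", per_group=50)
print(Counter(row["group"] for row in balanced))
```
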

    To help with the evaluation process, our group created the Synthetic Data Metrics Library. We worried that people would use synthetic data in their environment and it would give different conclusions in the real world. We created a metrics and evaluation library to ensure checks and balances. The machine learning community has faced a lot of challenges in ensuring models can generalize to new situations. The use of synthetic data adds a whole new dimension to that problem.

    I expect that the old systems of working with data, whether to build software applications, answer analytical questions, or train models, will dramatically change as we get more sophisticated at building these generative models. A lot of things we have never been able to do before will now be possible.
