
    Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)

    May 25, 2025

    Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution—and it’s already widely used:

    • LLMs train on AI-generated text
    • Fraud systems simulate edge cases
    • Vision models pretrain on fake images

    SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

    In this tutorial, we’ll use SDV to generate synthetic data step by step.

    We will first install the sdv library:

    pip install sdv
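Before moving on, it can help to confirm that the package is visible to the interpreter you are running. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of SDV.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if sdv is installed in this environment, else None.
print(installed_version("sdv"))
```

If this prints None, the install most likely went into a different Python environment than the one running your notebook or script.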

    Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames in a dictionary keyed by file name. In this case, the dataset is assumed to live in a file named data.csv, so we access it using data['data'].

    from sdv.io.local import CSVHandler

    connector = CSVHandler()
    FOLDER_NAME = '.' # If the data is in the same directory

    # Read every CSV file in the folder into a dict of DataFrames
    data = connector.read(folder_name=FOLDER_NAME)
    salesDf = data['data']
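To make the folder-to-dictionary mapping concrete, here is a rough standard-library illustration of the same idea. It is not SDV's implementation: `read_csv_folder` is a hypothetical helper, and it returns lists of row dictionaries rather than pandas DataFrames. The point is only that each table is keyed by its file name without the .csv extension.

```python
import csv
from pathlib import Path


def read_csv_folder(folder_name):
    """Return {file stem: list of row dicts} for every CSV file in a folder."""
    tables = {}
    for path in sorted(Path(folder_name).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # 'data.csv' becomes the key 'data', mirroring data['data'] above
            tables[path.stem] = list(csv.DictReader(handle))
    return tables
```

A folder containing only data.csv would therefore produce a dictionary with the single key 'data'.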

    We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data:

    from sdv.metadata import Metadata
    metadata = Metadata.load_from_json('metadata.json')

    The metadata includes:

    • The table name
    • The primary key
    • The data type of each column (e.g., categorical, numerical, datetime, etc.)
    • Optional column formats like datetime patterns or ID patterns
    • Table relationships (for multi-table setups)

    Here is a sample metadata.json format:

    {
      "METADATA_SPEC_VERSION": "V1",
      "tables": {
        "your_table_name": {
          "primary_key": "your_primary_key_column",
          "columns": {
            "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
            "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
            "category_column": { "sdtype": "categorical" },
            "numeric_column": { "sdtype": "numerical" }
          },
          "column_relationships": []
        }
      }
    }

    Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.

    from sdv.metadata import Metadata

    metadata = Metadata.detect_from_dataframes(data)
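Whether the metadata is written by hand or detected automatically, a quick sanity check is to compare the declared column formats against a few actual values. The sketch below works on a plain dictionary shaped like one table entry in the sample metadata.json; the helper `find_format_mismatches` and the column names `sale_id` and `Date` are hypothetical, and the rows are made-up illustrations rather than real data.

```python
import re
from datetime import datetime


def find_format_mismatches(table_metadata, rows):
    """Return (column, value) pairs that violate a declared regex or datetime format."""
    problems = []
    for column, spec in table_metadata["columns"].items():
        for row in rows:
            value = row[column]
            if "regex_format" in spec and not re.fullmatch(spec["regex_format"], value):
                problems.append((column, value))
            if "datetime_format" in spec:
                try:
                    datetime.strptime(value, spec["datetime_format"])
                except ValueError:
                    problems.append((column, value))
    return problems


# Hypothetical table entry, following the sample metadata.json structure
table_metadata = {
    "primary_key": "sale_id",
    "columns": {
        "sale_id": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
        "Date": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
    },
}

# Made-up rows: the second one breaks both declared formats
rows = [
    {"sale_id": "T000123", "Date": "25-05-2025"},
    {"sale_id": "X1", "Date": "2025-05-25"},
]

print(find_format_mismatches(table_metadata, rows))
```

Any pair reported here points at either a wrong format in the metadata or a dirty value in the data, and both are worth fixing before training.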

    With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.

    from sdv.single_table import GaussianCopulaSynthesizer

    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(data=salesDf)
    synthetic_data = synthesizer.sample(num_rows=10000)

    You can control how many rows to generate using the num_rows argument.

    The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A great place to start is by generating a quality report:

    from sdv.evaluation.single_table import evaluate_quality

    quality_report = evaluate_quality(
        salesDf,
        synthetic_data,
        metadata)
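To give a feel for the kind of per-column comparison such a report is built on, the sketch below implements two common similarity scores by hand: a Kolmogorov-Smirnov complement for numerical columns and a total variation complement for categorical columns. This is an illustration under the assumption that the report relies on metrics of this family; the functions are hypothetical stand-ins, not SDV's code, and both return 1.0 for identical distributions and 0.0 for completely different ones.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for value in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    freq_real, freq_synth = Counter(real), Counter(synthetic)
    categories = set(freq_real) | set(freq_synth)
    distance = 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_synth[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance
```

Scores like these are computed column by column, which is why a single poorly modeled column can pull down an otherwise good overall result.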

    You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:

    from sdv.evaluation.single_table import get_column_plot
    
    fig = get_column_plot(
        real_data=salesDf,
        synthetic_data=synthetic_data,
        column_name='Sales',
        metadata=metadata
    )
       
    fig.show()

    We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Ensure 'Date' columns are datetime
    salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
    synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')
    
    # Extract 'Month' as year-month string
    salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
    synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)
    
    # Group by 'Month' and calculate average sales
    actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
    synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')
    
    # Merge the two series into a DataFrame. Months missing from one dataset
    # stay NaN, so they appear as gaps in the plot rather than as zero sales.
    avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
    plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')
    
    plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
    plt.xlabel('Month')
    plt.ylabel('Average Sales')
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.legend()
    plt.ylim(bottom=0)  # y-axis starts at 0
    plt.tight_layout()
    plt.show()

    This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.
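A visual match can be backed up with a number. The sketch below computes the average monthly sales for each dataset and then the mean absolute percentage gap between them, using only the standard library. The helper names are hypothetical, rows are assumed to be (date string, sales) pairs in the same %d-%m-%Y format as above, and real monthly averages are assumed to be nonzero.

```python
from collections import defaultdict
from datetime import datetime


def monthly_means(rows, date_format="%d-%m-%Y"):
    """Return {'YYYY-MM': average sales} from (date string, sales) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for date_str, sales in rows:
        month = datetime.strptime(date_str, date_format).strftime("%Y-%m")
        totals[month][0] += sales
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in totals.items()}


def mean_abs_pct_gap(real_means, synth_means):
    """Average absolute percentage difference over months present in both datasets."""
    shared = sorted(set(real_means) & set(synth_means))
    if not shared:
        raise ValueError("the two datasets have no months in common")
    gaps = [abs(synth_means[m] - real_means[m]) / abs(real_means[m]) for m in shared]
    return 100.0 * sum(gaps) / len(gaps)
```

A small gap here supports the visual impression from the chart, while a large gap in specific months suggests the model is missing a seasonal pattern.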

    In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.


    The post Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV) appeared first on MarkTechPost.
