
    How to Transform JSON Data to Match Any Schema

    July 10, 2025

    Whether you’re transferring data between APIs or just preparing JSON data for import, mismatched schemas can break your workflow. Learning how to clean and normalize JSON data ensures a smooth, error-free data transfer.

    This tutorial demonstrates how to clean messy JSON and export the results into a new file, based on a predefined schema. The JSON file we’ll be cleaning contains a dataset of 200 synthetic customer records.

    In this tutorial, we’ll apply two methods for cleaning the input data:

    • With pure Python

    • With pandas

    You can apply either of these in your code. But the pandas method is better for large, complex data sets. Let’s jump right into the process.

    Here’s what we’ll cover:

    • Prerequisites

    • Add and Inspect the JSON File

    • Define the Target Schema

    • How to Clean JSON Data with Pure Python

    • How to Clean JSON Data with Pandas

    • How to Validate the Cleaned JSON

    • Pandas vs Pure Python for Data Cleaning

    Prerequisites

    To follow along with this tutorial, you should have a basic understanding of:

    • Python dictionaries, lists, and loops

    • JSON data structure (keys, values, and nesting)

    • How to read and write JSON files with Python’s json module

    Add and Inspect the JSON File

    Before you begin writing any code, make sure that the .json file you intend to clean is in your project directory. This makes it easy to load in your script using the file name alone.

    You can now inspect the data structure by viewing the file locally or loading it in your script, with Python’s built-in json module.

    Here’s how (assuming the file name is “old_customers.json”):

    Code to view or print contents of the raw JSON file in terminal
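    A minimal sketch of that inspection step, assuming the file sits in the project directory. The helper name load_and_inspect is hypothetical and chosen only for illustration:

```python
import json


def load_and_inspect(path):
    """Load a JSON file, report its top-level structure, and print it."""
    with open(path, encoding="utf-8") as file:
        data = json.load(file)

    # Report whether the top level is a dictionary or a list
    print("Top-level type:", type(data).__name__)
    if isinstance(data, dict):
        print("Top-level keys:", list(data.keys()))
    elif isinstance(data, list):
        print("Number of items:", len(data))

    # Print the whole file, indented for readability
    print(json.dumps(data, indent=4))
    return data
```

    Calling load_and_inspect("old_customers.json") would print the structure summary followed by the file contents.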

    This shows you whether the JSON file is structured as a dictionary or a list. It also prints out the entire file in your terminal. Mine is a dictionary that maps to a list of 200 customer entries. You should always open up the raw JSON file in your IDE to get a closer look at its structure and schema.

    Define the Target Schema

    If someone asks for JSON data to be cleaned, it probably means that the current schema is unsuitable for its intended purpose. At this point, you want to be clear on what the final JSON export should look like.

    JSON schema is essentially a blueprint that describes:

    • required fields

    • field names

    • data type for each field

    • standardized formats (for example, lowercase emails, trimmed whitespace, etc.)
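    The standardized formats mentioned in the last point can be expressed as small helper functions. This is a sketch only: the function names are hypothetical, and this tutorial's schema does not require these steps.

```python
def normalize_email(value):
    """Trim surrounding whitespace and lowercase an email address."""
    return value.strip().lower()


def normalize_text(value):
    """Trim the ends and collapse repeated internal whitespace."""
    return " ".join(value.split())


print(normalize_email("  Jane.Doe@Example.COM "))  # jane.doe@example.com
print(normalize_text("  Jane   Doe "))             # Jane Doe
```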

    Here’s what the old schema versus the target schema looks like:

    A screenshot of the old JSON Schema to be transformed

    The expected JSON Schema

    As you can see, the goal is to delete the “customer_id” and “address” fields in each entry and rename the rest from:

    • “name” to “full_name”

    • “email” to “email_address”

    • “phone” to “mobile”

    • “membership_level” to “tier”

    The output should contain 4 response fields instead of 6, all renamed to fit the project requirements.
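    One way to capture these rename rules is a plain mapping from old field names to new ones. The sketch below is illustrative: FIELD_MAP and rename_fields are hypothetical names, and the sample record uses made-up values rather than data from the actual file.

```python
# Old field name -> new field name, taken from the list above
FIELD_MAP = {
    "name": "full_name",
    "email": "email_address",
    "phone": "mobile",
    "membership_level": "tier",
}


def rename_fields(record, field_map=FIELD_MAP):
    """Keep only the mapped fields and rename them; other fields are dropped."""
    return {new: record[old] for old, new in field_map.items()}


# Made-up record with the six fields of the old schema
sample = {
    "customer_id": 1,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "address": "1 Example Street",
    "membership_level": "gold",
}

print(rename_fields(sample))
# {'full_name': 'Jane Doe', 'email_address': 'jane@example.com', 'mobile': '555-0100', 'tier': 'gold'}
```

    Because “customer_id” and “address” are not in the mapping, they never reach the output, which leaves the four fields the target schema expects.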

    How to Clean JSON Data with Pure Python

    Let’s explore using Python’s built-in json module to align the raw data with the predefined schema.

    Step 1: Import json and time modules

    We need json because we’re working with JSON files, and we’ll use the time module to track how long the data cleaning process takes.

    import json
    import time
    

    Step 2: Load the file with json.load()

    start_time = time.time()
    with open('old_customers.json') as file:
        crm_data = json.load(file)
    

    Step 3: Write a function to loop through and clean each customer entry in the dictionary

    def clean_data(records):
        transformed_records = []
        for customer in records["customers"]:
            transformed_records.append({
                "full_name": customer["name"],
                "email_address": customer["email"],
                "mobile": customer["phone"],
                "tier": customer["membership_level"],
            })
        return {"customers": transformed_records}

    new_data = clean_data(crm_data)
    

    clean_data() takes in the original data (temporarily) stored in the records variable, transforming it to match our target schema.

    Since the JSON file we loaded is a dictionary containing a “customers” key, which maps to a list of customer entries, we access this key and loop through each entry in the list.

    In the for loop, we rename the relevant fields and store the cleaned entries in a new list called transformed_records.

    Then, we return the dictionary, with the “customers” key intact.

    Step 4: Save the output in a .json file

    Decide on a name for your cleaned JSON data and assign that to an output_file variable, like so:

    output_file = "transformed_data.json"
    with open(output_file, "w") as f:
        json.dump(new_data, f, indent=4)
    

    You can also add a print() statement below this block to confirm that the file has been saved in your project directory.

    Step 5: Time the data cleaning process

    At the beginning of this process, we imported the time module to measure how long it takes to clean up JSON data using pure Python. To track the runtime, we stored the current time in a start_time variable before the cleaning function, and we’ll now include an end_time variable at the end of the script.

    The difference between the end_time and start_time values gives you the total runtime in seconds.

    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Transformed data saved to {output_file}")
    print(f"Processing data took {elapsed_time:.2f} seconds")
    

    Here’s how long the data cleaning process took with the pure Python approach:

    Script runtime displayed in terminal
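    time.time() reads the wall clock, which is fine for a rough figure. For very short runs, Python’s time.perf_counter() is intended for measuring intervals and may give steadier numbers. A minimal sketch, assuming a hypothetical helper named time_call:

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with a stand-in workload; a cleaning function could be timed the same way
total, seconds = time_call(sum, range(1_000_000))
print(f"Processing took {seconds:.4f} seconds")
```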

    How to Clean JSON Data with Pandas

    Now we’re going to try achieving the same results as above, using Python and a third-party library called pandas. Pandas is an open-source library used for data manipulation and analysis in Python.

    To get started, you need to have the pandas library installed in your Python environment. In your terminal, run:

    pip install pandas
    

    Then follow these steps:

    Step 1: Import the relevant libraries

    import json
    import time
    import pandas as pd
    

    Step 2: Load file and extract customer entries

    Unlike the pure Python method, where we simply indexed the key name “customers” to access the list of customer data, working with pandas requires a slightly different approach.

    We must extract the list before loading it into a DataFrame because pandas builds one row per dictionary in a list. Passing the whole top-level dictionary instead would produce a single “customers” column holding nested records. Extracting the list of customer dictionaries upfront ensures that we isolate and clean the relevant records alone, preventing errors caused by nested or unrelated JSON data.

    start_time = time.time()
    with open('old_customers.json', 'r') as f:
        crm_data = json.load(f)

    # Extract the list of customer entries
    clients = crm_data.get("customers", [])
    

    Step 3: Load customer entries into a DataFrame

    Once you’ve got a clean list of customer dictionaries, load the list into a DataFrame and assign the DataFrame to a variable, like so:

    # Load into a DataFrame
    df = pd.DataFrame(clients)
    

    This creates a tabular or spreadsheet-like structure, where each row represents a customer. Loading the list into a DataFrame also allows you to access pandas’ powerful data cleaning methods like:

    • drop_duplicates(): removes duplicate rows or entries from a DataFrame

    • dropna(): drops rows with any missing or null data

    • fillna(value): replaces all missing or null data with a specified value

    • drop(columns=[...]): drops unused columns explicitly
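    To make these four methods concrete, here is a small sketch on a made-up three-row DataFrame. The records and variable names are hypothetical and are not taken from the tutorial’s dataset.

```python
import pandas as pd

# Made-up records: one exact duplicate and one missing email
df = pd.DataFrame([
    {"name": "Jane Doe", "email": "jane@example.com", "address": "1 Example Street"},
    {"name": "Jane Doe", "email": "jane@example.com", "address": "1 Example Street"},
    {"name": "John Roe", "email": None, "address": "2 Example Street"},
])

deduplicated = df.drop_duplicates()                  # removes the repeated Jane Doe row
complete_only = deduplicated.dropna()                # keeps only rows with no missing values
filled = deduplicated.fillna("N/A")                  # replaces the missing email with "N/A"
without_address = filled.drop(columns=["address"])   # removes the address column

print(without_address)
```

    Each method returns a new DataFrame by default, so the original df is left unchanged.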

    Step 4: Write a custom function to rename relevant fields

    At this point, we need a function that takes in a single customer entry – a row – and returns a cleaned version that fits the target schema (“full_name”, “email_address”, “mobile” and “tier”).

    The function should also handle missing data by setting default values like “Unknown” or “N/A” when a field is absent.

    P.S: At first, I used drop(columns) to explicitly remove the “address” and “customer_id” fields. But it’s not needed in this case, as the transform_fields() function only selects and renames the required fields. Any extra columns are automatically excluded from the cleaned data.
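    A minimal sketch of what such a function could look like. This is an assumption based on the description above, not the author’s original code: the default values and the inner helper pick are illustrative choices.

```python
import pandas as pd


def transform_fields(row):
    """Return one customer entry reshaped to the target schema.

    Works with a DataFrame row (a Series) or a plain dictionary,
    since both provide a .get(key, default) method.
    """
    def pick(key, default):
        value = row.get(key, default)
        # pandas represents missing values as NaN or None, so treat those as absent
        return default if pd.isna(value) else value

    return {
        "full_name": pick("name", "Unknown"),
        "email_address": pick("email", "N/A"),
        "mobile": pick("phone", "N/A"),
        "tier": pick("membership_level", "N/A"),
    }
```

    Fields that are not picked, such as “customer_id” and “address”, simply never appear in the returned dictionary.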

    Step 5: Apply schema transformation to all rows

    We’ll use pandas’ apply() method with axis=1 to apply our custom function to each row in the DataFrame. This creates a Series (for example, 0 → {…}, 1 → {…}, 2 → {…}), which is not JSON-friendly.

    json.dump() cannot serialize a pandas Series, so we’ll call tolist() to convert the Series to a plain list of dictionaries.

    # Apply schema transformation to all rows
    transformed_df = df.apply(transform_fields, axis=1)

    # Convert Series to list of dicts
    transformed_data = transformed_df.tolist()
    

    Another way to approach this is with list comprehension. Instead of using apply() at all, you can write:

    transformed_data = [transform_fields(row) for row in df.to_dict(orient="records")]
    

    orient="records" is an argument for df.to_dict that tells pandas to convert the DataFrame to a list of dictionaries, where each dictionary represents a single customer record (that is, one row).

    Then the for loop iterates through every customer record on the list, calling the custom function on each row. Finally, the list comprehension ([…]) collects the cleaned rows into a new list.

    Step 6: Save the output in a .json file

    # Save the cleaned data
    output_data = {"customers": transformed_data}
    output_file = "applypandas_customer.json"
    with open(output_file, "w") as f:
        json.dump(output_data, f, indent=4)
    

    I recommend picking a different file name for your pandas output. You can inspect both files side by side to see if this output matches the result you got from cleaning with pure Python.
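    Instead of comparing by eye, the two output files can be loaded and compared in code. A minimal sketch, assuming a hypothetical helper named json_files_match:

```python
import json


def json_files_match(path_a, path_b):
    """Return True if both JSON files contain equal data once parsed."""
    with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
        return json.load(a) == json.load(b)
```

    With the file names used in this tutorial, the call would be json_files_match("transformed_data.json", "applypandas_customer.json"). Note that the two outputs will only match exactly if both methods treat missing fields the same way.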

    Step 7: Track runtime

    Once again, check for the difference between start time and end time to determine the program’s execution time.

    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Transformed data saved to {output_file}")
    print(f"Processing data took {elapsed_time:.2f} seconds")
    

    When I used list comprehension to apply the custom function, my script’s runtime was 0.03 seconds, but with pandas’ apply() function, the total runtime dropped to 0.01 seconds.

    Final output preview:

    If you followed this tutorial closely, your JSON output should look like this – whether you used the pandas method or the pure Python approach:

    The expected JSON output after schema transformation

    How to Validate the Cleaned JSON

    Validating your output ensures that the cleaned data follows the expected structure before being used or shared. This step helps to catch formatting errors, missing fields, and wrong data types early.

    Below are the steps for validating your cleaned JSON file:

    Step 1: Install and import jsonschema

    jsonschema is a third-party validation library for Python. It helps you define the expected structure of your JSON data and automatically check if your output matches that structure.

    In your terminal, run:

    pip install jsonschema
    

    Import the required libraries:

    import json
    from jsonschema import validate, ValidationError
    

    validate() checks whether your JSON data matches the rules defined in your schema. If the data is valid, nothing happens. But if there’s an error – like a missing field or wrong data type – it raises a ValidationError.

    Step 2: Define a schema

    A JSON schema is specific to the structure of the file it describes. If your JSON data differs from what we’ve been working with so far, you’ll need to write your own schema (the JSON Schema documentation explains how). Otherwise, the schema below defines the structure we expect for our cleaned JSON:

    schema = {
        "type": "object",
        "properties": {
            "customers": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "full_name": {"type": "string"},
                        "email_address": {"type": "string"},
                        "mobile": {"type": "string"},
                        "tier": {"type": "string"}
                    },
                    "required": ["full_name", "email_address", "mobile", "tier"]
                }
            }
        },
        "required": ["customers"]
    }
    
    • The data is an object that must contain a key: "customers".

    • "customers" must be an array (a list), with each object representing one customer entry.

    • Each customer entry must have four fields–all strings:

      • "full_name"

      • "email_address"

      • "mobile"

      • "tier"

    • The "required" lists ensure that the "customers" key is present and that none of the four fields is missing from any customer record.

    Step 3: Load the cleaned JSON file

    with open("transformed_data.json") as f:
        data = json.load(f)
    

    Step 4: Validate the data

    For this step, we’ll use a try...except block to end the process safely, and display a helpful message if the code raises a ValidationError.

    try:
        validate(instance=data, schema=schema)
        print("JSON is valid.")
    except ValidationError as e:
        print("JSON is invalid:", e.message)
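    validate() raises a single ValidationError, so only one problem is reported per run. To list every problem at once, jsonschema’s validator classes offer iter_errors(). The sketch below is one possible approach: collect_errors, demo_schema, and the sample data are hypothetical, and Draft7Validator is just one of the available validator classes.

```python
from jsonschema import Draft7Validator

# Trimmed-down schema for the demonstration: two required string fields
demo_schema = {
    "type": "object",
    "properties": {
        "customers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "mobile": {"type": "string"},
                    "tier": {"type": "string"},
                },
                "required": ["mobile", "tier"],
            },
        }
    },
    "required": ["customers"],
}


def collect_errors(data, schema):
    """Return one message per validation problem, prefixed with its location."""
    validator = Draft7Validator(schema)
    return [
        f"{list(error.absolute_path)}: {error.message}"
        for error in validator.iter_errors(data)
    ]


# Made-up record with two problems: mobile is a number and tier is missing
bad_data = {"customers": [{"mobile": 5550100}]}
for message in collect_errors(bad_data, demo_schema):
    print(message)
```

    An empty list from collect_errors() means the data passed every check in the schema.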
    

    Pandas vs Pure Python for Data Cleaning

    From this tutorial, you can probably tell that using pure Python to clean and restructure JSON is the more straightforward approach. It is fast and ideal for handling small datasets or simple transformations.

    But as data grows and becomes more complex, you might need advanced data cleaning methods that Python alone does not provide. In such cases, pandas becomes the better choice. It handles large, complex datasets effectively, providing built-in functions for handling missing data and removing duplicates.

    You can study the Pandas cheatsheet to learn more data manipulation methods.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
