
    How to Read and Write Deeply Partitioned Files Using Apache Spark

    September 1, 2025

    If you use Apache Spark to build your data pipelines, you might need to export or copy data from a source to a destination while preserving the partition folder structure between the two.

    When I researched how to do this in Spark, I found very few tutorials giving an end-to-end solution that worked, especially when the partitions are deeply nested and you don’t know beforehand the values these folder names will take (for example year=*/month=*/day=*/hour=*/*.csv).

    In this tutorial, I provide one such implementation using Spark.

    Here’s what we’ll cover:

    • Prerequisites

    • Setup

    • False Starts

    • My Solution

    • Conclusion

    Prerequisites

    To follow along with this tutorial, you need a basic understanding of distributed computing with frameworks like Hadoop and Spark, as well as familiarity with object-oriented languages like Scala or Java. The code was tested with the following dependencies (a minimal build.sbt sketch follows the list):

    • Scala 2.12+

    • Java 17 (earlier versions might work)

    • sbt
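
    If you're building with sbt, a minimal build.sbt along these lines should work. This is only a sketch: the project name and the exact Scala and Spark versions shown here are assumptions, so adjust them to your environment.

    // Hypothetical minimal build.sbt; the versions below are assumptions.
    ThisBuild / scalaVersion := "2.12.18"

    lazy val root = (project in file("."))
      .settings(
        name := "partitioned-reader-writer",
        // spark-sql provides the DataFrame read/write APIs used in this tutorial.
        libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
      )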

    Setup

    I’m assuming the source contains partition folders created with the pattern below (a standard Hive-style layout for date-time partition columns):

    year=<year>/month=<month>/day=<day>/hour=<hour>

    Crucially, as I mentioned above, I’m assuming that you don’t know the full names of the folders, only that they follow a constant prefix pattern.
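
    If you don’t already have data in this layout, you can generate a small sample source tree to experiment with. The snippet below is only a sketch: it assumes the same local paths and the same State/Color/Count schema used in the main example later in this tutorial, and it runs against a local Spark master.

    import org.apache.spark.sql.SparkSession

    // Hypothetical helper that writes a tiny partitioned CSV dataset under
    // data/partitioned_files_source/user/year=YYYY/month=MM/day=DD/hour=HH/.
    object GenerateSampleData {
        def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder
                                    .appName("GenerateSampleData")
                                    .master("local[*]")
                                    .getOrCreate()
            import spark.implicits._

            val sample = Seq(
                ("CA", "Blue",  10, "2024", "01", "05", "00"),
                ("TX", "Red",    7, "2024", "01", "05", "01"),
                ("NY", "Green",  3, "2024", "02", "17", "23")
            ).toDF("State", "Color", "Count", "year", "month", "day", "hour")

            sample.write
                  .format("csv")
                  .option("header", "true")
                  .mode("overwrite")
                  .partitionBy("year", "month", "day", "hour")
                  .save("data/partitioned_files_source/user")

            spark.stop()
        }
    }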

    False Starts

    1. If you think of using the recursiveFileLookup and pathGlobFilter options for both reading and writing, it doesn’t quite work, because these options are only available on the read API.

    2. If you think of parameterizing the read and write over all possible year/month/day/hour combinations and skipping the export when the corresponding partition folder is not found, it might work, but it won’t be very efficient (see the sketch after this list).
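
    For illustration, the second idea could look roughly like the sketch below. It assumes spark, sourceBasePath, and destinationBasePath are already defined as in the main example further down. Every combination triggers its own folder-existence check and a separate small read/write job, which is why it doesn’t scale well.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical sketch of the brute-force approach: enumerate every possible
    // partition, check whether it exists, and copy it on its own if it does.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    for {
        year  <- 2023 to 2024
        month <- 1 to 12
        day   <- 1 to 31
        hour  <- 0 to 23
    } {
        val partition = f"year=$year%04d/month=$month%02d/day=$day%02d/hour=$hour%02d"
        val src = new Path(s"$sourceBasePath/$partition")
        if (fs.exists(src)) {
            spark.read.option("header", "true").csv(src.toString)
                 .write.option("header", "true").mode("overwrite")
                 .csv(s"$destinationBasePath/$partition")
        }
    }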

    My Solution

    After a few trials and errors and some searching on Stack Overflow and in the Spark documentation, I hit upon the idea of combining input_file_name() and regexp_extract() to recover the partition values, and partitionBy() on the write side to reproduce them at the destination. You can find Scala-based sample code below:

    <span class="hljs-keyword">package</span> main.scala.blog
    
    <span class="hljs-comment">/**
    *  Spark stream example code to read and write from a partitioned folder
    *  to a partitioned folder without explicitly known datetime.
    */</span>
    
    <span class="hljs-keyword">import</span> org.apache.spark.sql.<span class="hljs-type">SparkSession</span>
    <span class="hljs-keyword">import</span> org.apache.spark.sql.types.<span class="hljs-type">StringType</span>
    <span class="hljs-keyword">import</span> org.apache.spark.sql.functions.{udf, input_file_name, col, lit, regexp_extract}
    
    <span class="hljs-class"><span class="hljs-keyword">object</span> <span class="hljs-title">PartitionedReaderWriter</span> </span>{
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span></span>(args: <span class="hljs-type">Array</span>[<span class="hljs-type">String</span>]) {
            <span class="hljs-comment">// 1.</span>
            <span class="hljs-keyword">val</span> spark = <span class="hljs-type">SparkSession</span>
                        .builder
                        .appName(<span class="hljs-string">"PartitionedReaderWriterApp"</span>)
                        .getOrCreate()
    
            <span class="hljs-keyword">val</span> sourceBasePath = <span class="hljs-string">"data/partitioned_files_source/user"</span>
            <span class="hljs-comment">// 2.</span>
            <span class="hljs-keyword">val</span> sourceDf = spark.read
                                .format(<span class="hljs-string">"csv"</span>)
                                .schema(<span class="hljs-string">"State STRING, Color STRING, Count INT"</span>)
                                .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>)
                                .option(<span class="hljs-string">"pathGlobFilter"</span>, <span class="hljs-string">"*.csv"</span>)
                                .option(<span class="hljs-string">"recursiveFileLookup"</span>, <span class="hljs-string">"true"</span>)
                                .load(sourceBasePath)
    
            <span class="hljs-keyword">val</span> destinationBasePath = <span class="hljs-string">"data/partitioned_files_destination/user"</span>
            <span class="hljs-comment">// 3.</span>
            <span class="hljs-keyword">val</span> writeDf = sourceDf
                            .withColumn(<span class="hljs-string">"year"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"year=(\d{4})"</span>, <span class="hljs-number">1</span>))
                            .withColumn(<span class="hljs-string">"month"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"month=(\d{2})"</span>, <span class="hljs-number">1</span>))
                            .withColumn(<span class="hljs-string">"day"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"day=(\d{2})"</span>, <span class="hljs-number">1</span>))
                            .withColumn(<span class="hljs-string">"hour"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"hour=(\d{2})"</span>, <span class="hljs-number">1</span>))
    
            <span class="hljs-comment">// 4.</span>
            writeDf.write
                    .format(<span class="hljs-string">"csv"</span>)
                    .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>)
                    .mode(<span class="hljs-string">"overwrite"</span>)
                    .partitionBy(<span class="hljs-string">"year"</span>, <span class="hljs-string">"month"</span>, <span class="hljs-string">"day"</span>, <span class="hljs-string">"hour"</span>)
                    .save(destinationBasePath)
    
            <span class="hljs-comment">// 5.</span>
            spark.stop()        
        }
    }
    

    Here’s what’s going on in the above code:

    1. Inside the main method, you begin with the Spark initialization code that creates a Spark session.

    2. You read the data from sourceBasePath using the Spark read API with the csv format (you can also optionally provide the schema). The recursiveFileLookup and pathGlobFilter options are needed to read recursively through the nested folders and to match only CSV files, respectively.

    3. The next section contains the core logic: you use input_file_name() to get the full path of each row’s source file and regexp_extract() to pull year, month, day, and hour out of the corresponding subfolder names, storing them as auxiliary columns on the dataframe (a quick sanity check of this extraction is sketched after this list).

    4. Finally, you write the dataframe using the csv format again and, crucially, use partitionBy to declare the previously created auxiliary columns as partition columns, then save the dataframe to destinationBasePath.

    5. After the copy is done, you stop the Spark session by calling the stop() API.
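
    If you want to convince yourself that step 3 extracts what you expect, a quick check like the one below can help. The path literal is made up purely for illustration, and the spark session from the example is assumed to be in scope.

    import org.apache.spark.sql.functions.{lit, regexp_extract}

    // Hypothetical sanity check: run the same regex against a made-up file path.
    spark.range(1)
         .select(
             regexp_extract(
                 lit("data/partitioned_files_source/user/year=2024/month=01/day=05/hour=00/part-00000.csv"),
                 "year=(\\d{4})",
                 1
             ).as("year")
         )
         .show()
    // Expect a single row whose year column is "2024".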

    Conclusion

    In this article, I have shown you how to export or copy deeply nested data files from a source to a destination efficiently using Apache Spark. I hope you find it useful!

    You can read my other articles at https://www.beyonddream.me.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
