
    How to Read and Write Deeply Partitioned Files Using Apache Spark

    September 1, 2025

If you use Apache Spark to build your data pipelines, you might need to export or copy data from a source to a destination while preserving the partition folder structure between the two.

When I researched how to do this in Spark, I found very few tutorials that gave a working end-to-end solution – especially when the partitions are deeply nested and you don’t know beforehand the values the folder names will take (for example year=*/month=*/day=*/hour=*/*.csv).

In this tutorial, I provide one such implementation using Spark.

    Here’s what we’ll cover:

• Prerequisites

    • Setup

    • False Starts

    • My Solution

    • Conclusion

Prerequisites

To follow along with this tutorial, you need a basic understanding of distributed computing frameworks like Hadoop and Spark, as well as some familiarity with object-oriented languages like Scala or Java. The code was tested with the dependencies below (a minimal build sketch follows the list):

    • Scala 2.12+

    • Java 17 (earlier versions might work)

• sbt
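
If you want to compile along, a minimal build.sbt might look like the sketch below. The exact Scala and Spark versions here are assumptions, not from the original article; pick whichever releases match your environment.

// build.sbt – minimal sketch; versions are assumptions
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"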

    Setup

I’m assuming the source creates partition folders with the pattern below, which is a standard date-time partitioning layout:

year=YYYY/month=MM/day=DD/hour=HH

Crucially, as I mentioned above, I’m assuming that you don’t know the full names of these folders – only that they share a constant prefix pattern (year=, month=, and so on).
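
For example, a source tree matching this pattern might look like the following (the dates and file names are hypothetical placeholders):

data/partitioned_files_source/user/
    year=2025/
        month=09/
            day=01/
                hour=05/
                    events.csv
                hour=06/
                    events.csv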

    False Starts

1. Using the recursiveFileLookup and pathGlobFilter options for both reading and writing doesn’t quite work, because these options are only available on the read API – the write API has no equivalent.

2. Parameterizing the read and write over every possible year/month/day/hour combination, and skipping the export whenever the corresponding partition folder doesn’t exist, does work – but it isn’t very efficient (see the sketch below).
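
To make the second false start concrete, here is a rough sketch of what that brute-force enumeration might look like. The date ranges and the filesystem existence check are my assumptions for illustration, not code from the original article:

import org.apache.hadoop.fs.Path

// Hypothetical sketch of false start 2: enumerate every candidate
// partition and copy it only when the source folder exists.
// Correct, but it launches one read/write job per partition.
val fs = new Path(sourceBasePath)
  .getFileSystem(spark.sparkContext.hadoopConfiguration)

for {
  y <- 2024 to 2025 // assumed year range
  m <- 1 to 12
  d <- 1 to 31
  h <- 0 until 24
} {
  val suffix = f"year=$y%04d/month=$m%02d/day=$d%02d/hour=$h%02d"
  if (fs.exists(new Path(s"$sourceBasePath/$suffix"))) {
    spark.read
      .format("csv")
      .option("header", "true")
      .load(s"$sourceBasePath/$suffix")
      .write
      .format("csv")
      .option("header", "true")
      .save(s"$destinationBasePath/$suffix")
  }
}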

    My Solution

After some trial and error, and searching Stack Overflow and the Spark documentation, I hit upon the idea of combining the input_file_name(), regexp_extract(), and partitionBy() APIs on the write side to achieve the end goal. You can find Scala-based sample code below:

package main.scala.blog

/**
 *  Spark example code to read from a partitioned folder and write to a
 *  partitioned folder without knowing the date-time values in advance.
 */

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

object PartitionedReaderWriter {

    def main(args: Array[String]): Unit = {
        // 1. Create the Spark session.
        val spark = SparkSession
                    .builder
                    .appName("PartitionedReaderWriterApp")
                    .getOrCreate()

        val sourceBasePath = "data/partitioned_files_source/user"
        // 2. Recursively read every CSV file under the source base path.
        val sourceDf = spark.read
                            .format("csv")
                            .schema("State STRING, Color STRING, Count INT")
                            .option("header", "true")
                            .option("pathGlobFilter", "*.csv")
                            .option("recursiveFileLookup", "true")
                            .load(sourceBasePath)

        val destinationBasePath = "data/partitioned_files_destination/user"
        // 3. Recover the partition values from each row's source file path.
        val writeDf = sourceDf
                        .withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
                        .withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
                        .withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1))
                        .withColumn("hour", regexp_extract(input_file_name(), "hour=(\\d{2})", 1))

        // 4. Write the data back out, partitioned by the recovered columns.
        writeDf.write
                .format("csv")
                .option("header", "true")
                .mode("overwrite")
                .partitionBy("year", "month", "day", "hour")
                .save(destinationBasePath)

        // 5. Stop the Spark session.
        spark.stop()
    }
}

    Here’s what’s going on in the above code:

1. Inside the main method, you begin with the Spark initialization code that creates a Spark session.

2. You read the data from sourceBasePath using the Spark read API with the format set to csv (you can also optionally provide the schema). The recursiveFileLookup and pathGlobFilter options are needed to read recursively through nested folders and to match only .csv files, respectively.

3. The next section contains the core logic: input_file_name() returns the full path of the file each row was read from, and regexp_extract() pulls year, month, day, and hour out of the corresponding subfolders in that path, storing them as auxiliary columns on the DataFrame. For instance, a row read from a path containing year=2025/month=09/day=01/hour=05 gets the values 2025, 09, 01, and 05.

4. Finally, you write the DataFrame using the csv format again and, crucially, use partitionBy() to declare the previously created auxiliary columns as partition columns. Then you save the DataFrame under destinationBasePath.

    5. After the copy is done, you stop the Spark session by calling the stop() API.
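
To try this end to end, you can package the object with sbt and run it locally via spark-submit. Here is a sketch; the jar name is a placeholder for whatever your build actually produces:

sbt package
spark-submit \
  --class main.scala.blog.PartitionedReaderWriter \
  --master "local[*]" \
  target/scala-2.12/<your-artifact>.jar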

    Conclusion

In this article, I’ve shown you how to export or copy deeply nested data files from a source to a destination efficiently using Apache Spark. I hope you find it useful!

    You can read my other articles at https://www.beyonddream.me.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
