    Run complex queries on massive amounts of data stored on your Amazon DocumentDB clusters using Apache Spark running on Amazon EMR

    April 10, 2024

    In this post, we demonstrate how to set up Amazon EMR to run complex queries on massive amounts of data stored in your Amazon DocumentDB (with MongoDB compatibility) clusters using Apache Spark.

Amazon DocumentDB (with MongoDB compatibility) is a fully managed native JSON document database that makes it easy and cost-effective to operate critical document workloads at virtually any scale without managing infrastructure. You can use the same application code, drivers, and tools written for the MongoDB API (versions 3.6, 4.0, and 5.0) to run, manage, and scale workloads on Amazon DocumentDB without worrying about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it straightforward to store, query, and index JSON data.

Apache Spark is an open source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

    Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML) using open source frameworks such as Apache Spark, Apache Hive, and Presto.

Based on your cluster usage, your Amazon DocumentDB storage can grow up to 128 TiB. With Amazon DocumentDB elastic clusters, storage can grow up to 4 PiB. In some use cases, you may need to run an analytics workload on this massive amount of data stored in one or more Amazon DocumentDB clusters, often with ad hoc query patterns or the need to scan large volumes of data. Apache Spark is a great choice for running ad hoc queries on such massive data, especially when they require memory-intensive filtering on non-indexed fields, multiple stages of processing, fault tolerance between stages, or joining multiple data sources. Because Amazon DocumentDB is MongoDB API compatible, you can run these data jobs on Amazon DocumentDB using the MongoDB Spark connector.

The following are some examples where you can use Apache Spark to run data jobs against Amazon DocumentDB:

A data analytics team wants to augment ad hoc reports with unstructured, unindexed data stored in Amazon DocumentDB without replicating the data to another storage system.
A data science team wants to run complex DS/ML pipelines using Apache Spark and subsequently write the results to downstream storage.
A data analytics team wants to combine multiple data sources, for example live data from Amazon DocumentDB and archived data from Amazon Simple Storage Service (Amazon S3), to generate unified reports.

    All of these jobs have some common characteristics:

They require computation over large volumes of data that don't fit into the memory of a single machine.
They involve ad hoc access patterns that aren't always supported by the indexes created in Amazon DocumentDB for application usage.
To complete successfully within an acceptable runtime, they need to run in a distributed environment with ample compute capacity and failure recovery mechanisms in place.

    Solution overview

    In the following sections, we show how to create an Amazon DocumentDB cluster and load data, and create and configure an EMR cluster with Apache Spark. After the solution is in place, you can run a sample Spark application and run a sample query.

    Prerequisites

    We assume that you have the following prerequisites:

Basic service knowledge of operating and developing applications with Amazon DocumentDB and Amazon EMR
The appropriate permissions to create, delete, and modify the services and resources we use in your AWS account
An Amazon EC2 key pair

    This solution involves setting up and using AWS resources, so it will incur costs in your account. Refer to AWS Pricing for more information. We strongly recommend that you set this up in a non-production instance and run end-to-end validations before you implement this solution in a production environment.

    Create an Amazon DocumentDB cluster

    You can use an existing instance-based cluster or create a new Amazon DocumentDB instance-based cluster. You can also use Amazon DocumentDB elastic clusters that elastically scale to handle millions of reads and writes per second with petabytes of storage.
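If you prefer the AWS CLI, the following is a minimal sketch for creating a new instance-based cluster. The identifiers, credentials, and instance class are placeholders; replace them with your own values and settings:

# Create the cluster
aws docdb create-db-cluster \
    --db-cluster-identifier docdb-spark-cluster \
    --engine docdb \
    --master-username <<documentdb_user>> \
    --master-user-password <<documentdb_password>>

# Add an instance to the cluster to serve reads and writes
aws docdb create-db-instance \
    --db-instance-identifier docdb-spark-instance-1 \
    --db-instance-class db.r5.large \
    --engine docdb \
    --db-cluster-identifier docdb-spark-cluster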

    Load data into the Amazon DocumentDB cluster

To load data into the Amazon DocumentDB cluster, use the following code from an Amazon Elastic Compute Cloud (Amazon EC2) instance that can connect to Amazon DocumentDB and has the mongorestore utility installed (this data was generated using mongodump version 100.9.4):

Download the Amazon DocumentDB Certificate Authority (CA) certificate required to authenticate to your cluster:

    wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

Download the data.gz file from the GitHub location. This file contains mock data generated using Steps to run NoSQLBench against Amazon DocumentDB.
Run the following mongorestore command to load the data into your Amazon DocumentDB cluster. This loads 1 million documents.

mongorestore --gzip --uri="<<documentdb_uri>>" --nsInclude="test_emr_db.test_emr_coll" --drop --numInsertionWorkersPerCollection=10 --archive=data.gz

If the command runs successfully, mongorestore reports the number of restored documents.

    The following is a sample document snippet:

{
    "_id" : "18566647",
    "user_id" : "3dcc6cbb-4864-493b-a8cb-f35b06c09a0c",
    "created_on" : 1414491227,
    "gender" : "F",
    "full_name" : "Anthony Gammon",
    "married" : true,
    "address" : {
        "primary" : {
            "city" : "Port Gamble",
            "cc" : "VC"
        },
        "secondary" : {
        }
    }
    ...
}

Create IAM roles for EMR

If this is the first time you're launching an EMR cluster, create the default IAM service roles, which are required to launch the EMR cluster, using the following AWS CLI command:

aws emr create-default-roles

The command creates the following two IAM roles:

EMR_DefaultRole
EMR_EC2_DefaultRole

    Create an EMR cluster with Apache Spark

    In this section, we walk through the steps to create and configure an EMR cluster with Apache Spark.

    Choose the MongoDB Spark connector

    The MongoDB Spark connector version 10.2.x uses the $collStats operator to create default read partitions. Because Amazon DocumentDB doesn’t support this operator as of this writing, you need to use the MongoDB Spark connector version 3.0.2. Note that you can also write your own partitioner class with a higher version of the Spark connector, but that is beyond the scope of this post.
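For reference, connector 3.0.2 also lets you select one of its built-in partitioners through configuration instead of writing your own. The following spark-shell flags are an illustrative sketch; the partitioner choice and partition size are assumptions you should validate against your Amazon DocumentDB version and dataset before relying on them:

# Assumption: MongoSamplePartitioner and a 64 MB partition size suit your workload
spark-shell --conf "spark.mongodb.input.partitioner=MongoSamplePartitioner" \
    --conf "spark.mongodb.input.partitionerOptions.partitionSizeMB=64" \
    --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2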

    Choose the Apache Spark version

The MongoDB Spark connector version 3.0.2 supports Spark version 3.1.x. Therefore, you need to choose an Amazon EMR release that includes this version of Spark. Amazon EMR version 6.5.0 is the latest release that ships with Spark 3.1.x, as per the chart provided in Application versions in Amazon EMR 6.x releases.

    Prepare and upload the bootstrap script to load CA certs to the default Java trust store

When connecting to a TLS-enabled Amazon DocumentDB cluster from a Java Spark application on an EMR cluster, the Spark driver node as well as each of the Spark worker nodes must have the certificates imported into the default Java trust store, as described in Connecting Programmatically to Amazon DocumentDB. To achieve that, you need to create a script file and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. You provide the S3 URI of this file as a bootstrap script while creating the EMR cluster.

    Complete the following steps:

    Create an S3 bucket using the instructions in Store and Retrieve a File with Amazon S3.
Create a file on your local machine and name it docdbcerts.sh. Copy the following content into the file and save it.

#!/bin/bash
# Download the Amazon DocumentDB CA bundle, split it into individual
# certificates, and import each one into the default Java trust store.
mydir=/tmp
truststore=/etc/pki/java/cacerts
storepassword=changeit
curl -sS "https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem" > ${mydir}/global-bundle.pem
# Split the bundle into one rds-ca-N.pem file per certificate
awk 'split_after == 1 {n++;split_after=0} /-----END CERTIFICATE-----/ {split_after=1}{print > "rds-ca-" n ".pem"}' < ${mydir}/global-bundle.pem
for CERT in rds-ca-*; do
  # Use the certificate's CN as its alias in the trust store
  alias=$(openssl x509 -noout -text -in $CERT | perl -ne 'next unless /Subject:/; s/.*(CN=|CN = )//; print')
  echo "Importing $alias"
  sudo keytool -import -file ${CERT} -alias "${alias}" -storepass ${storepassword} -keystore ${truststore} -noprompt
  rm $CERT
done
rm ${mydir}/global-bundle.pem
# List the imported certificates and their expiry dates
echo "Trust store content is: "
keytool -list -v -keystore "$truststore" -storepass ${storepassword} | grep Alias | cut -d " " -f3- | while read alias
do
  expiry=$(keytool -list -v -keystore "$truststore" -storepass ${storepassword} -alias "${alias}" | grep Valid | perl -ne 'if(/until: (.*?)\n/) { print "$1\n"; }')
  echo " Certificate ${alias} expires in '$expiry'"
done

    Upload the docdbcerts.sh file to the S3 bucket using the instructions in Store and Retrieve a File with Amazon S3.
    Note the S3 URI of the file that you uploaded.
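Alternatively, you can upload the script with the AWS CLI; the bucket name is a placeholder for your own bucket:

aws s3 cp docdbcerts.sh s3://<<your_bucket_name>>/docdbcerts.sh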

    Create an EMR cluster

    Complete the following steps to create your EMR cluster:

    On the Amazon EMR console, choose Create cluster.

    For Name, enter docdb-spark-emr-cluster.
    For Amazon EMR release, choose emr-6.5.0.
    For Application bundle, keep the default Spark software.

In the Node configuration section, change the value for EBS root volume to 50 GB. This is just an example; adjust the value depending on your job's local storage requirements.

In the Cluster scaling and provisioning section, set the number of core and task instances to 3.

In the Networking section, for Virtual private cloud (VPC), choose your Amazon DocumentDB cluster's VPC. Choose an existing public subnet or create one.

    Choose Add to add a bootstrap action.
    For Name, enter a name.
    For Script location, enter the S3 URI of the bootstrap script you uploaded earlier.
    Choose Add bootstrap action.

For Amazon EC2 key pair for SSH to the cluster, choose a key pair so you can log in to the primary node of the EMR cluster.

    In the Identity and Access Management (IAM) roles section, select EMR_DefaultRole for Amazon EMR service role and EMR_EC2_DefaultRole for EC2 instance profile for Amazon EMR.

    Leave the remaining settings as default and choose Create cluster.
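If you prefer to script the setup, a broadly equivalent cluster can be created from the AWS CLI. The following is a minimal sketch under the settings described earlier; the key pair, subnet, bucket, and instance type are placeholders to adjust for your environment:

aws emr create-cluster \
    --name docdb-spark-emr-cluster \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=<<your_key_pair>>,SubnetId=<<your_subnet_id>>,InstanceProfile=EMR_EC2_DefaultRole \
    --service-role EMR_DefaultRole \
    --bootstrap-actions Path=s3://<<your_bucket_name>>/docdbcerts.sh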

    You can monitor the status of your cluster on the Amazon EMR console.

    Choose the cluster ID, scroll down to Network and security, and expand EC2 security groups (firewall).
    Note down the EMR managed security groups for the primary node and core and task nodes.

    Modify security groups

    Complete the following steps to modify your security groups:

Add inbound rules to your Amazon DocumentDB cluster's security group to allow incoming traffic on port 27017 from the security groups you noted earlier.
Add an inbound rule to the primary node security group to allow SSH from your local machine's IP address.
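If you prefer the AWS CLI, the following sketch adds both rules; the security group IDs and IP address are placeholders:

# Allow an EMR node security group to reach Amazon DocumentDB on port 27017
# (repeat for each EMR security group you noted earlier)
aws ec2 authorize-security-group-ingress \
    --group-id <<documentdb_security_group_id>> \
    --protocol tcp --port 27017 \
    --source-group <<emr_node_security_group_id>>

# Allow SSH to the EMR primary node from your local machine
aws ec2 authorize-security-group-ingress \
    --group-id <<emr_primary_security_group_id>> \
    --protocol tcp --port 22 \
    --cidr <<your_local_ip>>/32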

    Run a sample Spark application

    Complete the following steps to run a sample Spark application:

    SSH to the primary node of the EMR cluster.

    Launch the Spark shell with the MongoDB Spark connector:

spark-shell --conf "spark.mongodb.output.uri=<<docdb_uri>>" --conf "spark.mongodb.input.uri=<<docdb_uri>>" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2

The Amazon DocumentDB URI that you provide to Spark should contain the default database and collection name. Also, remove the tlsCAFile option from the URI; Java looks up TLS certificates from the trust store we populated during the bootstrap action. In our case, it would look like the following code:

mongodb://<<documentdb_user>>:<<documentdb_password>>@<<documentdb_cluster_endpoint>>:27017/test_emr_db.test_emr_coll?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false

    Run a query

Let's find the distribution of the data over the cities identified by the nested field address.primary.city. This is a straightforward query for demonstration purposes; in real-world applications, you would use Spark for much more complex and compute-intensive scenarios. Spark is an effective tool when the query requires distributed computing on large volumes of data, with resource (CPU and memory) demands that exceed what a single machine can provide.

Import the MongoDB Spark library:

    import com.mongodb.spark._

Create a Spark DataFrame over the database and collection provided when you launched the Spark shell (in the Amazon DocumentDB URI):

    val df = MongoSpark.load(spark)

    Run the following code to view the structure of data as interpreted by Spark:

    df.printSchema()

    Before you run SQL queries on your dataset, you must register a temporary view for the dataset:

df.createOrReplaceTempView("data")

    Run a query on the view:

val city_data = spark.sql("SELECT address.primary.city, count(*) FROM data GROUP BY address.primary.city ORDER BY count(*)")

    city_data.show()

After the task is complete, the results are displayed. Note that, by default, only 20 rows are shown, but you can override this setting, as in the following example.
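show is a standard Spark Dataset method; the row count here is arbitrary, and the second argument disables column truncation:

city_data.show(50, false)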

In this section, we showed you how to use the distributed query processing capabilities of Apache Spark running on an EMR cluster to run a query on data stored in Amazon DocumentDB. As a next step, identify the analytics jobs you run on your large Amazon DocumentDB datasets and run them using the approach we described. This exercise will help you understand the potential benefits of running these jobs with Spark.

    Clean up

    Complete the following steps to clean up your resources:

    Terminate the EMR cluster.
    Remove the security group settings from the Amazon DocumentDB cluster that allows incoming traffic from the EMR cluster.
Optionally, delete the Amazon DocumentDB test data. Use the following commands from the mongo shell:

    use test_emr_db

    db.dropDatabase()

    Optionally, delete the Amazon DocumentDB cluster.
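Optionally, you can script these cleanup steps with the AWS CLI; the identifiers are placeholders for the resources you created:

# Terminate the EMR cluster
aws emr terminate-clusters --cluster-ids <<emr_cluster_id>>

# Delete the Amazon DocumentDB instance, then the cluster
aws docdb delete-db-instance --db-instance-identifier <<documentdb_instance_id>>
aws docdb delete-db-cluster --db-cluster-identifier <<documentdb_cluster_id>> --skip-final-snapshot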

    Conclusion

In this post, we showed you how to set up an EMR cluster with the appropriate software versions and connect to a TLS-enabled Amazon DocumentDB cluster using a simple Spark application. Apply all the best practices relating to Amazon EMR, Amazon DocumentDB, Apache Spark, and the MongoDB Spark connector when creating a production-grade application.

    Leave your thoughts and questions in the comments.

    About the author

Sourav Biswas is a Senior DocumentDB Specialist Solutions Architect at Amazon Web Services (AWS). He has been helping AWS DocumentDB customers successfully adopt the service and implement best practices around it. Before joining AWS, he worked extensively as an application developer and solutions architect for various NoSQL vendors.

     
