Supplementing Salesforce with Databricks as an enterprise Lakehouse solution brings advantages for various personas across an organization. Customer experience data is highly valued when it comes to driving personalized customer journeys that leverage company-wide applications beyond Salesforce. From enhanced customer satisfaction to tailored engagements and offerings that drive business renewals and expansions, the advantages are hard to miss. Databricks maps data from a variety of enterprise apps, including those used by Sales, Marketing, and Finance. Consequently, layering Databricks Generative AI and predictive ML capabilities on top provides easily accessible, best-fit recommendations that help eliminate challenges and highlight areas of success within your company's customer base.
In this blog, I elaborate on the different methods for making Salesforce data accessible from within Databricks. While accessing Databricks data from Salesforce is possible, it is not the topic of this post and may be tackled in a later blog. I focus on the built-in capabilities of both Salesforce and Databricks and have therefore excluded third-party data integration platforms. There are three main ways to achieve this integration:
- Databricks Lakeflow Ingestion from Salesforce
- Databricks Query Federation from Salesforce Data Cloud
- Databricks Files Sharing from Salesforce Data Cloud
Choosing the best approach depends on your use case. The decision is driven by several factors, such as the expected latency of accessing the latest Salesforce data, the complexity of the data transformations needed, and the volume of Salesforce data of interest. It may very well be that more than one method is implemented to cater to different requirements.
While the first method copies the raw Salesforce data over to Databricks, methods 2 and 3 offer zero-copy alternatives, using Salesforce Data Cloud itself as the raw data layer. The zero-copy alternatives are attractive because they rely on Salesforce's native ability to manage its own data lake, eliminating the overhead of duplicating that effort. However, they come with limitations, depending on the use case. The matrix below shows how each method compares against the key integration criteria.
| Method | Lakeflow Ingestion | Salesforce Data Cloud Query Federation | Salesforce Data Cloud File Sharing |
|---|---|---|---|
| Type | Data Ingestion | Zero-Copy | Zero-Copy |
| Supports Salesforce Data Cloud as a Source? | ✔ | ✔ | ✔ |
| Incremental Data Refreshes | ✔ | ✔ (requires custom handling if copying to Databricks) | ✔ (requires custom handling if copying to Databricks) |
| Processing of Soft Deletes | ✔ | ✔ (requires custom handling if copying to Databricks) | ✔ (requires custom handling if copying to Databricks) |
| Processing of Hard Deletes | ✘ Requires a full refresh | ✔ (requires custom handling if copying to Databricks) | ✔ (requires custom handling if copying to Databricks) |
| Query Response Time | ✔ | ✔ | ✔ |
| Supports Real-Time Querying? | ✘ No; the pipeline runs on a schedule to copy data (for example, hourly or daily) | ✔ Live query execution on SF Data Cloud | ✔ Live data sourced from SF Data Cloud |
| Supports Databricks Streaming Pipelines? | ✔ | ✘ No | ✘ No |
| Suitable for High Data Volume? | ✔ The SF Bulk API is called for high data volumes (such as initial loads), and the SF REST API is used for lower data volumes (such as incremental loads) | ✘ No; limited by JDBC query pushdown constraints and SF performance | ✔ More suitable than Query Federation for zero-copy access to high data volumes |
| Supports Data Transformation? | ✔ | ✔ | ✔ |
| Protocol | SF REST API and Bulk API over HTTPS | JDBC over HTTPS | Salesforce Data Cloud DaaS APIs over HTTPS (file-based access) |
| Scalability | Up to 250 objects per pipeline; multiple pipelines are allowed | Depends on SF Data Cloud performance when running transformations across multiple objects | Up to 250 Data Cloud objects per data share; up to 10 data shares |
| Salesforce Prerequisites | API-enabled Salesforce user with access to the desired objects | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs, with Streams or other methods populating the Data Lake; JDBC API access to Data Cloud enabled | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs, with Streams or other methods populating the Data Lake; a data share target created in SF containing the shared objects |
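To make the trade-offs more concrete, the sketches below illustrate how each method is typically consumed from the Databricks side. First, a minimal Query Federation example, assuming a workspace admin has already created a Salesforce Data Cloud connection and a foreign catalog; the catalog, schema, object, and field names are hypothetical placeholders.

```python
# Minimal sketch of consuming Query Federation from a Databricks notebook.
# Assumes a foreign catalog named "sf_data_cloud" already exists over a
# Salesforce Data Cloud connection; all names below are placeholders.
from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the Databricks notebook runtime.
individuals = spark.table("sf_data_cloud.default.unified_individual__dlm")

# Simple filters and aggregations can be pushed down to Data Cloud over JDBC;
# heavier transformations are constrained by pushdown limits and SF performance.
by_country = (
    individuals
    .filter(F.col("ssot__Country__c").isNotNull())
    .groupBy("ssot__Country__c")
    .agg(F.count("*").alias("individual_count"))
)

by_country.show()
```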
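Where the matrix says "requires custom handling if copying to Databricks," that handling typically amounts to a watermark-driven incremental read followed by a MERGE into a Delta table keyed on the Salesforce record Id. The sketch below is one way to do it, assuming the source exposes a SystemModstamp column and an IsDeleted soft-delete flag; all table and column names are illustrative, not prescribed by either platform.

```python
# Sketch of custom incremental handling when materializing federated or
# file-shared Salesforce data into a Delta table. Table and column names
# (account__dlm, bronze.salesforce.account, Id, IsDeleted, SystemModstamp)
# are assumptions for illustration only.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read only the records changed since the last run.
last_watermark = "2024-01-01T00:00:00Z"  # normally looked up from a control table
changes = (
    spark.table("sf_data_cloud.default.account__dlm")
    .filter(F.col("SystemModstamp") > F.lit(last_watermark))
)

target = DeltaTable.forName(spark, "bronze.salesforce.account")

(
    target.alias("t")
    .merge(changes.alias("s"), "t.Id = s.Id")
    .whenMatchedDelete(condition="s.IsDeleted = true")  # propagate soft deletes
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
# Hard deletes never show up in the changed records, so they still require a
# periodic full comparison (or full refresh) against the source.
```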
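Finally, because Lakeflow Ingestion lands Salesforce objects as Delta tables in Unity Catalog, downstream Databricks streaming pipelines can consume them incrementally, which the zero-copy methods do not support. A brief sketch, assuming an ingestion pipeline already populates a hypothetical bronze.salesforce.opportunity table; the target table and checkpoint path are likewise placeholders.

```python
# Sketch of a downstream streaming read over a table populated by a Lakeflow
# Salesforce ingestion pipeline. Table names and checkpoint location are assumed.
from pyspark.sql import functions as F

opportunities = spark.readStream.table("bronze.salesforce.opportunity")

open_opportunities = (
    opportunities
    .filter(F.col("StageName") != "Closed Won")
    .withColumn("ingested_at", F.current_timestamp())
)

(
    open_opportunities.writeStream
    .option("checkpointLocation", "/Volumes/main/checkpoints/opportunity_silver")
    .trigger(availableNow=True)  # process new data, then stop
    .toTable("silver.salesforce.open_opportunity")
)
```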
If you’re looking for guidance on leveraging Databricks with Salesforce, reach out to Perficient for a discussion with Salesforce and Databricks specialists.