As a prequel to some of my FinOps blogs, I want to discuss the top five mistakes that make your Databricks queries slow. Premature optimization may or may not be the root of all evil, but we can all agree that optimization without a solid foundation is not an effective use of time and resources. Predictive optimization cannot currently address data skew, select the best join strategy (although Photon can), optimize merge operations, or optimize most streaming operations. Databricks is a system with a lot of dials. Let’s look at the top five mistakes I regularly see in practice.
1. Ignoring Data Skew
Mistake: Uneven distribution of data leading to some tasks taking significantly longer than others.
Solution: Monitor stages in the Spark UI to detect straggler tasks, and check join columns for hot values, such as frequent NULLs or a few dominant keys.
- Use Salting to distribute skewed keys (a sketch follows below).
- Use Broadcast Join when one side is small enough to broadcast, which avoids the shuffle that exposes skew.
- Leverage Range Join Hints to optimize inequality joins.
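Here is a minimal salting sketch in PySpark. The table names (fact_df, dim_df), the join key (customer_id), and the salt count are placeholders, so treat this as a pattern rather than a drop-in fix:

```python
from pyspark.sql import functions as F

NUM_SALTS = 16  # placeholder: tune to the skew you observe in the Spark UI

# Append a random salt to the skewed key on the large (fact) side so one
# hot key's rows spread across NUM_SALTS distinct join keys.
fact_salted = fact_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("customer_id"), (F.rand() * NUM_SALTS).cast("int")),
)

# Replicate the small (dimension) side once per salt value so every
# salted key on the fact side finds a match.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
dim_salted = dim_df.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("customer_id"), F.col("salt"))
)

# Join on the salted key; stragglers caused by one hot customer_id
# now split across many tasks.
joined = fact_salted.join(dim_salted, "salted_key")
```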
2. Suboptimal Join Strategies
Mistake: Using expensive join techniques without optimization, especially with large datasets or streaming data.
Solution: Take advantage of tools and techniques built specifically for large or time-ordered data:
- Range Join Optimization: Ideal for joins using inequality conditions (e.g., timestamp ranges); see the hint example below.
- Bloom Filter Indices: Efficiently filter out unnecessary data during joins.
- Materialized Views: Optimize incremental join computation.
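As a sketch of the range join hint in the DataFrame API, where the tables (events, sessions), columns, and bin size are placeholders:

```python
# Assumed tables: events(event_ts, ...) and sessions(start_ts, end_ts, ...).
events = spark.table("events")
sessions = spark.table("sessions")

# The range_join hint tells Databricks to bin the inequality condition
# instead of falling back to a slow nested-loop join. The bin size (60
# here) is in the units of the range columns and should roughly match
# the typical interval length.
joined = events.join(
    sessions.hint("range_join", 60),
    (events.event_ts >= sessions.start_ts) & (events.event_ts < sessions.end_ts),
)
```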
3. Inefficient Streaming Joins
Mistake: Improper handling of stream-stream and stream-static joins, leading to increased state management and latency.
Solution: Set appropriate watermarks to prevent unbounded state growth.
- For Stream-Stream Joins: Specify watermarks for both sides to manage state efficiently (a sketch follows below).
- For Stream-Static Joins: Use Delta Lake for the static side to benefit from stateless joins.
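Here is a minimal stream-stream join sketch with watermarks on both sides; the impressions/clicks tables, column names, and watermark delays are placeholders:

```python
from pyspark.sql import functions as F

# Placeholder streaming sources; rename columns so the join condition
# is unambiguous.
impressions = (
    spark.readStream.table("impressions")
    .selectExpr("ad_id AS imp_ad_id", "impression_ts")
    .withWatermark("impression_ts", "2 hours")
)
clicks = (
    spark.readStream.table("clicks")
    .selectExpr("ad_id AS click_ad_id", "click_ts")
    .withWatermark("click_ts", "3 hours")
)

# Watermarks on both sides plus a bounded event-time condition let Spark
# expire old state instead of buffering every impression indefinitely.
joined = impressions.join(
    clicks,
    F.expr("""
        imp_ad_id = click_ad_id AND
        click_ts BETWEEN impression_ts AND impression_ts + INTERVAL 1 HOUR
    """),
)
```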
4. Inefficient Merge Operations
Mistake: Triggering high shuffle during merge operations by not using low shuffle techniques.
Solution: Use Low Shuffle Merge, preferably by switching over to Delta Live Tables (DLT) or revisiting merge operations built before Databricks Runtime 10.4, where it became the default (see the sketch below).
- Merges only the changed data, reducing I/O and shuffle.
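On Databricks Runtime 10.4 and above, a plain Delta merge picks up Low Shuffle Merge automatically. Here is a sketch with placeholder table names and keys:

```python
from delta.tables import DeltaTable

# Placeholder target and staging tables; replace with your own.
target = DeltaTable.forName(spark, "main.sales.orders")
updates = spark.table("main.sales.orders_staging")

# A standard Delta merge: on DBR 10.4+ the engine rewrites only files
# containing matched rows, skipping the shuffle of untouched data.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```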
5. Ignoring Join Performance Best Practices
Mistake: Using default settings without leveraging advanced features.
Solution: Use Photon to dynamically select the best join type. Unity Catalog helps maintain statistics, but it does not always handle join order effectively, so apply these practices:
- Enable Photon for vectorized execution.
- Optimize Join Order: Always join smaller tables first and avoid cross joins.
- Maintain Fresh Statistics: Use ANALYZE TABLE to help the optimizer make better decisions. Or, better yet, automate it (example below).
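For example, with placeholder table and column names:

```python
# Refresh table- and column-level statistics so the cost-based optimizer
# can estimate sizes and pick a better join order.
spark.sql("""
    ANALYZE TABLE main.sales.orders
    COMPUTE STATISTICS FOR COLUMNS order_id, customer_id
""")
```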
Conclusion
As an Elite Databricks Partner, we are here to help organizations keep costs under control as they get meaningful value from their data and AI assets.
Contact us to explore how we can help build performance and cost optimization tools and techniques into your data and AI pipeline.