Performance, Benchmarks, and Optimization Tips for Databricks Users
Josue had the opportunity to chat with Jeremy Lewallen from Databricks' Performance team about benchmarks, performance optimization tips, and more! Follow the link to our article with video, and see some of the highlights below.
Importance of Benchmarks:
- Benchmarks matter, and Databricks generally performs well in them, but they are not the most important thing.
- Databricks competes against itself, continually seeking to become more performant.
- There have been many improvements across different workloads, including BI, ETL, and exploratory analysis: about 14% faster in just the last 4 months.
- Why does Databricks care about making its product faster? Because you have choices, and Databricks wants to remain the best choice for your data needs.
Storage Cost Optimization Tips
- Storage is not always cheap, so best practices are important.
- Three main best practices:
-- Enable Liquid Clustering.
-- Use Managed Tables with Predictive Optimization.
-- Use the latest Databricks Runtime (DBR).
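As a rough illustration of the first two practices, here is a minimal Python sketch that builds the SQL statements you might run from a notebook. The helper name `storage_optimization_sql` and the table, schema, and column names are hypothetical, and the sketch assumes a Unity Catalog managed Delta table; check the statements against the documentation for your workspace before running them.

```python
def storage_optimization_sql(table: str, schema: str, cluster_cols: list[str]) -> list[str]:
    """Build SQL that turns on Liquid Clustering for a table and
    Predictive Optimization for the schema that contains it.

    Hypothetical helper: it only builds strings and runs nothing.
    """
    if not cluster_cols:
        raise ValueError("at least one clustering column is required")
    cols = ", ".join(cluster_cols)
    return [
        # Liquid Clustering: choose columns that are commonly used in filters.
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # Predictive Optimization applies to managed tables in this schema.
        f"ALTER SCHEMA {schema} ENABLE PREDICTIVE OPTIMIZATION",
    ]


# Example names below are assumptions, not from the interview.
statements = storage_optimization_sql(
    "main.sales.orders", "main.sales", ["order_date", "customer_id"]
)
for stmt in statements:
    print(stmt)  # in a Databricks notebook you could pass each one to spark.sql(stmt)
```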
Impact of Databricks Runtime on Performance
- New DBRs have newer features, so use the latest!
Tips for Rightsizing SQL Serverless Compute:
- If you need more concurrency, increase the cluster count, not the warehouse size.
- Choosing a warehouse size is a balancing act that takes some trial and error.
Editorial note: A nice thing about the "Serverless" offering is that you don't have to worry about this.
Common Compute Mistakes:
- Adjusting the minimum cluster count is unlikely to be worth it. Setting the maximum cluster count is key.
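To make the concurrency advice concrete, here is a small Python sketch that estimates a maximum cluster count from peak concurrent queries while leaving the minimum at 1. The function name `recommend_cluster_range` is hypothetical, and the default of roughly 10 concurrent queries per cluster is an assumed rule of thumb, not a figure from the interview; tune it to what you observe in your own query history.

```python
import math


def recommend_cluster_range(peak_concurrent_queries: int, queries_per_cluster: int = 10) -> dict:
    """Estimate warehouse scaling settings from peak concurrency.

    queries_per_cluster is an assumed rule of thumb, not a fixed limit.
    The warehouse size is intentionally left out: concurrency is handled
    by adding clusters, not by picking a larger size.
    """
    if peak_concurrent_queries < 0:
        raise ValueError("peak_concurrent_queries must be non-negative")
    if queries_per_cluster <= 0:
        raise ValueError("queries_per_cluster must be positive")
    max_clusters = max(1, math.ceil(peak_concurrent_queries / queries_per_cluster))
    return {
        "min_num_clusters": 1,  # left at the minimum, per the tip above
        "max_num_clusters": max_clusters,
    }


# Hypothetical workload: about 45 queries running at once during peak hours.
print(recommend_cluster_range(45))
```

The returned keys mirror the scaling fields you would set on a SQL warehouse; how you apply them (UI, API, or infrastructure code) is up to you.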
Bonus: Josue's Take on the Warehousing Space
- Databricks performance is improving.
- Fast performance is good, but a balance of fast + fantastic developer experience is best, as long as the performance is still very good.