Cost Saving Best Practices For Databricks Workflows
Introduction
Managing pipeline costs effectively is crucial when using Databricks Workflows. This article provides practical tips to help you reduce your total cost of ownership without sacrificing performance, as well as tips to help you better understand where your costs come from. These insights will guide you in optimizing resource usage, ensuring you get the most out of your Databricks environment.
1. Use job compute and spot instances to pay less for the same performance as all-purpose clusters
Job clusters are significantly less expensive than all-purpose clusters, consuming fewer DBUs, which is the portion of the compute cost paid to Databricks. Couple that with using spot instances when you build a fault-tolerant pipeline, and your costs can sometimes be nearly cut in half. For example, the all-purpose DS5 v2 costs $4.47 per hour, but if you use job compute, that drops to $3.42. Furthermore, if you add spot instances, that brings the cost down to $2.367, a nearly 50% discount.
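To illustrate, here is a minimal sketch of a Jobs API 2.1 payload that runs a task on ephemeral job compute backed by Azure spot instances (with fallback to on-demand). The workspace URL, token, notebook path, runtime version, and cluster sizing are placeholders you would swap for your own.

```python
import os
import requests

# Minimal sketch: create a job whose task runs on ephemeral job compute
# backed by Azure spot instances (falling back to on-demand if spot capacity
# is unavailable). Host, token, notebook path, and sizing are placeholders.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},
            "new_cluster": {                      # job compute, not all-purpose
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS5_v2",
                "num_workers": 4,
                "azure_attributes": {
                    "availability": "SPOT_WITH_FALLBACK_AZURE",  # prefer spot VMs
                    "first_on_demand": 1,         # keep the driver on on-demand
                },
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}
```

Because the cluster is created for the run and terminated when it finishes, you only pay job-compute DBU rates for the time the pipeline actually needs.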
2. Leverage “Warnings” to alert you of pipelines taking longer than expected
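Workflows let you define an expected duration for a job and send a notification when a run exceeds it, without cancelling the run. Here is a minimal sketch of the job-level fields involved in a Jobs API 2.1 payload; the threshold and email address are placeholders.

```python
# Sketch of the job-level fields that drive duration warnings (Jobs API 2.1).
# The threshold and email address below are placeholders.
warning_settings = {
    "health": {
        "rules": [
            {
                "metric": "RUN_DURATION_SECONDS",  # warn when a run exceeds this duration
                "op": "GREATER_THAN",
                "value": 3600,                     # expected runtime: 1 hour
            }
        ]
    },
    "email_notifications": {
        "on_duration_warning_threshold_exceeded": ["data-team@example.com"]
    },
}
# Merge these keys into the job spec you send to /api/2.1/jobs/create
# (or /api/2.1/jobs/update for an existing job).
```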
3. Safeguard against worst-case scenarios with “Timeouts”
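Where warnings tell you a run is slow, timeouts make sure a runaway run cannot burn compute indefinitely. A minimal sketch of the timeout fields, which can be set at the job level and per task (the values are placeholders):

```python
# Sketch: hard stop for runaway runs. timeout_seconds can be set at the job
# level and per task; the values below are placeholders.
timeout_settings = {
    "timeout_seconds": 14400,  # cancel the whole run after 4 hours
    "tasks": [
        {
            "task_key": "etl",
            "timeout_seconds": 7200,  # cancel just this task after 2 hours
            # ... notebook_task / new_cluster as in the earlier sketch
        }
    ],
}
```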
4. Leverage task dependencies to orchestrate code execution efficiently
Utilize task dependencies in Databricks to streamline your workflows by ensuring tasks are executed in the correct order. By setting up dependencies, you can prevent downstream tasks from starting until all necessary upstream tasks have completed successfully. This approach reduces errors and unnecessary compute usage by avoiding execution of tasks whose prerequisites have not been met.
To implement this, map out your workflow to identify dependent tasks, and configure your job orchestrations accordingly. This will help you maintain a smooth, efficient execution sequence, minimizing bottlenecks and optimizing resource allocation throughout your data pipelines.
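As a rough sketch, dependencies are expressed with depends_on in the task definitions of a Jobs API 2.1 payload; the task keys and notebook paths below are placeholders.

```python
# Sketch: three chained tasks. "transform" waits for "ingest" to succeed,
# and "publish" waits for "transform". Keys and paths are placeholders.
tasks = [
    {
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
    },
    {
        "task_key": "transform",
        "depends_on": [{"task_key": "ingest"}],    # runs only after ingest succeeds
        "notebook_task": {"notebook_path": "/Repos/etl/transform"},
    },
    {
        "task_key": "publish",
        "depends_on": [{"task_key": "transform"}],
        "notebook_task": {"notebook_path": "/Repos/etl/publish"},
    },
]
```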
5. Use workflows to run code that takes a long time to execute, even if you don’t run it on a schedule
Running one-off tasks that you expect to take several hours or days inside a workflow means you can leverage job compute to make that work more affordable.
Additionally, if you are doing pipeline development, executing your code inside workflows is a great way to A/B test your pipeline’s execution time and configuration.
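For a one-off piece of work you don't even need to create a persistent job: a run can be submitted directly to job compute. A minimal sketch using the one-time run endpoint, with placeholder names, paths, and cluster sizing:

```python
import os
import requests

# Sketch: submit a one-off run on job compute without creating a scheduled job.
# Host, token, notebook path, and cluster sizing are placeholders.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

one_off_run = {
    "run_name": "backfill-historical-data",
    "tasks": [
        {
            "task_key": "backfill",
            "notebook_task": {"notebook_path": "/Repos/etl/backfill"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS5_v2",
                "num_workers": 8,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=one_off_run,
)
resp.raise_for_status()
print(resp.json())  # {"run_id": ...}
```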
6. Autoscaling your job compute helps deliver the right power, at the right time
Implementing autoscaling lets your compute resources be rightsized in near real time to match workload demands, minimizing spend on unused capacity. This dynamic adjustment keeps costs down while maintaining performance as the demands of the code your workflow is executing change.
One key piece of advice: test different cluster configurations, including the minimum and maximum number of workers, to find the optimal performance-to-cost balance.
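As a starting point, here is a sketch of a job cluster definition that uses an autoscale range instead of a fixed worker count; the bounds are placeholders to tune per pipeline.

```python
# Sketch: replace a fixed num_workers with an autoscale range so the cluster
# grows and shrinks with the workload. The bounds are placeholders.
autoscaling_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_DS5_v2",
    "autoscale": {
        "min_workers": 2,   # floor you pay for even during quiet stages
        "max_workers": 8,   # cap that bounds worst-case cost
    },
}
```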
7. Understand your costs better by implementing tags
Incorporate tagging to achieve clearer insight into your Databricks workflow expenses. Tags allow you to assign metadata to jobs, clusters, and other resources, which simplifies the tracking of costs by department, project, etc. By understanding where and how your resources are being consumed, you can make informed decisions about budget allocation and cost optimization.
Start by defining a consistent tagging strategy across all resources to ensure that every element of your expenditure is accurately monitored. This practice not only aids in cost management but also enhances reporting and accountability within your team.
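As a sketch of what that can look like, the Jobs API lets you tag both the job itself and the job cluster it spins up; the keys and values below are just an example convention, not required names.

```python
# Sketch: tag the job and its job cluster so costs roll up cleanly in
# billing and usage reports. Keys/values are an example tagging convention.
tagged_job = {
    "name": "nightly-etl",
    "tags": {"team": "data-eng", "project": "customer-360", "env": "prod"},
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS5_v2",
                "num_workers": 4,
                "custom_tags": {"team": "data-eng", "project": "customer-360"},
            },
        }
    ],
}
```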
Closing
I hope you found this guide to lowering your Databricks costs using Workflows helpful. Looking ahead, Databricks’ introduction of “Serverless” compute will add new dynamics to Workflows and other areas, and we will be covering those very soon!