Hadoop to Databricks Lakehouse Migration Approach and Guide
Big Data and Hadoop: The end of Big Data as we know it?
Our goal with this blog is to lay the foundation: how and why Hadoop data processing platforms were (and still are) used, the challenges they present, the big promise that was never realized, and why customers should consider Databricks.
The questions we aim to answer in this journey include: Why should one consider migrating from Hadoop to Databricks?
What motivations have driven other companies to make this change?
What benefits does this transition bring to the company?
What is the architecture and vision of Databricks in this data ecosystem?
As we continue this exploration, we will dive deeper into the technical details in future posts - providing more targeted insights and guidance for a successful transition.
From Spark to the rise of Databricks
Databricks is known as the creator of Apache Spark, Delta Lake, and MLflow, the first of which is particularly significant within the Hadoop ecosystem. Today, Databricks has more than 10,000 customers and is a cloud-native solution hosted on all three major clouds: AWS, GCP, and Azure. Because of Databricks' deep ties to Apache Spark, it is a natural migration target for Hadoop workloads (Spark and Hive workloads), with some gotchas along the way!
Although Hadoop was revolutionary in parallel data processing and marked a turning point in the industry, it wasn't without its flaws. From its inception, Hadoop kept adding new components to its ecosystem, making it very technical and challenging to manage and to get projects into production. Hadoop's self-managed, on-premises deployment model was also expensive and hard to scale compared to what the cloud offers today. There is some irony in the name Cloudera (did you say the cloud?)!
The architectural challenges, operational costs (infrastructure and labor), and DevOps costs (upgrades, new capabilities that had to be installed and proven out in POCs) are the primary drivers for customers considering a modern, cloud-based platform to migrate to from Hadoop. While there are many options for customers looking to move away from Hadoop, such as Snowflake, Amazon Redshift, Azure ADLS + Synapse, or Databricks, we will make the case in this blog that Databricks is the most natural migration path, due to compatibility, cost of migration, and the breadth of use cases and personas it supports.
The vision and goal of Databricks was to offer customers a highly elastic cloud data processing platform on all three major clouds (AWS, GCP, Azure) that provided all the benefits of Hadoop (scale, Spark, and Hive) without the pain (infrastructure, costs, DevOps), while supporting multiple workloads (SQL, data science, GenAI) and personas, and driving business outcomes through insights.
Pains of an On-Premises Architecture
The cost of maintaining and operating Hadoop, for example, with Hortonworks or Cloudera, can be very high and burdensome in terms of licenses and infrastructure.
On-premises infrastructure is limited and rigid in its scalability, making it difficult to respond to changes in demand (for example, during the COVID pandemic, demand on data systems increased significantly and many companies struggled to adapt quickly). Conversely, during periods of reduced demand, excess infrastructure running in data centers sits idle and is paid for unnecessarily.
The last pain point is a consequence of the above. When a company is no longer tied to legacy licenses or providers, it gains far more flexibility for ML and predictive work. Connecting incoming data (batch or streaming) to first-class data science tools, having access to the right libraries and processing environments, and being able to train large-scale models are what enable a company to leverage predictive analytics. This is the reality behind why companies pursue this change.
Hadoop is Expensive, Complex, and Ineffective
As we embark on a comprehensive examination, it becomes evident that Hadoop, despite having represented a significant shift in the data processing industry, faces formidable challenges that extend beyond its capabilities. The complexities, cost implications, and inefficiencies associated with Hadoop raise critical concerns, particularly in today's dynamic technological landscape.
DevOps Intensive: Managing Hadoop is costly, requiring significant DevOps or administration resources. This becomes evident when a Hadoop cluster goes down, forcing the entire engineering team to halt normal tasks and dedicate a day or even a week to fix the issue and keep it operational. This incurs costs and results in reduced productivity.
Rigid and Inelastic: Clusters must be sized for peak usage, leaving capacity unused most of the time. Scaling on-premises infrastructure takes time and money, so it is impractical to add capacity during periods of high demand and wasteful to keep paying for idle capacity during periods of low demand; keeping a cluster of the right size at the right moment becomes prohibitively expensive.
Lack of AI/ML Capabilities: The absence of native, integrated support for machine learning in the Hadoop ecosystem means having to integrate various tools separately. These additional layers on top of Hadoop, not being seamlessly integrated, slow data scientists down. Databricks aims to reduce friction between teams (engineers, scientists, and analysts) and the friction that comes from stitching together many disparate services (common in the Hadoop ecosystem).
Databricks Vision and Architecture
Let's revisit the earlier discussion: Databricks' goal was to unify multiple types of use cases (EDW, data science) and personas (data engineers, business users, data scientists), and to support all types of data files and formats through a cloud offering. The Lakehouse continues to evolve, adding more capabilities for governance, observability, and advanced GenAI that are just a click away for end users, without the complexity of managing the infrastructure.
The Databricks Lakehouse:
The first layer in the Databricks Lakehouse is the file layer, which could be ADLS or S3, supporting various file formats such as Parquet, images, or audio files. The data lake allows us to store all types of data, whether batch, streaming, semi-structured, or unstructured. The idea is that we shouldn't have to decide whether to store the data, what its value is, or what its structure should be at the moment we write it to the data lake. We have the flexibility to perform completely raw data ingestion.
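To make the raw-ingestion idea concrete, here is a minimal PySpark sketch; the bucket and path names are hypothetical, and on Databricks the `spark` session is already provided for you.

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; this line only matters when running elsewhere.
spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

# Read semi-structured JSON events exactly as they land, without imposing a schema up front.
raw_events = spark.read.json("s3://example-bucket/landing/events/")

# Persist them unchanged into the data lake; structure and value are decided later.
raw_events.write.mode("append").parquet("s3://example-bucket/raw/events/")
```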
Next, the second layer is a transactional layer called Delta Lake. Based on our learnings from the last 20 years of data warehousing (both on-premises and in the cloud), we understand the need for a transactional layer on which to build and process this data. Data warehouses excel at storing data in a very specific structure, collecting statistics, and maintaining indexes, essential elements for reliable tables and good performance. The Delta Lake format was created to bring this transactional layer to the data lake.
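Continuing the previous sketch, the example below shows roughly what the transactional layer adds: the same events are saved as a Delta table and late-arriving corrections are upserted atomically with MERGE, something that is awkward to do safely on plain Parquet. The table and column names (`bronze_events`, `event_id`) are made up for illustration.

```python
from delta.tables import DeltaTable

# Write the raw events as a managed Delta table; the transaction log provides ACID guarantees.
raw_events.write.format("delta").mode("overwrite").saveAsTable("bronze_events")

# A DataFrame of corrected or late-arriving events (illustrative path).
updates = spark.read.json("s3://example-bucket/landing/events_corrections/")

# Upsert the corrections in a single atomic transaction.
bronze = DeltaTable.forName(spark, "bronze_events")
(bronze.alias("t")
    .merge(updates.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```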
Finally, there is the Delta Engine, which optimizes the performance of Spark SQL, Databricks SQL, and DataFrame operations by shifting computation to the data. This enables use cases such as data streaming, Business Intelligence (BI), exploratory analysis, and analytical research. The processing is far more optimized than working directly on file formats like Parquet, CSV, or Avro.
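A short, hypothetical query against the Delta table from the previous sketch shows the developer experience; on Databricks the optimized engine accelerates it transparently, with no code changes. The `event_time` column is assumed to exist in the events for this example.

```python
# Aggregate events per day for a BI-style dashboard; the engine optimizes the plan automatically.
daily_counts = spark.sql("""
    SELECT date(event_time) AS event_date,
           count(*)          AS events
    FROM bronze_events
    GROUP BY date(event_time)
    ORDER BY event_date
""")
daily_counts.show()
```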
Key benefits of a modern data platform with Databricks:
Easy to manage: Improved productivity. The use of the Delta format in Databricks provides a user-friendly approach that enhances productivity. It lets us use cost-effective storage layers such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage as our data lake. This not only keeps storage costs low but also offers the flexibility to handle various data types efficiently.
Massive scaling: Lower cost at scale. Databricks, leveraging Spark and the Delta data format, facilitates distributed processing, leading to lower costs at scale. The Spark engine, adept at handling substantial data volumes, allows for elastic and optimized processing, distributed across multiple nodes. This scalability ensures efficient operations even when dealing with extensive datasets.
AI-Enabled innovation: New insights faster. The Data Science Workspace in Databricks serves as a focal point for scientists and data engineers, promoting AI-enabled innovation. Teams can collaborate seamlessly using web applications like notebooks, Spark API, or SQL, without the need for manual configuration of environments on local machines or the setup of dependencies/libraries on clusters. This managed and collaborative environment accelerates the generation of new insights.
Data Democratization: In addition, it's essential to underscore Databricks' pivotal role in democratizing data. Unity Catalog provides an easy data discovery (semantic search) interface that allows non-technical users to search any data in the Databricks platform and understand its business context and lineage. Data can also be queried easily using SQL, and soon in natural language, which is converted to SQL queries for data retrieval.
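As a hypothetical illustration of how this looks in practice, Unity Catalog exposes tables through a three-level namespace (catalog.schema.table), so an analyst can query governed data directly with SQL; the catalog, schema, table, and column names below are made up.

```python
# Query a governed table through Unity Catalog's three-level namespace.
spark.sql("""
    SELECT region, sum(amount) AS total_sales
    FROM main.sales.orders
    WHERE order_date >= '2024-01-01'
    GROUP BY region
""").show()
```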
With the recent acquisition of MosaicML for generative AI technology, combined with the comprehensive data in the Databricks Lakehouse, customers should have the "Data Intelligence Platform" they need to derive business insights and a solid foundation for AI use cases that help them innovate and compete in the market.
Databricks continues to acquire, innovate and bring new capabilities to market for the entire data management life-cycle, which should give customers the confidence they need for the future.
Summary and next blog:
We will go into the depth and detail required to cover the migration process in our next few blogs.
This first blog aimed to provide a comprehensive introduction to the concepts that will be explored further in the upcoming phases: the motivations for change, a partial understanding of why we find ourselves where we are, the platform's characteristics, and, most importantly, the groundwork for the posts to come.
SUNNYDATA Team