Migrating Hadoop to Databricks - a deeper dive

Introduction

Migrating from a large Hadoop environment to Databricks is a large, complex project. In this blog we dive into the different areas of the migration process and the challenges customers should plan for in each: Administration, Data Migration, Data Processing, Security and Governance, and Data Consumption (tools and processes).


Technology Mapping from Hadoop to the Databricks Lakehouse platform

The Hadoop platform is a collection of technologies spanning several categories: batch processing, columnar storage, real-time data capture, governance and security, and SQL query processing, all of which share the same cluster nodes for storage and compute.

The image below maps various Hadoop technologies to their Databricks equivalents (which offer better performance and functionality), ordered from simplest to most complex. Some of these technologies (like Spark processing) require little change or lift, but others require a migration or an alternate approach.

Spark Workloads:
Spark workloads will operate in a Databricks environment with little to no change, and there is no need to configure Spark yourself. They typically run considerably faster and use fewer resources, thanks to Databricks' optimized Spark runtime. Still, the Spark version in the Hadoop platform should be compared against the latest Databricks runtime and reviewed for any behavioral changes.

Hive/Impala/Pig:
Hadoop customers use one or a combination of ETL and query tools within Hadoop, such as Hive, Impala, and Pig.

Hive queries have a high degree of compatibility with the Databricks execution engine. However, changes to DDL, complex Hive-dependent logic, and user-defined functions will more than likely be required. With the migration, the common benefits of Databricks are inherited; in addition, Delta Lake provides ACID transactions and schema evolution.
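As a hedged illustration of what this looks like in practice, the sketch below re-creates a Hive-managed table as a Delta table and runs an existing Hive-style query through spark.sql. The schema, database, and table names are hypothetical placeholders, not a prescription.

```python
# A minimal sketch of moving a Hive table and query to Delta Lake on Databricks.
# All database/table names and columns are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

spark.sql("CREATE SCHEMA IF NOT EXISTS sales")

# Re-create the Hive DDL as a Delta table (schema shown is illustrative)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_delta (
        order_id BIGINT,
        customer_id BIGINT,
        order_ts TIMESTAMP,
        amount DECIMAL(10, 2)
    )
    USING DELTA
""")

# Existing Hive queries usually run as-is through spark.sql; only Hive-specific
# syntax, SerDes, and UDFs tend to need rework.
customer_totals = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales.orders_delta
    GROUP BY customer_id
""")
customer_totals.show()
```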

Real-Time Event Processing (Spark Streaming) → Databricks Structured Streaming:
Technically, migrating from Spark Streaming to Databricks Structured Streaming is a cleaner process because they share the same underlying engine. It will still require adjustments: compatibility between Spark versions, UDFs, and libraries must be validated. The migration to Structured Streaming provides seamless integration with other Databricks services, such as Delta Lake and collaborative notebooks, and speeds up event processing.
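A minimal Structured Streaming sketch is shown below, assuming a Kafka source and a Delta sink; the broker address, topic name, and storage paths are hypothetical and would need to match your environment.

```python
# A hedged Structured Streaming sketch: Kafka source -> Delta sink.
# Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/clickstream")  # required for exactly-once recovery
    .outputMode("append")
    .start("/mnt/delta/clickstream")
)
```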

Batch Processing (MapReduce) → Spark Jobs:
MapReduce processes will need to be converted to Spark ETL. This requires analyzing each process and defining an approach (keeping processes independent for flexibility, or integrating them into a workflow for greater efficiency), and then converting or rewriting the logic. The benefits include improvements in performance, scalability, and functionality.
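For intuition, here is a hedged sketch of how a classic MapReduce-style aggregation (counts per key) collapses into a single declarative Spark DataFrame pipeline; the input path and column names are illustrative assumptions.

```python
# A hedged sketch of rewriting a MapReduce aggregation as a Spark DataFrame job.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/mnt/raw/events")                      # was: HDFS input read by mappers
counts = events.groupBy("event_type").agg(count("*").alias("n"))    # was: shuffle + reduce phase
counts.write.format("delta").mode("overwrite").save("/mnt/delta/event_counts")
```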

If other commercial ETL technologies (such as Informatica PowerCenter, Ab Initio, or Talend) are used for extract and transform (ETL), or if they push transformations down to MapReduce jobs in Hadoop (ELT), it may be worthwhile to convert the tool-native logic to Spark ETL using one of the various conversion utilities on the market, depending on the cost-benefit analysis.


So… What should customers plan for their migration efforts from Hadoop to Databricks?

The bulk of the work lies in migrating the different types of workloads in the Hadoop clusters that support various end-use cases (datamart/EDW-style reporting, real-time applications, data science use cases).


  • For Spark migrations, attention should be given to the different Spark versions (Hadoop vs. Databricks). If RDD code exists, it needs to be migrated to DataFrames, and jobs submitted via spark-submit may need changes; a short sketch of the RDD-to-DataFrame conversion appears after this list.

  • The biggest time and cost in a Hadoop → Databricks migration project will be the non-Spark workload migrations. The time and effort (which translates to cost) depend on the number of data pipelines, the complexity of the jobs, and the number of legacy MapReduce jobs those pipelines run in the Hadoop clusters, all of which need to be converted to Spark jobs.

    • Sqoop could be replaced by either open-source CDC technology like Debezium (sponsored by Red Hat) or commercial products like Fivetran or Matillion.

    • Apache NiFi could be replaced by ADF (in an Azure deployment).

    • Commercial ETL tools like Informatica PowerCenter, Talend or Ab Initio could be replaced by native data engineering pipelines in Databricks or modern graphical data pipeline platforms like Prophecy, which natively execute Spark jobs on Databricks. 
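
As referenced in the first bullet above, here is a hedged sketch of migrating RDD-style code to the DataFrame API; the file path, column names, and schema are illustrative assumptions.

```python
# A hedged sketch of converting legacy RDD code to the DataFrame API.
# Paths, columns, and schema are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum

spark = SparkSession.builder.getOrCreate()

# Legacy RDD version: parse CSV lines and sum amounts per customer
rdd_totals = (
    spark.sparkContext.textFile("/mnt/raw/orders.csv")
    .map(lambda line: line.split(","))
    .map(lambda parts: (parts[0], float(parts[2])))
    .reduceByKey(lambda a, b: a + b)
)

# DataFrame equivalent: benefits from Catalyst optimization on Databricks
df_totals = (
    spark.read.option("header", "false").csv("/mnt/raw/orders.csv")
    .toDF("customer_id", "order_id", "amount")
    .withColumn("amount", col("amount").cast("double"))
    .groupBy("customer_id")
    .agg(spark_sum("amount").alias("total_amount"))
)
```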




What does the migration process look like?

Now, let's explore the different areas of the migration process to Databricks: Administration, Data Migration, Data Processing, Security and Governance, and Data Consumption (tools and processes).

It is crucial to note that there will most likely be many mission-critical production workflows, so any end-user or business process disruption must be avoided. Extensive project planning and careful sequencing of events in the migration process are required to minimize risk. The typical recommendation is to start by creating an inventory of pipelines, data types, expected run times, and Hadoop components in use, and then plan the transition so that business processes are not disrupted and the wheel is not reinvented.

1) Administration

Now, let's delve into the administrative components of both platforms that focus on the management of clusters, access control, metadata, and information consumption.

Hadoop: A Network of Interconnected Nodes

Hadoop's architecture is characterized by a distributed computing paradigm, where a network of interconnected nodes collaborates to process and store vast amounts of data. Each node possesses dedicated resources, including disks, cores, and memory, which are shared among running jobs. Jobs on these nodes compete for those resources, necessitating a restrictive service architecture to ensure resource availability.

Databricks: Elasticity and Purpose-Built Clusters

In contrast, Databricks takes a more flexible approach to resource allocation. By leveraging cloud infrastructure, Databricks dynamically adjusts resource allocation based on service demands. This elastic scaling significantly enhances cost-efficiency and allows for the creation of purpose-built clusters, such as ML, ETL, or SQL clusters, that come into existence only when needed and dissolve when no longer in use.
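To make the idea of purpose-built, elastic clusters concrete, the sketch below shows the kind of autoscaling job-cluster definition accepted by the Databricks Clusters/Jobs APIs; the runtime version, node type, and sizes are illustrative assumptions, not recommendations.

```python
# A hedged sketch of an autoscaling, purpose-built ETL cluster definition,
# expressed as the JSON-style payload used by the Databricks Clusters/Jobs APIs.
# Runtime version, node type, and sizes are placeholders.
etl_cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",          # Azure example; instance types differ per cloud
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "autotermination_minutes": 30,               # cluster dissolves when no longer in use
}
```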

Data Storage: Separating Storage and Compute

Unlike Hadoop, Databricks does not provide an operational NoSQL database service like HBase. Data resides in files within cloud object storage, which decouples storage from compute. While HBase is a prominent option within the Hadoop ecosystem, several alternative cloud NoSQL technologies are available, such as Amazon DynamoDB and Microsoft Azure Cosmos DB.

Cluster Isolation for Strict SLA Compliance

Each Databricks cluster is completely isolated, with its nodes corresponding to a Spark driver or workers. This segregation allows strict compliance with Service Level Agreements (SLAs) for specific projects.

User Interaction with Data: Familiar Pathways

Databricks offers similar capabilities to Hadoop in how users interact with data. Data stored in the cloud can be accessed through various paths, such as SQL Endpoints, Databricks SQL, and Databricks notebooks for data engineering and machine learning.

Metadata Management: A Managed Hive Metastore

Databricks includes a managed Hive metastore by default, storing structured information about data assets in cloud storage. Migrating Hive metadata to a managed Hive metastore is a critical step in the migration process from Hadoop to Databricks. This ensures data consistency, simplifies metadata management, and unlocks the full potential of Databricks' capabilities. Additionally, Databricks supports the use of an external metastore for enhanced flexibility. 

Access Control: Securing Data at Multiple Levels

Thorough access control measures are paramount for protecting sensitive data. Databricks offers robust security controls, including table ACLs, object storage permissions, and user authentication mechanisms. Together, table-level and storage-layer access control lists (ACLs) provide granular access control and help meet regulatory requirements.
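As a small, hedged example of table-level access control, the statements below grant read and write privileges on a table to two groups; the table and group names are hypothetical placeholders.

```python
# A minimal sketch of table-level access control via SQL GRANT statements.
# `spark` is the SparkSession available in Databricks notebooks; the table and
# group names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

grants = [
    "GRANT SELECT ON TABLE sales.orders_delta TO `analysts`",
    "GRANT MODIFY ON TABLE sales.orders_delta TO `data_engineers`",
]
for statement in grants:
    spark.sql(statement)
```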

Configuring the SQL Layer for Use Case Development

Databricks SQL provides a user-friendly interface for querying and analyzing data. To fully utilize this capability, it is essential to configure an SQL layer that aligns with the specific use cases of the organization. This involves defining data schema and access permissions, as well as ensuring compatibility with existing data pipelines and applications.

Infrastructure as Code

Databricks makes it very easy to deploy or change infrastructure using Infrastructure as Code (IaC) with Terraform.

Using an Infrastructure as Code tool like Terraform has many benefits. It allows for automated deployment, easier infrastructure definition and management, easier code understanding (thanks to its declarative language), and collaborative development. It also speeds up replication across multiple environments (development, production, etc.), and enables version control and even reuse of code from the community. The vast majority of Databricks resources, as well as their scaling, can be managed efficiently from Terraform while retaining all of the above.

And why Terraform? There are several alternatives on the market, such as AWS CloudFormation, but Terraform is known for its multi-cloud compatibility. If at some point you want to change cloud providers (for example, from Azure to AWS), the same Terraform workflow and much of the code can be reused (with provider-specific adjustments).

2) Data Migration

There are two aspects to consider in data migration: first, the existing data we are going to migrate, and second, how we will capture new data in Databricks. It is recommended that historical data be ingested at the file layer into cloud storage services (such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage).

In addition, a dual-ingestion approach is suggested. Instead of modifying the current process that writes data to HDFS, it is advisable to add a secondary process that writes to cloud storage. This provides flexibility for implementing cloud use cases while each copy of the data serves as a backup for the other. It is reasonable for both datasets to coexist for a period, since a hard cutover is not feasible due to dependent processes.

How to Migrate Data to Spark in the Cloud

There are basically two ways to accomplish this:

  • Push Model: Direct copy to the lake

The push model involves copying data directly from the source to the cloud-based data lake. This method is suitable for smaller data volumes, typically in the range of a few terabytes. Hadoop DistCp, a distributed copy tool, is commonly used to orchestrate this process; it transfers data blocks in parallel, enabling an efficient migration.

For larger data volumes, ranging from hundreds of terabytes to petabytes, cloud-based solutions like AWS Snowmobile or Azure Data Box offer a more scalable and efficient approach. These solutions physically transport data storage devices to the cloud, enabling the transfer of massive amounts of data at once.

  • Pull Model: The lake requests the data.

In contrast to the push model, the pull model relies on the lake to actively request data from the source. By leveraging Spark's distributed processing capabilities, data can be pulled from the source in real time or at scheduled intervals. This method is particularly well suited for continuously updated data streams.

To facilitate the pull model, there must be a reliable connection between the data lake and the source. Spark's built-in APIs, such as spark.read.jdbc, enable seamless data retrieval from various data sources, including relational databases, NoSQL stores, and cloud-based data services.
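Here is a hedged pull-model sketch built around the JDBC source mentioned above; the JDBC URL, credentials, partitioning bounds, and target path are hypothetical placeholders.

```python
# A hedged pull-model sketch: read a source table over JDBC, land it as Delta.
# URL, credentials, bounds, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/warehouse")
    .option("dbtable", "public.customers")
    .option("user", "reader")
    .option("password", "<secret>")       # in practice, retrieve from a Databricks secret scope
    .option("numPartitions", 8)           # parallel reads against the source
    .option("partitionColumn", "customer_id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .load()
)

customers.write.format("delta").mode("overwrite").save("/mnt/delta/customers")
```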

External Hive Metastore:

In addition to data migration, we cannot forget about metadata, which is often considered more important than the data itself (according to some consultants). In most cases, it is necessary to implement a coexistence period during the migration process, in which the data catalog remains accessible both on-premises and in the cloud. This allows for seamless operation of the Parquet tables from both the on-premises cluster and the cloud. 

To facilitate this coexistence, an external metastore, configurable within Databricks, is employed. This is typically the way to go for most migrations, since it is unlikely that all Hadoop ETL processes can be migrated at once (migration should be done in phases, with as little impact on operations as possible). We have yet to see a project with a significant Hadoop legacy where all the ETLs could be migrated in one go.
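For reference, the sketch below lists the kind of cluster-level Spark configuration typically used to point Databricks at an external Hive metastore. The metastore version, JDBC URL, driver, and credentials are hypothetical; in practice these are set in the cluster's Spark config (ideally with secrets pulled from a secret scope) rather than at runtime.

```python
# A hedged sketch of external Hive metastore settings for a Databricks cluster.
# All values are placeholders and must match your metastore deployment.
external_metastore_conf = {
    "spark.sql.hive.metastore.version": "2.3.9",
    "spark.sql.hive.metastore.jars": "builtin",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-host:3306/metastore_db",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "hive_user",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "<secret>",
}
```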


Conclusion and What to Expect in Part 3

In Part 3 of the blog, we will dive deeper into the remaining areas: Data Processing, Security and Governance, and Data Consumption (tools and processes). Our intention is to create practical use cases that bring this knowledge to life and assist customers in their migration projects.
