Hadoop to Databricks: A Guide to Data Processing, Governance and Applications

Introduction

In the intricate landscape of migration planning, it is imperative to map processes and prioritize them according to their criticality. This implies a strategic exercise to determine the sequence in which processes should be migrated according to business priorities. In addition, organizations will have to decide whether to follow a "lift and shift" approach or a "refactor" approach. The good news is that this guide will help you choose the option that best fits each scenario.

All programming languages and frameworks associated with Hadoop, including MapReduce, Pig, HiveQL, and even Java, can be adapted for execution on Spark. This adaptation can be achieved through interfaces such as PySpark, Scala, Spark SQL, or even R. Zeppelin and Jupyter notebooks can likewise be converted into Databricks notebooks.

1.1 Migrating Spark Workloads

When migrating Spark jobs, paying special attention to versions is crucial. It is common for the Spark version in the Hadoop cluster to be outdated. Migrating to a 3.x version is strongly recommended for the infrastructure savings that come from running workloads more efficiently, although it may entail minor adjustments to the code of some processes. Also note that some older versions of Spark are not included in the Databricks runtime environments.

  • If there is existing code utilizing RDDs, it is recommended to transition to DataFrames. RDDs were prevalent in Spark versions up to 2.x, and although they remain compatible with Spark 3.x, continuing to use them means not harnessing the complete capabilities of the Spark optimizer, which can lead to increased costs and suboptimal performance (see the sketch after this list).

  • Adaptive Query Execution (AQE) is specifically designed to work with DataFrames in Apache Spark 3.x. In Databricks' TPC-DS benchmarks, AQE sped up individual queries by as much as 8x, and 32 queries ran more than 1.1x faster.

  • When migrating Spark jobs, the method of submitting these jobs via spark-submit may need adjustments. This command is used to launch applications on a Spark cluster. Changes might include modifying parameters, configurations, or other details related to how the jobs are submitted and executed.

  • Hard-coded references to the local Hadoop environment: During migration, references tied to the old environment, such as HDFS file paths or configuration values, must be updated to match the new environment. Failure to update these references can lead to errors or breakages when the code runs in the new setup.
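
As a reference point for the first two items above, here is a minimal PySpark sketch of moving an RDD-based aggregation to the DataFrame API, with the Spark 3.x AQE flag set explicitly. The input path and column names are illustrative assumptions rather than code from any real workload.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical input location; replace with your own cloud storage path.
    INPUT_PATH = "dbfs:/tmp/example/events.csv"

    spark = (
        SparkSession.builder
        .appName("rdd-to-dataframe-example")
        # AQE is on by default in recent Spark 3.x / Databricks runtimes;
        # shown explicitly here for clarity.
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate()
    )

    # Legacy RDD style: manual parsing and aggregation, opaque to the optimizer.
    rdd_counts = (
        spark.sparkContext.textFile(INPUT_PATH)
        .map(lambda line: line.split(","))
        .map(lambda fields: (fields[0], 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # Equivalent DataFrame style: declarative, so Catalyst and AQE can optimize it.
    df_counts = (
        spark.read.csv(INPUT_PATH)
        .groupBy("_c0")                     # first CSV column, named _c0 by default
        .agg(F.count("*").alias("events"))
    )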

1.2 Migrating Non-Spark Workloads

Non-Spark workloads typically require rewriting the code. 

  • Legacy MapReduce: The goal is to transform these jobs into Spark ETL processes. When transitioning from MapReduce to Spark, if shared logic is implemented in Java libraries, Spark can in some instances leverage the existing code. However, there might still be a need to rewrite certain parts of it to make it compatible and to optimize its execution in the Spark environment.

  • Apache Sqoop: Migrating Sqoop is a streamlined process, as it boils down to a handful of Spark commands. Using a JDBC source, you can seamlessly carry over parameters from Sqoop to Spark code, since the familiar parameter specifications in Spark closely mirror those in Sqoop (a PySpark sketch follows this list). You might also consider replacing Sqoop with a CDC solution such as Red Hat Debezium, which is open source and streams changes through a message queue for landing data in Azure Blob Storage, AWS S3, or Google Cloud Storage.

  • Apache HiveQL: Migrating from HiveQL to Spark SQL is relatively straightforward due to the high compatibility between the two. The majority of queries can typically run on Spark SQL without significant modifications, although there are some minor differences in DDL syntax. We recommend updating the code to adhere to the Spark SQL format, as it enhances the optimizer's ability to prepare the most efficient execution plan, particularly when working with Databricks. Notably, during the migration process you can still make use of Hive SerDes (Serializer/Deserializer) and UDFs, which provides additional convenience when transitioning from HiveQL to Databricks.

  • Apache Flume: This should be analyzed on a case-by-case basis. Most Flume applications consume Kafka data and write it to HDFS, a task that can be handled with Spark Structured Streaming (or Azure Data Factory or similar); see the streaming sketch below. The shift involves translating the Flume configuration into Spark code: in Flume, data flows and processing steps are typically defined in configuration files, whereas Spark favors a programmatic approach in which that logic is expressed in code.

  • Apache NiFi: The vast majority of users replace Apache NiFi with a cloud-based data integration service such as, again, Azure Data Factory or its equivalents on AWS and GCP.

  • Apache Oozie: For data orchestration, it is ideal to pair Databricks with a tool like Apache Airflow or Azure Data Factory for automation and scheduling. While Databricks' built-in scheduling may not cover every orchestration need, it integrates seamlessly with native cloud orchestration tools. Additionally, the Databricks APIs can be leveraged to integrate with other schedulers.
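
To make the Sqoop point concrete, here is a minimal PySpark sketch of a JDBC ingestion that mirrors a typical sqoop import. The connection details, table name, and target path are hypothetical placeholders, and the appropriate JDBC driver must be available on the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sqoop-style-jdbc-ingest").getOrCreate()

    # Hypothetical connection details; use a secret scope for credentials in practice.
    jdbc_url = "jdbc:postgresql://db-host:5432/sales"

    orders = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "public.orders")     # roughly Sqoop's --table
        .option("user", "etl_user")
        .option("password", "etl_password")
        .option("partitionColumn", "order_id")  # roughly Sqoop's --split-by
        .option("lowerBound", 1)
        .option("upperBound", 1000000)
        .option("numPartitions", 8)             # roughly Sqoop's --num-mappers
        .load()
    )

    # Land the data as a Delta table in cloud storage instead of HDFS.
    orders.write.format("delta").mode("overwrite").save("s3://my-bucket/raw/orders")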

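For the Flume scenario, a minimal Structured Streaming sketch that consumes a Kafka topic and lands it in cloud storage might look like the following. The brokers, topic, and paths are assumptions for illustration only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("flume-style-kafka-to-storage").getOrCreate()

    # Hypothetical Kafka and storage locations; replace with your own.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "clickstream")
        .option("startingOffsets", "latest")
        .load()
        .select(F.col("value").cast("string").alias("payload"), F.col("timestamp"))
    )

    # Continuously land the stream as Delta files, replacing the Flume HDFS sink.
    query = (
        events.writeStream.format("delta")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/clickstream")
        .outputMode("append")
        .start("s3://my-bucket/raw/clickstream")
    )
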
2. Security & Governance

In the Hadoop realm, security measures involve LDAP integration for connectivity to various admin consoles like Ambari, Cloudera Manager, Impala, and Solr. Authentication is managed through Kerberos, while authorization commonly relies on tools like Ranger and Sentry. These tools play a pivotal role in controlling access and permissions within the Hadoop environment. 

By contrast, Databricks adopts a more streamlined approach with SSO integration, supporting Identity Providers compatible with SAML 2.0. Authorization in Databricks is facilitated through Access Control Lists (ACLs) at the object level, enabling the granular definition of permissions for notebooks, jobs, and clusters. Data permissions and access control extend to table ACLs and views, restricting column and row access. Databricks also offers credential passthrough, ensuring that workspace login credentials are passed to the underlying storage layer (S3, ADLS, Blob Storage) for authorization checks.

For advanced governance features, Databricks provides integration options with enterprise data catalogs like AWS Glue, Alation, and Collibra. Additionally, if attribute-based controls or data masking are required, Databricks allows the incorporation of partner tools such as Immuta and Privacera.

Main Areas of Interest:

  • Authentication (SSO with SAML 2.0): A cloud migration is an excellent opportunity to adopt a Single Sign-On (SSO) provider (such as Azure Active Directory, Google Workspace SSO, AWS SSO, or Microsoft Active Directory) for a unified login experience.

  • Authorization (ACL, IAM, AAD, etc.): Managing access to data and ensuring correct functionality within the environment is crucial. Access Control Lists within Databricks allow the definition of roles (access restrictions) and restrict access to tables and views. For example, the HR team may have access to sensitive employee information, ensuring that only they can access it, while others either won't see the table or column at all or will only see encrypted/tokenized values (see the SQL sketch after this list). Planning should be done in collaboration with the business.

  • Metadata Management (Glue, Alation, Collibra): Managing the metadata of tables and the environment is essential. Databricks provides its own metadata and governance layer (Unity Catalog), and additionally it integrates with external governance and quality tools like Alation or Collibra. For instance, if consistency with other AWS services is desired, the AWS Glue Catalog can be configured as the Hive metastore so that Databricks and those services share the same table definitions.
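
As a concrete illustration of the authorization bullet above, the following is a hedged sketch of a table grant and a dynamic view, issued through spark.sql from a Databricks notebook. The catalog, schema, table, and group names (main.hr.employees, hr_team) are hypothetical; is_member checks workspace-level groups, and Unity Catalog also offers is_account_group_member for account-level groups.

    # Grant the (hypothetical) HR group read access to the sensitive table.
    spark.sql("GRANT SELECT ON TABLE main.hr.employees TO `hr_team`")

    # Expose a view that hides the salary column from everyone outside hr_team;
    # other users see the row but get NULL in the restricted column.
    spark.sql("""
        CREATE OR REPLACE VIEW main.hr.employees_restricted AS
        SELECT
            employee_id,
            department,
            CASE WHEN is_member('hr_team') THEN salary ELSE NULL END AS salary
        FROM main.hr.employees
    """)

    # Access to the view itself can then be granted to broader groups in the same way.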

If necessary, until the migration is complete, a coexistence period should be considered in which the data catalog is visible both on-premises and in the cloud. This allows operations on Parquet tables from both the on-premises cluster and the cloud. To achieve this, an external metastore would be configured in Databricks, as sketched below.
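
As an illustration of that coexistence setup, the cluster-level Spark configuration for an external Hive metastore might look roughly like the snippet below. The metastore version, hostname, driver, and credentials are placeholders, and the exact keys should be validated against the Databricks documentation for your runtime.

    # Hive metastore version used by the existing cluster (placeholder).
    spark.sql.hive.metastore.version 2.3.9
    spark.sql.hive.metastore.jars builtin

    # JDBC connection to the shared metastore database (placeholder values;
    # reference a secret scope rather than hard-coding the password).
    spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://metastore-host:3306/metastore
    spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
    spark.hadoop.javax.jdo.option.ConnectionUserName hive_user
    spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/metastore/hive_password}}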


3. Demystifying Data Consumption in Databricks: SQL & BI

Let's delve into the area of data consumption, a fundamental aspect with two key avenues. 

Firstly, a vital component of the Databricks environment is an intuitive visual interface for SQL querying, fostering exploratory analysis. In the Hadoop landscape, interfaces like Hive and Impala serve a dual role in ETL processes and ad hoc queries. In the Databricks domain, similar functionality is achieved through Databricks SQL, which stands out for its performance fueled by the robust Delta engine and Photon engine. Additionally, Databricks offers an integrated SQL UX, providing SQL users with a dedicated workbench for data transformations and exploratory analysis directly in the data lake.

Furthermore, Databricks has introduced a new feature called Databricks Assistant, enhancing the SQL experience. This contextual artificial intelligence assistant seamlessly integrates into Databricks Notebooks and the SQL editor. With Databricks Assistant, users can pose questions in natural language, and it provides suggested SQL queries in response. Also, users have the option to ask the assistant to explain the query in natural language, enabling them to grasp the logic behind the query results. This novel functionality adds another layer to the integrated SQL UX, enriching the overall SQL querying and analysis capabilities within the Databricks environment.

In the second avenue, Databricks seamlessly integrates with popular BI tools such as Tableau, Qlik, PowerBI, and Looker. The platform features highly optimized JDBC/ODBC connectors ensuring efficient communication with these BI tools. Databricks JDBC/ODBC drivers, with minimal overhead and faster transfer speeds using Apache Arrow, facilitate accelerated metadata retrieval operations, enhancing overall performance. This eliminates the need to unnecessarily move data to a data warehouse or other platforms, streamlining the entire analytical process within the Databricks ecosystem. 
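
Beyond BI tools, the same SQL warehouses can be queried programmatically. A hedged sketch using the open-source databricks-sql-connector Python package follows, with the hostname, HTTP path, and token as placeholders.

    from databricks import sql  # pip install databricks-sql-connector

    # Placeholder connection details for a Databricks SQL warehouse.
    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abc123def456",
        access_token="dapiXXXXXXXXXXXXXXXX",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT current_date() AS today")
            for row in cursor.fetchall():
                print(row)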

Delta Engine's execution performance, coupled with these BI connectors, makes it possible to offload data volumes, reports, and other workloads from traditional data warehouses, ultimately contributing to a leaner, more optimized warehousing environment.


Final Conclusion 

This concludes our exploration into the migration journey from Hadoop to Databricks. While there's much more to delve into, we're excited to venture into diverse topics in our upcoming blogs. Stay tuned for our weekly insights – your continued engagement fuels our passion for knowledge-sharing. See you in the next compelling chapter!
