Moving from IBM DB2 & DataStage to Databricks (Pt. 1)

Introduction

For many years, IBM products have been the preferred choice for large institutions (mainly banks). Many professionals have shared stories about the durability of the IBM “irons” (AS/400), which have survived fires, falls from second floors, and other surprising incidents. The exceptional reliability of these machines, together with their robustness, security, performance, and fully integrated architecture, led customers not only to trust these servers fully for the operation of critical applications but also to adopt new products from the IBM portfolio designed to operate in perfect harmony with them, such as IBM DB2.

Over time, clients have ended up in a relationship of deep dependency on software and hardware. This relationship is similar to a marriage in which both parties have accumulated numerous assets in common, making it difficult to split up, even when it might be beneficial to go their separate ways. Many financial institutions face this situation with a heavy technological legacy. However, a striking fact is that while the decommissioning of platforms such as Cloudera, SAS, or Hortonworks has become a norm in the sector, the same can’t be said for IBM. Migrations from DB2 are infrequent, and, in my opinion, this is due to a dependency that goes beyond the software.

Despite all the barriers to exiting this ecosystem, it is only a matter of time before institutions adopt new technologies. And once that trend takes hold, there will be no turning back (personal opinion). This shift will not happen solely for greater efficiency or lower costs, but because competitors will offer services they cannot match, ultimately pushing them out of the market. That's why today we dive into a new episode focused on one of the most widely used data stacks in financial institutions and two of its main components: IBM DB2 and IBM InfoSphere DataStage.


Migration Strategy Overview

We’ll begin by presenting the typical IBM data ecosystem, detailing its main components and how they work. Next, we’ll analyze the equivalents of these components within Databricks. This analysis will allow us to understand how to transition from the IBM stack to Databricks efficiently.

With this context established, we will focus exclusively on two of the IBM ecosystem's most critical and widely used components: IBM DB2 and InfoSphere DataStage. We’ll examine in depth how to approach their migration to Databricks, detailing the considerations, alternatives, and best practices needed to ensure an efficient and seamless transition.

For now, we’ll omit other components of the InfoSphere vertical to keep the focus on DB2 and DataStage, which constitute the core of the data stack and represent approximately 80% of the migration project.


IBM Data Ecosystem

The situation is similar to our previous blog about SAP: we are facing an extensive portfolio of products and services, some of which have been replaced by others, adding a layer of complexity. However, by focusing only on the “data” aspect, the explanation is simplified and much more manageable.

  • IBM DB2: It’s a relational database ideal for transactional and operational applications requiring high reliability, performance, and security. Released in 1982, DB2 has evolved to support various operating systems and deployment models, including cloud, hybrid, and virtualized environments (VMware).

    Not only is it one of IBM's most iconic and oldest products still in existence, but it’s also highly optimized for IBM mainframes and the IBM iSeries platform, as it’s deeply integrated with the operating system and hardware. 

    To avoid confusion, it’s key to clarify that although IBM DB2 can be used within a data warehouse environment, and it’s common to see it this way, IBM has developed a specific version called DB2 Warehouse. This version is based on massively parallel processing and designed for analytical queries.

  • IBM InfoSphere DataStage and QualityStage: They are two components that work together to offer a comprehensive solution in data management and modeling. DataStage is the ETL tool that allows you to integrate data from multiple sources and design complex transformation processes. For its part, QualityStage focuses on managing and improving data quality, with objectives such as cleaning, standardizing, and enriching information.

    DataStage and QualityStage are tightly integrated. Both tools share a common interface, making it easy for users to design, manage, and monitor processes encompassing data integration and quality in a single environment. In other words, although they are theoretically two separate tools, they usually work together within the same interface.

The tool’s design is practical and functional. Visually, it’s not too different from other ETL tools, although its appearance may seem somewhat dated compared to modern ones.

  • IBM InfoSphere Governance Catalog: IBM's solution for data governance, designed to catalog, manage, and ensure compliance with data policies. It provides traceability and access control for sensitive information.

It is essential to highlight the existence of IBM Cloud Pak for Data, which has integrated many of the capabilities of the classic IBM components, becoming what we could consider their “evolution” within the data stack, especially in hybrid and cloud environments. In my experience, most IBM clients still use traditional tools such as Governance Catalog or DataStage. However, it’s crucial to keep in mind that Cloud Pak for Data offers a partially equivalent solution, with differences in scope and functionality, which may introduce certain variations from what was explained above.

  • IBM Information Analyzer: It’s a tool designed to analyze and evaluate the data quality within an organization. Its main purpose is to provide companies with a clear view of their data structure, content, and quality, allowing them to identify problems and opportunities for improvement. Information Analyzer does not replace IBM QualityStage but complements it by addressing different aspects of the data lifecycle.

While Information Analyzer focuses on analyzing and generating detailed metrics on data quality, QualityStage uses these metrics to execute cleaning, standardization, and deduplication processes. Together, both tools form a comprehensive solution that guarantees reliable, consistent data suitable for analysis and strategic decision-making.

  • IBM InfoSphere Information Server: The central platform that integrates the tools for data management, integration, quality, and governance described above. In other words, DataStage, QualityStage, and Information Analyzer are, from an architectural point of view, part of this ecosystem or product grouping called Information Server.

  • Others: IBM also has tools within its analytics and data science suite, such as IBM Cognos Analytics for BI and IBM Watson Studio for data science and AI development. We won't delve into these tools in this article, as they tend to have a limited presence in customer data stacks (although Cognos is more common). 

Migration to Databricks

One key activity when designing a target architecture is understanding the purpose and role each component plays in the source architecture. This ensures that when proposing a new solution, we maintain the functionality necessary to meet business needs.

To achieve this, it’s key to compare the functionalities of the current architecture’s components, described in the previous section, with their counterparts (in this case) in Databricks. It’s important to note that some functionalities may have multiple alternatives within the Databricks ecosystem and can be complemented with cloud or third-party services, allowing you to choose the most appropriate option depending on the case.

This versatility is not exclusive to platforms like Databricks or Snowflake, although it’s more natural in cloud-agnostic environments. Combining different components is possible in an ecosystem like IBM's, for example, but doing so usually runs into barriers inherent to the vendor's ecosystem, often leading customers to adopt "the full suite" to avoid integration complications or technical limitations.

Source: SunnyData

In contrast, on platforms like Databricks or Snowflake, there is a greater freedom to integrate tools from various vendors (and make them work well). For example, any ETL solution can be used, such as Azure Data Factory, AWS Glue, Fivetran, or even the platform's native tools. This flexibility allows you to select the best option for each case without being limited to a single ecosystem.

There will always be dependencies, but it is almost impossible to reach the level of dependency seen in environments such as IBM's, given the wide range of internal and external alternatives available for each situation, and the absence of hardware, operating system, and ecosystem lock-in. In fact, to prove the point, performing a migration from Snowflake to Databricks is pretty simple.

Databricks and IBM Component Map

Here, we explore the natural migration options available for each component of the IBM stack to the Databricks ecosystem. We also highlight particular scenarios where alternative solutions can be utilized within the Databricks environment, allowing you to choose what best fits your unique needs.

Source: SunnyData

  • IBM DB2/DB2 Warehouse → Delta Lake and Databricks SQL. Databricks unifies transactional and analytical processing within its Lakehouse platform, leveraging Delta Lake to support ACID transactions and advanced analytical queries in a single environment. This enables structured and semi-structured data to be managed efficiently, offering modern real-time storage and analysis capabilities, all integrated into a scalable, cloud-ready ecosystem.
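As a rough illustration, here's a minimal PySpark sketch of how one Delta table serves both transactional writes and analytical queries. It assumes a Databricks notebook (where `spark` is predefined); the `main.core_banking` schema and table names are hypothetical:

```python
from pyspark.sql import Row

spark.sql("CREATE SCHEMA IF NOT EXISTS main.core_banking")

# Delta provides ACID transactions, so concurrent readers and writers stay consistent
df = spark.createDataFrame([Row(account_id=1, balance=1500.0),
                            Row(account_id=2, balance=320.5)])
df.write.format("delta").mode("append").saveAsTable("main.core_banking.accounts")

# The same table serves analytical queries, e.g. from a Databricks SQL warehouse
spark.sql("""
    SELECT account_id, balance
    FROM main.core_banking.accounts
    WHERE balance > 1000
""").show()

# Time travel: read the table as of an earlier version
spark.sql("SELECT COUNT(*) AS rows_v0 FROM main.core_banking.accounts VERSION AS OF 0").show()
```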


  • IBM DataStage → Databricks Workflows & Delta Live Tables. At Databricks, the development of data pipelines differs significantly from classic interfaces such as IBM DataStage. Instead of a traditional graphical interface, Databricks uses interactive notebooks, which are more flexible and user-friendly, especially for modern users who prefer to work with code. In addition, Databricks includes an assistant based on generative artificial intelligence, capable of providing contextualized support. This assistant can suggest, convert, fix, and optimize code, simplifying development and improving productivity.

The platform also expands data ingestion options by integrating with specialized tools such as Fivetran or cloud-native services, such as Azure Data Factory and AWS Glue, offering scalable and efficient alternatives to cover a wide range of data integration and transformation needs.
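To make the contrast with DataStage's graphical canvas concrete, here's a minimal Delta Live Tables sketch in Python. The landing path and column names are hypothetical; each decorated function plays roughly the role a stage would play in a DataStage job:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw customer extracts landed from Db2")
def customers_bronze():
    return (spark.readStream.format("cloudFiles")        # Auto Loader ingestion
            .option("cloudFiles.format", "parquet")
            .load("/Volumes/main/raw/customers"))        # hypothetical landing path

@dlt.table(comment="Silver: cleansed customers, akin to a DataStage Transformer stage")
@dlt.expect_or_drop("valid_id", "customer_id IS NOT NULL")  # declarative quality rule
def customers_silver():
    return (dlt.read_stream("customers_bronze")
            .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
            .withColumn("load_ts", F.current_timestamp()))
```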


  • IBM QualityStage → DQX by Databricks Labs. Released by Databricks Labs on January 21, 2025, DQX is designed to integrate seamlessly into data pipelines developed in notebooks or DLT, allowing you to evaluate data quality efficiently and providing detailed information on why specific rows or columns present quality problems. It supports both batch workloads and real-time data streams, making it easy to integrate into various data processing environments.

Additionally, Databricks allows DQX to be combined with external tools or specialized libraries, such as Great Expectations, to cover advanced needs such as schema validation, data cleansing, and enrichment. This ensures that quality processes are robust and scalable across modern data architectures.
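Without reproducing DQX's exact API, the pattern it implements (evaluate rules row by row, then split valid rows from a quarantine set with the reasons attached) can be sketched in plain PySpark. The table and rule names below are hypothetical:

```python
from pyspark.sql import functions as F

df = spark.table("bronze.customers_db2")  # hypothetical input table

rules = {
    "null_customer_id": F.col("customer_id").isNull(),
    "negative_balance": F.col("balance") < 0,
}

# Attach the names of every violated rule to each row (array_compact needs Spark 3.4+)
flagged = df.withColumn(
    "dq_errors",
    F.array_compact(F.array(*[F.when(cond, F.lit(name)) for name, cond in rules.items()])),
)

valid = flagged.filter(F.size("dq_errors") == 0).drop("dq_errors")
quarantine = flagged.filter(F.size("dq_errors") > 0)  # keeps the "why" for each bad row
```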


  • IBM InfoSphere Information Analyzer → Databricks Data Profile. Information Analyzer's data profiling and analysis capabilities can be partially addressed using Databricks Data Profile. This tool allows users to perform quick, detailed analysis of data stored in tables or DataFrames, providing insights such as value distributions, identification of null columns, unique values, and anomaly detection. For more advanced scenarios, such as detecting complex patterns or dependencies between columns, it can be complemented with specialized tools like Great Expectations.
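In a notebook, the profile is one call away. A minimal sketch (the table name is hypothetical):

```python
df = spark.table("bronze.customers_db2")  # hypothetical table

# Renders the same interactive profile as the "Data Profile" tab: distributions,
# null percentages, distinct counts, and basic statistics per column
dbutils.data.summarize(df)

# A code-level complement: summary statistics as a DataFrame
df.summary("count", "min", "max").show()
```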


  • IBM InfoSphere Governance Catalog → Unity Catalog. It’s the star component (along with Genie) of recent years and will probably gain more functionality and scope in the future. It centralizes governance and data cataloging in Lakehouse environments, allowing you to manage metadata, trace data lineage, manage access controls, and provide comprehensive governance in a native, transversal, and extensible way across all data products developed on the platform.
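As a small taste of how Governance Catalog policies translate, here's a sketch of Unity Catalog governance expressed as SQL from a notebook. The catalog, schema, group, and table names are hypothetical, and the tagged table is assumed to already exist:

```python
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.core_banking")

# Fine-grained, centrally managed access control
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `analysts`")
spark.sql("GRANT SELECT ON SCHEMA finance.core_banking TO `analysts`")

# Tags classify sensitive data for auditing and compliance reporting
spark.sql("ALTER TABLE finance.core_banking.customers "
          "SET TAGS ('classification' = 'pii')")
```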


  • IBM Cognos Analytics → Databricks Dashboards/Genie. The BI tool may fall outside the direct scope of the migration: if the same data model is maintained, the transition can be reduced to a “plug-and-play” activity. That is, it’s possible to migrate to Databricks and continue using Power BI, Sigma, or any BI tool you choose without significant interruptions.

Alternatively, Databricks also offers internal options, such as the integrated Databricks Dashboard, which allows you to create visualizations directly (using natural language). Advanced tools like Genie will enable you to interact with data conversationally, adding accessibility and simplicity for non-technical users.

  • IBM Watson Studio → Databricks Machine Learning.  Databricks is the platform par excellence for data scientists, offering a complete and unified solution to manage the lifecycle of ML and AI models. The team can work collaboratively on interactive notebooks, using the programming language of their choice and their favorite libraries. 

    Additionally, Databricks provides access to CPU and GPU resources for model training, integrating tools such as MLflow for model tracking, registration, and deployment, and AutoML for rapid testing and automatic optimization. The platform also includes Apps for publishing applications, facilitating interaction and the deployment of solutions, among many other functionalities.
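For a flavor of that workflow, here's a minimal MLflow sketch; the scikit-learn model, synthetic data, and run name are illustrative:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.autolog()  # logs parameters, metrics, and the model automatically

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="credit-risk-baseline"):  # hypothetical run name
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
```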


  • IBM Streams → Structured Streaming in Databricks. It’s the closest equivalent, although the two are not exactly the same. IBM Streams is a dedicated solution for data streams, while Structured Streaming goes further by unifying batch and real-time processing on a single platform, dramatically simplifying management and complexity. It’s fully integrated into the platform, simply another “method” of data management, and a more modern, scalable, and performant solution.
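A minimal sketch of that unified model, assuming a Kafka source (broker, topic, checkpoint path, and target table are hypothetical); the same DataFrame API serves batch and streaming:

```python
from pyspark.sql import functions as F

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "transactions")                # hypothetical topic
          .load())

parsed = events.select(F.col("value").cast("string").alias("payload"),
                       F.col("timestamp"))

(parsed.writeStream
       .format("delta")
       .option("checkpointLocation", "/Volumes/main/chk/transactions")  # hypothetical
       .trigger(availableNow=True)   # process what's pending and stop: batch-style runs
       .toTable("main.streaming.transactions_raw"))
```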

Now that we have analyzed and compared both ecosystems, it’s time to address how to migrate the two main components, IBM DB2 and DataStage, to the Databricks platform.

IBM DB2 to Databricks

Source: SunnyData

Like any migration, the challenge lies not in copying data or connecting systems, but in planning, managing the variables that usually complicate or delay these projects (which often depend on more than one factor: client access, bandwidth, maintenance windows, dependencies, etc.), and defining a controlled strategy that does not affect the client's operations and allows gradual progress in phases.

In this sense, it’s essential to consider aspects such as the coordination of maintenance windows with the client, the management of dependencies between systems that share the database, the quality of the connection or the available bandwidth so as not to saturate the network, the contingency plans in case of failures, the validation and testing of information (counts, checksums, referential integrity), and the responsibilities of each team involved in the migration. 

Only in this way can an orderly transition be ensured, minimizing operational risks and avoiding delays caused by factors external to the mere copying of data. From a technical and data integration point of view, IBM Db2 is simply another database, so it does not present any exceptional complexity regarding access to information. 

There are three main alternatives: 

  1. Create a JDBC connection to Db2 (see the sketch after this list)

  2. Use an ETL tool (ADF, Fivetran, Talend, or equivalent)

  3. Perform a direct export of flat files for later loading into the Data Lake 
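For option 1, a minimal PySpark sketch of a parallel JDBC read from Db2. The host, database, secret scope, table, and partition bounds are all hypothetical, and the Db2 JDBC driver must be installed on the cluster:

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/BANKDB")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "CORE.CUSTOMERS")
      .option("user", dbutils.secrets.get("db2-scope", "user"))       # hypothetical scope
      .option("password", dbutils.secrets.get("db2-scope", "password"))
      .option("fetchsize", "10000")               # larger fetches cut network round trips
      .option("partitionColumn", "CUSTOMER_ID")   # numeric column for parallel reads
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")
      .load())

df.write.mode("overwrite").saveAsTable("bronze.customers_db2")  # land in the lake
```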

Usually, the most recommended option is to use an ETL tool, since it offers ready-to-use connectors and makes it easier to manage some of the environment variables; furthermore, it is rare for a migration to be limited to a single "one-shot" process. The configuration related to networking and security will likely cause the biggest headaches, especially if the client has a high level of bureaucracy (and it probably will, since your client is likely a bank). It is therefore essential to define with the client how the connection to the Db2 instance will be established (through a VPN, a VPC, or another secure connection).

Once the data is in the data lake, the usual activities for any migration between platforms apply (you can consult our other blogs): mapping data types, ensuring they match the original attributes, validating the need to clean data, formatting columns, normalizing encodings or eliminating columns, registering tables in catalogs, and applying quality tests and validations, among others.
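As an example of the validation step, a small sketch comparing row counts and a column checksum between the landed copy and the curated table (all names hypothetical):

```python
from pyspark.sql import functions as F

src = spark.table("bronze.customers_db2")   # copy landed from Db2
tgt = spark.table("silver.customers")       # curated Delta table

assert src.count() == tgt.count(), "Row counts diverge"

def checksum(df):
    # Order-independent checksum over a key column
    return df.agg(F.sum(F.crc32(F.col("customer_id").cast("string")))).first()[0]

assert checksum(src) == checksum(tgt), "Checksum mismatch on customer_id"
```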

IBM InfoSphere DataStage to Databricks

Source: SunnyData

The DataStage migration process is more consultative than technical. Let me explain: the main activity here is linked to the analysis of processes, the review of transformations, the analysis of dependencies (scripts, UNIX/Windows sequences, triggers, and other external processes), and then the refactoring of flows on the new platform.

We will probably encounter both transactional and analytical ETL processes. For this type of client, the critical ones are normally the transactional ones. They are the most complex, but at the same time the ones that demand the most resources and that can benefit the most from a platform like Databricks. Hence, it is often advisable to start there.

It’s very useful to rely on accelerators that can map and interpret source processes and help with code refactoring. At SunnyData, we have our own migration accelerator based on GenAI, which reduces project timelines by 60% to 80% by automating process discovery and mapping, process conversion, and validation and testing (currently in the certification process as a Brickbuilder Solution).

In the next post, we will delve into the architectural perspective, offering a detailed step-by-step guide to the migration from IBM Db2 and DataStage to Databricks. Looking forward to seeing you next week!
