Teradata to BigQuery Migration: Challenges, Best Practices, and Practical Guide

We’re not going to sugarcoat it: Migrating from Teradata to BigQuery is complex. Don’t worry—our guide will simplify and streamline this process for you. We’ll illuminate the path, highlighting the key contrasts between Teradata’s traditional on-premises system and Google BigQuery's innovative cloud-based platform. We’ll explore these differences in depth, focusing on storage solutions, data processing capabilities, and scalability options.

We’ll also address critical business topics, ensuring you make a smooth transition to the cloud. And no cloud database migration guide would be complete without guidance to secure stakeholder buy-in, optimize your budget, and minimize operational disruptions. You’ll also get practical tips to make this migration easy, giving you everything you need to prove you did it right. (Pre-emptive high five!)

Common Teradata to BigQuery migration challenges

When data engineers move their data from Teradata to BigQuery, they usually run into a few challenges. It takes a lot of detailed planning and careful work to make sure it’s done correctly. So, we’re going to dive into the stuff that usually throws these migrations off track: architecture and SQL dialects. When you understand how they’re different, you’ll have a better shot at a smooth transition. You’ll also see how these differences affect data distribution, query performance, and scalability.

Architecture differences

The architectural differences between Teradata and BigQuery are substantial, starting with the basics of how they work. Cloud-based platforms like BigQuery have very different approaches to storage and computation. Legacy platforms like Teradata have those two things all intertwined like spaghetti.

As you can imagine, going from an on-premises solution to a cloud-based one is a big deal. Transitioning from physical infrastructure to a virtual, service-oriented architecture requires a big shift in how you think about and manage data scalability, storage, and computing resources.

Teradata uses a Massively Parallel Processing (MPP) architecture, which spreads data across multiple nodes for concurrent processing. It closely integrates data storage with computational power, which is important to understand as you migrate to BigQuery's distinct architecture.

Teradata architecture diagram courtesy of Javatpoint

Teradata’s architecture explained

Teradata clusters are composed of a network of Access Module Processors (AMPs), which serve as the core drivers of its data processing prowess. Each AMP operates on an individual server instance, bearing a share of both computational and storage duties for the cluster. In this configuration, storage and compute resources are tightly coupled, with specific data segments being stored and processed on the same server. Run out of one or the other and the whole ship goes down. :( 

This integrated approach is at the heart of how Teradata's data management and query performance work. Similar to other Massively Parallel Processing (MPP) systems, such as AWS Redshift, Teradata uses a hashing algorithm to evenly distribute data across its AMPs.

However, this method presents several challenges in terms of scalability and performance:

  1. Reliance on the hashing key for query efficiency: Query performance heavily depends on alignment with the hashing key. If the joins and filters in a query don't correspond with the hashing key, performance suffers because data must be redistributed across different AMPs (see the sketch after this list).
  2. Scaling constraints: Expanding the system’s storage or computational power, necessary for handling more queries, involves adding more nodes. Scaling your database means scaling the entire infrastructure. Aside from being costly, adding nodes also necessitates redistributing data among both new and existing nodes. Although the recent introduction of Teradata VantageCloud offers compute-only clusters, these still rely on a Primary Cluster for storage, which can become a scalability bottleneck.
  3. Limited elasticity in compute demand: Teradata's architecture isn't very flexible when it comes to scaling for query processing. Adjusting the cluster's available compute is a complex task, often leading to either underused resources or performance bottlenecks. It gets expensive quickly and isn't ideal for an enterprise data warehouse.
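To make the hashing-key point concrete, here's a minimal sketch in Teradata SQL. The tables and columns (Orders, Customers, CustomerID, OrderDate) are hypothetical, we're assuming Customers shares the same primary index, and actual plans depend on statistics and the optimizer:

-- Teradata distributes rows across AMPs by hashing the PRIMARY INDEX column(s)
CREATE TABLE Orders (
    OrderID     INTEGER,
    CustomerID  INTEGER,
    OrderDate   DATE,
    OrderAmount DECIMAL(12,2)
) PRIMARY INDEX (CustomerID);

-- Likely efficient: the join key matches the hashing key, so AMPs can join locally
SELECT o.CustomerID, SUM(o.OrderAmount)
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
GROUP BY o.CustomerID;

-- Likely slower: aggregating on OrderDate doesn't align with the hashing key,
-- so rows generally need to be redistributed across AMPs first
SELECT OrderDate, SUM(OrderAmount)
FROM Orders
GROUP BY OrderDate;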

BigQuery architecture explained

BigQuery’s architecture is cloud native and serverless. It separates compute and storage functions, making it more scalable and flexible, while providing efficient resource usage.

BigQuery’s standout feature is its fully-managed, serverless data warehouse. Unlike Teradata, it automatically scales. As a result, compute resources are dynamically allocated and adjusted based on the workload. During times of high usage, such as business hours when data analysts are running complex queries or generating data analytics reports, BigQuery automatically ramps up to handle the load and scales down during periods of lower activity. Cheap, fast, and easy! Sweet!

Its adaptability is perfectly suited to meet varying data processing requirements. BigQuery's management is user-friendly and accessible (bonus!), offering intuitive controls through an API or a user interface. Even users with limited technical expertise can efficiently manage and interact with their data.

Benefits of migration to BigQuery

Embracing BigQuery as your data warehousing solution has many advantages, like operational efficiencies and enhanced data processing capabilities. Here are just a few big benefits:

  1. Reduced DevOps effort: Transitioning to BigQuery significantly eases the burden associated with DevOps tasks. And thank goodness, because DevOps people are always super busy. BigQuery simplifies maintenance and optimization. Its serverless nature means spending less time on maintenance and tuning.
  2. Shift to ELT (Extract, Load, Transform) approach: With Teradata, you typically need to filter and transform your data before you load it (ETL vs. ELT). This is how pipelines were designed pre-cloud, because storage and compute were limited and not dynamically scalable. BigQuery, with its cloud native architecture, supports a more efficient ELT model: data can be loaded in its raw form and transformed inside BigQuery as needed, giving data analysts more flexibility.

  3. Simplified data management and flexibility: Efficient data management in a Teradata data warehouse requires careful planning of table layouts, indices, and data distribution strategies. BigQuery automates most of this complexity. While it supports table partitioning and clustering for performance optimization, it doesn't require the intricate setup and planning that Teradata does (see the sketch after this list). This streamlined approach makes it easier to organize and manage data.
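Here's a hedged sketch of what that looks like in BigQuery DDL (the dataset, table, and columns are hypothetical): partitioning and clustering are inline declarations, with no indexes or distribution keys to plan.

-- Partitioning and clustering are one-line declarations in BigQuery
CREATE TABLE mydataset.orders (
  order_id     INT64,
  customer_id  INT64,
  order_date   DATE,
  order_amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id;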

Dialect differences between Teradata SQL and BigQuery SQL

The SQL used by Teradata is pretty different from the SQL used by BigQuery. It’s different enough that they have distinct approaches to basic queries. This makes it difficult to run SQL on one system and expect it to work in the other. You’ll especially see these compatibility issues when transferring your stored procedures, functions, and complex queries. When you understand the differences, you can carefully adapt your queries from one system to the other.

Teradata uses its variant of SQL, known as Teradata SQL. It includes unique syntax and functions tailored to its architecture. For instance, Teradata features specialized join types like MERGE JOIN, optimized for its parallel processing environment. It also offers proprietary functions for data manipulation and analysis and specific syntax for managing its unique indexing and partitioning features.

BigQuery standard SQL, based on ANSI SQL, enables efficient querying of large cloud datasets. It has modern features like array aggregation and user-defined functions (UDF). Tailored for big data analytics, it offers comprehensive functions for structured and semi-structured data like JSON and arrays. It also plays quite nicely with Google's cloud ecosystem. BigQuery standard SQL dialect supports real-time analysis and machine learning, enhancing BigQuery's scalability for complex queries on large datasets.
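As a small, hypothetical taste of those features, here's a SQL UDF combined with array aggregation in BigQuery (the dataset, columns, and tax rate are made up for illustration):

-- A temporary SQL UDF plus array aggregation, two features BigQuery's
-- dialect layers on top of ANSI SQL
CREATE TEMP FUNCTION add_tax(amount NUMERIC)
RETURNS NUMERIC AS (amount * NUMERIC '1.08');

SELECT
  customer_id,
  ARRAY_AGG(add_tax(order_amount) ORDER BY order_date) AS taxed_amounts
FROM mydataset.orders
GROUP BY customer_id;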

Dialect differences between Teradata and BigQuery: The QUALIFY clause

A notable distinction between Teradata and BigQuery is in handling queries that select a top number of rows with potential ties. In Teradata, the TOP WITH TIES clause is commonly used, which returns the top N rows from a result set, along with any additional rows that tie for the last place in the top N based on the ordering criteria.

In BigQuery, you can get the same result using the QUALIFY clause with a ranking window function. Be careful about which function you pick: ROW_NUMBER() assigns a unique number to every row, so ties at the cutoff get arbitrarily dropped, while RANK() gives tied rows the same rank, which is what replicates WITH TIES behavior. For example, to return the top 10 orders by OrderAmount, including any orders that tie with the 10th order, the query in BigQuery is structured differently, using QUALIFY with RANK() to achieve the same outcome as the TOP WITH TIES clause in Teradata.

Example query: Teradata SQL and BigQuery SQL

Let's explore a typical query used in the Teradata environment to illustrate its SQL capabilities. The following example demonstrates a basic query to select the top 10 orders by amount:


SELECT TOP 10 WITH TIES *
FROM Orders
ORDER BY OrderAmount DESC;

To achieve the same result in BigQuery as the TOP 10 WITH TIES clause in Teradata, you can use the QUALIFY clause with the RANK() window function; no subquery or Common Table Expression (CTE) is required. (BigQuery requires QUALIFY to be paired with a WHERE, GROUP BY, or HAVING clause, hence the WHERE TRUE.) Here's the equivalent query in BigQuery SQL:


SELECT *
FROM Orders
WHERE TRUE
QUALIFY RANK() OVER (ORDER BY OrderAmount DESC) <= 10;

Here, RANK() assigns a rank to each order based on OrderAmount, and the QUALIFY clause filters for ranks up to 10. Because tied orders receive the same rank, any orders that tie with the 10th are included, matching the behavior of Teradata's TOP WITH TIES.

BigQuery's SQL dialect, while comprehensive, differs notably from Teradata's, particularly in Data Definition Language (DDL) features and command syntax. Teradata's DDL intricacies and shortcuts like SEL for SELECT aren't directly mirrored in BigQuery. 
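As a small, concrete illustration of one such shortcut (the table and columns are hypothetical):

-- Valid in Teradata: SEL is shorthand for SELECT
SEL OrderID, OrderAmount FROM Orders;

-- BigQuery requires the full keyword
SELECT OrderID, OrderAmount FROM Orders;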

Using a SQL translator to migrate from Teradata SQL to BigQuery SQL

A SQL translator tool can streamline the transition from one SQL dialect to another, making it much easier and faster to migrate extensive codebases. A good SQL translator preserves the original business logic within the legacy SQL code. It also minimizes manual code rewriting and decreases the likelihood of errors.

For example, Datafold's SQL Dialect Translator simplifies the conversion and automatically adapts code to the new dialect. It streamlines the migration process by eliminating the time-consuming search for dialect-specific functions, like `date_trunc` in BigQuery, significantly saving time and resources.

Datafold's SQL translator can translate SQL scripts across dialects with the click of a button
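To make this concrete, here's a hypothetical example of the kind of rewrite a translator automates: truncating a date to the start of its month differs between the two dialects.

-- Teradata: TRUNC with a format string
SELECT TRUNC(OrderDate, 'MON') FROM Orders;

-- BigQuery: DATE_TRUNC with a date part keyword
SELECT DATE_TRUNC(OrderDate, MONTH) FROM Orders;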

Optimizing schema management in BigQuery migrations

Before you move any data to BigQuery, make sure you understand the table schema on both systems. Look at your column names, data types, and every other attribute. It's possible you'll need a custom schema for your migration, so we encourage you to review schema files carefully and precisely match the structures between your legacy Teradata system and BigQuery.

This process is known as schema mapping. Doing it in BigQuery involves two key steps. First, create a schema file that precisely reflects the structure of your source tables and your enterprise data needs. Second, after you move the data, test and verify that the schema mapping worked as expected.
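Here's a hedged sketch of what that mapping can look like for a single, hypothetical table (only a few of the many type mappings are shown):

-- Teradata source table
CREATE TABLE Orders (
    OrderID     INTEGER,
    OrderDate   DATE,
    OrderAmount DECIMAL(12,2),
    Status      VARCHAR(20)
) PRIMARY INDEX (OrderID);

-- Equivalent BigQuery table: INTEGER maps to INT64, DECIMAL to NUMERIC,
-- VARCHAR to STRING; PRIMARY INDEX has no BigQuery counterpart and is dropped
CREATE TABLE mydataset.orders (
  order_id     INT64,
  order_date   DATE,
  order_amount NUMERIC,
  status       STRING
);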

Test everything!

Leveraging Google Cloud Storage 

Consider using Google Cloud Storage as a preliminary step for staging your data. Staging might facilitate a more seamless and efficient data transfer to BigQuery. Google Cloud Storage is robust and scalable, especially when handling large volumes of data.
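For example, once files are staged in a bucket, a single BigQuery SQL statement can load them (the bucket path and table name below are hypothetical):

-- Load staged Parquet files from Cloud Storage into a BigQuery table
LOAD DATA INTO mydataset.orders
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-staging-bucket/orders/*.parquet']
);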

BigQuery Data Transfer Service 

For organizations looking to automate and streamline their data transfer process, the BigQuery Data Transfer Service can be a valuable asset when migrating from Teradata. It simplifies the movement of data from various sources into BigQuery and enhances the efficiency of schema mapping and data integration tasks.

Business challenges in migrating from Teradata to BigQuery

Transitioning to BigQuery presents not only technical but also non-technical business challenges: stakeholder alignment, budget management, and the impact on ongoing operations. You'll want to focus on minimizing disruption and aligning the migration with business goals. Here are three of the most common business challenges for data migrations:

  • Stakeholder alignment: Getting all your stakeholders to actually agree is as difficult as it is important. Engage people from different departments early in the planning stage. Secure their support and address any concerns or misunderstandings about the transition. Communicate and manage expectations to keep the effort on track.
  • Budget considerations: Ah, the classic problem of money. The costs of a data migration go way beyond tools and resources. You'll need to run both the old and new systems concurrently for a while, which also means potential downtime costs, training for staff, and all the time and energy to keep the business running. Do as much budget planning as you can before the migration starts, and be as comprehensive and inclusive of everyone's requirements as possible (within reason, of course).
  • Minimizing downtime: Downtime can have a profound impact on business operations and revenue. You must maintain continuity during the migration. Consider a phased data migration: moving data incrementally can significantly reduce downtime. We highly recommend a lift-and-shift migration for this and a dozen other reasons.

Test absolutely everything before you turn off the old system. It’s the only way to ensure a seamless transition with minimal disruption. 

4 best practices for Teradata to BigQuery migration

The transition from Teradata to BigQuery involves a complex landscape of database migration challenges. To succeed, you need to analyze your current Teradata setup, considering its structure, size, and complexity. Then you can plan a more effective and tailored migration strategy. 

The following strategies can help you improve the efficiency and increase your chances of success:

  1. Have a strategy and prioritize your data assets: Begin by identifying and ranking your critical data assets using column-level lineage analysis. This’ll help you determine which assets are most important. Migrate your data consumption endpoints before your data production pipelines to reduce the burden on the old system during the migration process.
  2. Lift-and-shift approach: Implement a lift-and-shift strategy, leveraging tools like Datafold’s SQL Dialect Translator. Try to move your data in its current form without remodeling or refactoring. It’ll simplify the migration, lower risks and resource demands, and speed up the entire process.
  3. Document your strategy and action plan: A comprehensive, well-documented migration strategy and action plan will provide clarity, facilitate stakeholder agreement, and allow for the early identification and correction of potential issues. Good documentation keeps teams aligned throughout the migration journey.

  4. Use automated validation tools: Tools like Datafold Cloud's cross-database data diffing help validate data quality and consistency between the legacy and new systems by identifying value-level differences between two tables across systems. Automated validation streamlines the process, speeds up approval from stakeholders, and confirms the accuracy and integrity of the migrated data.

An example of the value-level differences produced by a cross-database comparison in Datafold
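Automated value-level diffing is the thorough option, but quick aggregate "fingerprint" checks are a useful complement. Here's a hedged sketch (hypothetical table and columns) of a query you could run on both systems and compare:

-- Run on Teradata and on BigQuery, then compare the results.
-- Matching counts and sums are a necessary but not sufficient parity signal;
-- value-level diffing catches what aggregates miss.
SELECT
  COUNT(*)                   AS row_count,
  SUM(OrderAmount)           AS total_amount,
  COUNT(DISTINCT CustomerID) AS distinct_customers
FROM Orders;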

Putting it all together: Teradata to BigQuery migration guide

A successful migration from Teradata to BigQuery requires both technical expertise and solid project management. Using the best practices outlined in our guide will help you to effectively manage this complex transition. Here’s a structured approach to put it all together:

  1. Detailed migration planning: Develop a specific plan for the Teradata to BigQuery migration, which should include timelines, resource allocation, and stakeholder communication strategies. Begin by thoroughly documenting your current Teradata environment, understanding data dependencies, and setting clear objectives.
  2. Use column-level lineage for prioritization: Employ column-level lineage to identify and prioritize critical data assets. Start by migrating data consumption points such as BI tools and dashboards to BigQuery. This provides a seamless transition with minimal interruption to data access.
  3. Implement lift-and-shift strategy: Adopt a lift-and-shift approach in the initial phase of migration to ease the transition. Implement this approach to address the architectural and SQL dialect differences between Teradata and BigQuery.
  • Begin by transferring data in its current form, using a SQL translator tool like Datafold's, which can automate the conversion of SQL code from Teradata to BigQuery
  • Conduct a data diff with Datafold's cross-database diffing to quickly check for parity and confirm a 1-to-1 correspondence between your legacy Teradata database and the new BigQuery database
  • Consider a BigQuery migration service as part of the lift-and-shift strategy to simplify and speed up the process (and refactor once you've migrated off of Teradata)
  4. Get stakeholder approval: Once the migration process is stable, get approval from stakeholders. Use data diffing to demonstrate complete parity between the Teradata and BigQuery systems, assuring stakeholders of the migration's success.
  5. Deprecate old Teradata assets: In the final phase, issue deprecation notices for the old Teradata assets. Provide stakeholders with parity reports through data diffs to support their transition to the new BigQuery environment. You'll need to do some hand-holding to make sure every user moves to the new system without any issues. Then you can decommission Teradata without any interruptions.

Conclusion

If you're ready to learn more about how Datafold helps data teams accelerate migrations like Teradata to BigQuery, there are a few ways to get started:

  • Talk to a data migrations expert and tell us about your migration, tech stack, scale, and concerns. We’re here to help you understand if data diffing is a solution to your migration concerns.
  • For those who are ready to start playing with cross-database diffing today, we have a free trial experience of Datafold Cloud, so you can start connecting your databases as soon as today.

As we said in the beginning of this blog: migrations are hard, potentially years-long projects. Datafold is here to make them as automated as possible, so your team can focus on what matters: providing high quality data to your organization.