How to Migrate from Redshift to Snowflake
Snowflake has become one of the most popular tools in the modern data stack for data lakes, data warehouses, and general-purpose data workloads. As more and more organizations look to take advantage of Snowflake, data teams are increasingly migrating from established warehouses like Amazon Redshift.
In this post, we’ll walk through the steps for a typical data migration from Redshift to Snowflake. Then, we’ll look at how a tool like Datafold’s Data Diff can help with data validation and verification, protecting data teams from common pitfalls in data migrations.
How to approach a Redshift to Snowflake migration
The typical data migration from Redshift to Snowflake can be broken down into the following key steps:
- Connect to Redshift with Python.
- Migrate the Data Definition Language (DDL) for tables.
- Export the data out of Redshift to AWS S3.
- Import the data into Snowflake.
- Validate and verify the data migration (with Data Diff).
Let’s walk through these steps one by one.
Connect to Redshift with Python
The first step in any migration away from Redshift is setting up a connection so that you can automate the pulling of the DDL for tables in Redshift. One of the simplest ways to do this is with Python and a well-known package called psycopg2, which is used extensively for Postgres.
With connection information configured for the Redshift cluster, obtaining a connection using Python and psycopg2 is straightforward.
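A minimal sketch of that connection step might look like the following. The host, database, and credential values are placeholders; substitute your own cluster's details (Redshift's default port is 5439).

```python
# Sketch: connecting to Redshift with psycopg2 (Redshift speaks the
# Postgres wire protocol). All connection values below are placeholders.

def redshift_conn_params(host, dbname, user, password, port=5439):
    """Collect the connection parameters for a Redshift cluster."""
    return {"host": host, "dbname": dbname, "user": user,
            "password": password, "port": port}

def get_redshift_connection(**params):
    import psycopg2  # imported here so the helper above stays dependency-free
    return psycopg2.connect(**params)

params = redshift_conn_params(
    "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "analytics", "awsuser", "my-password")
# conn = get_redshift_connection(**params)  # requires a reachable cluster
```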
With a connection to Redshift established, the next step is to pull the DDL for the table(s) that need to be migrated and created inside Snowflake. There are, of course, some slight differences in syntax for items like <span class="code">timestamps</span> between the two systems, and these will need to be modified for use in Snowflake.
We will be migrating a sample table called <span class="code">sales_report</span> from Redshift to Snowflake. Shown below is the DDL for that table in each system. As you can see, the DDL statements are similar, with one slight variation: the <span class="code">DECIMAL</span> type in Redshift is called <span class="code">NUMBER</span> in Snowflake.
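As an illustration of that difference, here is a hypothetical DDL for the table along with a deliberately naive translation helper. The column names and types are assumptions for this example; a real migration would need a fuller type mapping.

```python
# Illustrative Redshift DDL for a hypothetical sales_report table.
redshift_ddl = """
CREATE TABLE sales_report (
    sales_id   INTEGER NOT NULL,
    amount     DECIMAL(10,2),
    updated_at TIMESTAMP
);
"""

def redshift_ddl_to_snowflake(ddl: str) -> str:
    """Naive type-translation sketch: Redshift DECIMAL -> Snowflake NUMBER.
    Real migrations need a fuller mapping (timestamps, encodings, etc.)."""
    return ddl.replace("DECIMAL", "NUMBER")

snowflake_ddl = redshift_ddl_to_snowflake(redshift_ddl)
```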
With a connection created to the Redshift cluster, a cursor can be created to query the information schema for the table(s) needed for the migration.
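One way to pull that metadata is to query <span class="code">information_schema.columns</span> through the open connection. The sketch below assumes <span class="code">conn</span> is the psycopg2 connection created earlier; the schema and table names are placeholders.

```python
# Sketch: pull column metadata for a migration table from Redshift's
# information schema. `conn` is assumed to be an open psycopg2 connection.

query = """
    SELECT column_name, data_type, character_maximum_length,
           numeric_precision, numeric_scale
    FROM information_schema.columns
    WHERE table_schema = %s AND table_name = %s
    ORDER BY ordinal_position;
"""

def fetch_table_columns(conn, schema, table):
    with conn.cursor() as cur:
        cur.execute(query, (schema, table))
        return cur.fetchall()

# columns = fetch_table_columns(conn, "public", "sales_report")
```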
With the DDL for the migration tables in hand, making any modifications needed and running the DDL create statements in Snowflake are straightforward tasks.
Export Data out of Redshift to S3
With our schemas created in Snowflake, we’re ready for the ingestion of data. The next step in the migration process is to export the data from each of the Redshift tables. This can be done easily with the <span class="code">UNLOAD</span> command.
The default format from the <span class="code">UNLOAD</span> command is pipe-delimited text files. The <span class="code">FORMAT AS</span> option can be used to specify other file types like Parquet, JSON, or CSV.
Two of the most commonly used file formats are CSV and Parquet. If the data size is small (each output file is capped at 6.2 GB), the <span class="code">PARALLEL OFF</span> option can be used to unload the data to a single file; otherwise, the <span class="code">UNLOAD</span> command creates as many files as there are slices on the Redshift cluster.
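Putting this together, an <span class="code">UNLOAD</span> statement can be issued through the same connection. The S3 bucket and IAM role ARN below are placeholders for your own values.

```python
# Sketch of an UNLOAD statement run through an open psycopg2 connection.
# The bucket name and IAM role ARN are placeholders.

unload_sql = """UNLOAD ('SELECT * FROM public.sales_report')
TO 's3://my-migration-bucket/sales_report/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
FORMAT AS CSV
HEADER
PARALLEL OFF;
"""

def unload_table(conn, sql=unload_sql):
    """Export the table to S3; Redshift writes the files server-side."""
    with conn.cursor() as cur:
        cur.execute(sql)

# unload_table(conn)  # requires a live connection and S3 permissions
```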
Import Data into Snowflake
Importing the S3 data from Redshift into Snowflake is a multistep process. First, we need to create a File Format Object in Snowflake that matches the files that were exported. In many instances, a suitable one will already exist in Snowflake. If CSV files were created, the new File Format Object would specify the CSV delimiter options.
The next step is to create a Named Stage Object that references the location of the Redshift migration files in their S3 bucket.
Lastly, we can run a <span class="code">COPY INTO</span> statement to push our files from the S3 bucket into each Snowflake table.
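The three statements can be scripted with the snowflake-connector-python package. A hedged sketch, in which the object names, bucket path, and credentials are all placeholders:

```python
# Sketch of the three Snowflake load steps. All names and credentials
# below are placeholders for this example.

snowflake_statements = [
    # 1. File format matching the CSV files unloaded from Redshift
    """CREATE OR REPLACE FILE FORMAT csv_format
         TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1;""",
    # 2. Named stage pointing at the S3 bucket holding the export
    """CREATE OR REPLACE STAGE redshift_stage
         URL = 's3://my-migration-bucket/sales_report/'
         CREDENTIALS = (AWS_KEY_ID = 'my-key-id' AWS_SECRET_KEY = 'my-secret')
         FILE_FORMAT = csv_format;""",
    # 3. Load the staged files into the target table
    """COPY INTO sales_report FROM @redshift_stage;""",
]

def load_into_snowflake(statements=snowflake_statements):
    import snowflake.connector  # pip install snowflake-connector-python
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="my_password",
        warehouse="my_wh", database="analytics", schema="public")
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)

# load_into_snowflake()  # requires a live Snowflake account
```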
Validate and Verify with Data Diff
With our data successfully migrated from Redshift and into Snowflake, we come to the most important part of any migration process: validating and verifying that all the data has been pulled and that it has been moved correctly.
Data validation is the most time-consuming and challenging part of a migration project; it is typically done by a large team manually writing many different SQL queries to verify row counts and values.
Data Diff can verify tables of all sizes—from hundreds to billions of records—providing row-level verification, monitoring and alerting, and even self-healing. Data Diff can be easily installed with Python’s pip.
Running <span class="code">data-diff</span> at the command line requires four arguments:
- <span class="code">DB1_URI</span>
- <span class="code">TABLE1_NAME</span>
- <span class="code">DB2_URI</span>
- <span class="code">TABLE2_NAME</span>
The database URI formats follow standard connection-string conventions. In the case of our example migration, we want to compare Redshift to Snowflake, so we specify each database's URI along with the corresponding table name.
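For automation, the CLI can be invoked from a script. The URIs below are illustrative placeholders; check the data-diff documentation for the exact connection-string format for your driver versions.

```python
# Sketch: invoking the data-diff CLI from Python (e.g., inside a CI job).
# Both URIs and all credentials are placeholders.
import subprocess

redshift_uri = ("redshift://awsuser:my-password@"
                "my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/analytics")
snowflake_uri = ("snowflake://my_user:my_password@my_account/analytics/PUBLIC"
                 "?warehouse=MY_WH&role=MY_ROLE")

# data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME
cmd = ["data-diff", redshift_uri, "public.sales_report",
       snowflake_uri, "SALES_REPORT"]

# subprocess.run(cmd, check=True)  # requires data-diff installed and live databases
```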
Using <span class="code">data-diff</span> at the command line is a powerful and easy way to automate and run various verifications and checks for data migrations. As a CLI, <span class="code">data-diff</span> is perfect for ad-hoc data verification as well as automated tasks built into CI/CD pipelines.
A common migration verification
When carrying out a large migration, missing even a single row can lead to inaccurate downstream results and mistrust in the migration process and new platform. Data Diff enables row-by-row comparisons for the initial migration as well as for ongoing data being loaded.
Two of the most common questions and problems related to data migrations are:
- Is the data between the two systems the same?
- If there are differences, exactly which rows are different?
Data Diff can easily answer these questions for migrations of both large and small datasets, even across many different tables.
Before running validation, it’s important to understand some of the command-line options available when using <span class="code">data-diff</span>:
- The <span class="code">bisection-factor</span> is the “segments per iteration.” For example, if a table has 1 million rows and we set the bisection factor to 10, then we will get 10 segments of 100,000 records each.
- The <span class="code">key-column</span> option indicates the primary key for a table.
- The <span class="code">threads</span> option lets you increase the number of worker threads used per database to improve performance.
- The <span class="code">min-age</span> and <span class="code">max-age</span> options restrict the diff to rows older or younger than a configurable threshold, such as <span class="code">5min</span> or <span class="code">2h</span>. Supported units include seconds, minutes, hours, days, and weeks.
- The <span class="code">update-column</span> option is an indicator for which column in the table holds the last updated timestamp for rows.
Let’s return to our example migration from Redshift to Snowflake. If our <span class="code">sales_report</span> table had 1 million records, we might run <span class="code">data-diff</span> with the following options:
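Assembled as a concrete invocation, those options might look like the sketch below. The URIs, table names, and column names are placeholders consistent with the earlier examples.

```python
# Sketch: a full data-diff invocation with the options discussed above.
# URIs, credentials, and column names are placeholders.
import subprocess

cmd = [
    "data-diff",
    "redshift://awsuser:my-password@my-cluster.example.com:5439/analytics",
    "public.sales_report",
    "snowflake://my_user:my_password@my_account/analytics/PUBLIC?warehouse=MY_WH",
    "SALES_REPORT",
    "--bisection-factor", "10",    # split ~1M rows into 10 segments
    "--key-column", "sales_id",    # primary key used to match rows
    "--columns", "amount",         # extra column(s) to compare
    "--update-column", "updated_at",
    "--min-age", "5min",           # ignore rows changed in the last 5 minutes
]

# subprocess.run(cmd, check=True)  # requires data-diff installed and live databases
```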
This command would compare the <span class="code">sales_report</span> table in Snowflake to that of Redshift, with a million records split into 10 segments of at most 100,000 records each. Data Diff will compare the <span class="code">sales_id</span> and <span class="code">amount</span> columns between the two tables, using <span class="code">sales_id</span> as the primary key column. Data Diff will also ignore rows inserted in the last five minutes.
For each row that is different, Data Diff will output the values for those columns passed in as <span class="code">key-column</span> and <span class="code">update-column</span> arguments, allowing for detailed verification and inspection of any possible migration issues.
The Benefits of Using Data Diff
It’s easy to see that adding a tool like Data Diff is essential for any large data migration. The ability to use a single tool to compare any number of tables during a complex migration saves engineering time and effort and reduces the possibility of missing or incorrect data validations.
Because Data Diff is easy to install and run as a CLI tool, integration into CI/CD and other data pipelines for automated execution is simple. Continuously running data validation commands over several days and weeks during a large migration can ensure any problems are found and addressed quickly.
More and more organizations are adopting Snowflake, and many of them are coming from data stacks dependent on Redshift. Although the steps for migrating data from Redshift to Snowflake are clear and simple, the main challenge is the data validation and verification that is necessary to confirm a successful migration. In the past, comprehensive data validation required an immense amount of engineering resources and time. Now, with Data Diff from Datafold, post-migration data validation can be automated, yielding high-quality results in a fraction of the time.
Find out more about Data Diff here.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.