Customer story

Flatiron Health migrated Terabytes of data from Postgres to Snowflake using PeerDB

Flatiron Health is a healthtech company expanding the possibilities for point of care solutions in oncology and using data for good to power smarter care for every person with cancer. Through machine learning and AI, real-world evidence, and breakthroughs in clinical trials, Flatiron continues to transform patients’ real-life experiences into knowledge and create a more modern, connected oncology ecosystem.

Flatiron is Transitioning from Postgres to Snowflake to Centralize Data Discovery and Optimize Compute/Storage Costs

Historically, Flatiron has standardized on PostgreSQL as the go-to database for storing and analyzing clinical research datasets. As the sizes and types of datasets increased, they became more costly to store and harder to find across different company databases, each needing its own access rules.
To centralize the discovery of and permissions for all Flatiron datasets in one place, and to reduce compute costs associated with paying for PostgreSQL instances even when they are not in use, Flatiron has initiated efforts to migrate data from PostgreSQL to Snowflake. This strategy enables Flatiron to utilize Snowflake's pay-as-you-go pricing model for computing and centralized access management.

Challenge: Migrate Terabytes of Data with Security, Performance, and Ease of Use

Flatiron datasets have accumulated over multiple years, subject to retention requirements for compliance and data lineage reasons. Finding a cost-effective solution for migrating data from PostgreSQL to Snowflake that would meet these requirements proved challenging.
  • Datasets contain protected health information (PHI) and must remain within their closed network.
  • Have good performance as Flatiron have terabytes of data spread across thousands of tables.
  • Be easy to use, requiring minimal engineering resources
Most Data Movement and ETL tools Flatiron evaluated failed to meet their requirements because they were either fully hosted services requiring extensive security review, or they couldn't handle multi-TB migrations reliably at scale. These tools were generalized and not specifically built for Postgres to Snowflake migrations.

Flatiron used PeerDB to Securely migrate Terabytes from Postgres to Snowflake

Flatiron discovered PeerDB through its content on Postgres to Snowflake replication. They landed on PeerDB Open Source, which provided easy-to-use Docker images that allowed for the installation of PeerDB within minutes. Within a few hours, they deployed PeerDB on an AWS EC2 instance in their private network and established connectivity between their source PostgreSQL database and the private link Snowflake environment. They ran a proof-of-concept, migrating a few tables with 100s of GBs of data from Postgres to Snowflake, and verified that data was accurately replicated. This helped prove that PeerDB met Flatiron’s requirements.
Here's how PeerDB was able to deliver value to Flatiron:

PeerDB Can Run in Fully Private and Secure Environments

PeerDB provides Open Source and Enterprise options to operate within customers' private environments, securing data within Flatiron’s network. Flatiron chose PeerDB Open Source for the migration and intends to expand usage to PeerDB Enterprise which comes with production-grade Helm charts with features like HA, autoscaling, air-gapped environment support and round the clock 24/7 support, for other use cases.

PeerDB’s SQL Layer and UI helped Flatiron easily manage the migration

PeerDB provides a Postgres-compatible SQL Layer and a Simple UI to manage replication from Postgres to Snowflake. Flatiron used the SQL Layer to create and manage Peers (data sources) and MIRRORS (replication jobs).

They wrote Python scripts against PeerDB's Query layer to automate the creation of Peers and MIRRORS. Through this, they were able to smoothly manage the migration of multiple Postgres databases with 500 schemas and 35,000 tables. Without the SQL Query Layer, using UI or REST APIs would have made it difficult and unintuitive to manage these many database objects.
### Sample Python script for managing PeerDB to migrate data from Postgres to Snowflake.

# src_full_table_names: list of fully qualified table names (["schema.table_name", ...]) to replicate from source to destination

formatted_table_names = ",\n".join(f"{table_name}:{table_name}" for table_name in src_full_table_names) + "\n"

query = f"""
CREATE MIRROR {mirror_name}
FROM {self.source_peer} TO {self.target_peer}
WITH TABLE MAPPING
(
  {formatted_table_names}
)
WITH (
  do_initial_copy = true,
  snapshot_sync_mode='avro',
  snapshot_num_rows_per_partition = {cdc_config.snapshot_num_rows_per_partition},
  snapshot_max_parallel_workers = {cdc_config.snapshot_max_parallel_workers},
  snapshot_num_tables_in_parallel = {cdc_config.snapshot_num_tables_in_parallel},
  initial_copy_only = true
);
"""


PeerDB's UI
provided operational visibility throughout the migration process. Flatiron could monitor the status of migration per table, analyze logs, and set up alerts for errors. They were also able to monitor the load on the Postgres databases, analyzing wait events, monitoring the number of connections, and more.
“PeerDB made it extremely easy to migrate PostgreSQL data to Snowflake. Our team managed to quickly set up and begin using PeerDB in our local environment within an hour. In just a week, we successfully migrated 35,000 tables, totaling 2.5TB in size, saving a substantial amount of engineering resources it would take to develop an in-house solution. The PeerDB team was also very helpful in providing support throughout our migration journey and addressed our feature requests with a short turnaround time.” - Daniel Tatarkin, Software Engineer, Flatiron Health

Flatiron migrated 2.5TB from Postgres to Snowflake within a day


Through PeerDB's Parallel Snapshotting feature, Flatiron was able to migrate 2.5TB from Postgres to Snowflake within a day. Parallel Snapshotting is a unique feature to PeerDB that enables 2x to 16x faster data-movement compared to other ETL tools. Other tools would have taken several days to a few weeks to migrate 2.5TB of data.

Try out PeerDB

We hope you enjoyed reading this customer story. If you have a use case similar to Flatiron Health and want to use PeerDB to replicate data from Postgres to Snowflake:
About
Flatiron Health is a healthtech company expanding the possibilities for point of care solutions in oncology and using data for good to power smarter care for every person with cancer. Through machine learning and AI, real-world evidence, and breakthroughs in clinical trials, Flatiron continues to transform patients’ real-life experiences into knowledge and create a more modern, connected oncology ecosystem.
Use case
Healthcare, Biotechnology Research, Data Migrations
Solutions
Postgres to Snowflake CDC
Authors
Daniel Tatarkin, Software Engineer, Flatiron
Sai Srirampur, Co-founder, PeerDB
PeerDB
© 2024 PeerDB. All rights reserved.
By clicking “Accept All Cookies”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.