The problem: The small, mighty, and jack-of-all-trades data team
Norm was hired as data person number two at Trainual, a company that centralizes employee policies, standardizes training processes, systemizes operations, and creates accountability at scale for organizations of all kinds. Trainual is an amazing company, with values like "no red tape", "make ideas happen", and "everyone has a key", which makes adopting useful tools much easier.
Over the last six months to a year, Trainual's data team has grown to three analysts and a data engineer. What does that mean for a small startup? It means that as Trainual grows, the entire data team wears several hats. Here is a short list of what Norm is responsible for:
- Define what supporting decision-making with trusted data actually looks like. Deliver on it.
- Determine what Trainual's data pipelines look like, and what they should become. Then evangelize best practices around dbt and data warehousing. For speed of execution, many models that Trainual's data team creates are "good enough".
- Incorporate smart, low-noise Slack alerting for dbt tests that need action, and route noisier alerts, such as third-party product status updates, into a separate channel.
What Trainual's data team pushed until later:
- Refactoring toward a tidier, more performant data warehouse.
- Some test coverage, including custom dbt tests
One task Norm was given was to build greater observability into the data pipelines to answer these and other questions:
- Is the data up-to-date?
- Is the data complete?
- Are fields within expected ranges?
- Is the null rate higher or lower than it should be?
- Has the schema changed?
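The questions above map to a handful of simple checks. As a minimal sketch (not Trainual's actual code, and using hypothetical in-memory rows in place of real warehouse tables), the checks might look like this in Python:

```python
# Hypothetical sample rows standing in for a production table and a dev table.
prod_rows = [
    {"id": 1, "plan": "pro", "seats": 10},
    {"id": 2, "plan": "basic", "seats": 3},
    {"id": 3, "plan": None, "seats": 5},
]
dev_rows = [
    {"id": 1, "plan": "pro", "seats": 10},
    {"id": 2, "plan": "basic", "seats": 3},
]

def row_count_delta(prod, dev):
    """Is the data complete? Compare row counts between environments."""
    return len(prod) - len(dev)

def null_rate(rows, column):
    """Is the null rate higher or lower than it should be?"""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def schema_changed(prod, dev):
    """Has the schema changed? Compare the column sets of each table."""
    return set(prod[0]) != set(dev[0])

def out_of_range(rows, column, lo, hi):
    """Are fields within expected ranges? Return the offending rows."""
    return [r for r in rows if r[column] is not None and not (lo <= r[column] <= hi)]
```

In practice these checks run as dbt tests or warehouse queries rather than in-memory Python, but the logic per question is the same.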
All along the way, Norm continually looked for tools that would deliver the most value with the least effort. One tool Norm searched for was something to compare datasets produced by new SQL code against production results. Norm wanted this tool largely to check data completeness, row counts from new code versus production, and schema changes.
While looking for a tool to compare datasets, Norm reviewed datacompy, a tool he had used previously in Databricks. At Trainual's company size, he could not justify Databricks or the infrastructure to run Scala for faster comparisons between larger datasets. He also tried the dbt package audit_helper, which didn't quite fit how Trainual wanted to incorporate comparisons into continuous development.
The task to find such a tool came up again when Norm made sweeping changes to underlying data connectors and he wanted to compare many tables from code against tables in production. After searching again, he kept seeing Datafold’s open source tool data-diff mentioned and tried it out.
Afterwards, Norm tried Datafold Cloud's Deployment Testing, which met the need: a table comparison tool that delivered more value than expected with little effort!
Without Datafold Cloud, I typically had to schedule one to two weeks of testing: manual ad-hoc queries to check counts, null values, and primary keys between dev and production. I was at a breaking point with how much wasted time and energy this was causing. I'm glad we got Datafold Cloud just in time!
The results: Accelerated deployment timelines
As a result of Datafold Cloud, Trainual's small but mighty team has seen incredible time savings around PR testing and reviews.
- Faster PR reviews: After adding a few more primary keys, analytics engineers could see in Datafold Cloud, at a glance, comparisons between Table A and Table B to check whether data is up to date and complete, fields are in expected ranges, null rates are as expected, and schemas have changed. This turned 30 minutes of manual testing into a few minutes, allowing the team to review and deploy with much greater efficiency (and confidence!).
- Reduction of manual testing: Trainual's data team operates more quickly, cutting approximately 1-2 days of ad-hoc testing after changes down to 20 minutes of at-a-glance, one-stop-shop data comparison. This is a considerable amount of time back for a lean data team with many different responsibilities and priorities.
All of the above with Datafold Cloud helps Trainual's data team to manage:
- 614 dbt models (and growing!)
- Over 1,000 source tables
- 561 gigabytes of data
- 2,252,454,238 rows of data
- 10-15 pull requests per month
Testing speed has increased by 20-30%! Trainual's data team feels confident they’ll continue to grow and scale with efficiency and confidence using the automated testing supported by Datafold Cloud.