Our fourth Data Quality Meetup took place on May 20th, 2021 with 7 guest speakers and over 120 live participants. If you haven't had a chance to do so yet, check out our digest and recordings of the lightning talks from Data Quality Meetup# 3. You can RSVP for the upcoming session here.
By Gleb Mezhanskiy, Founder & CEO @ Datafold
What Breaks Data?
1. Changes in data that we do not control: These are the external factors that are typically beyond the control of data teams
2. Changes in code that transforms the data: These are typically internal changes executed from within the data team that may go on to impact internal stakeholders
The optimal approach to improving data quality is preventing its failure in the first place. Comparisons can be drawn to nuclear power generation, whose operation is inherently safer than other forms of power generation including fossil fuels and natural gas.
Similar to nuclear power generation, the cost of fixing data failures after deployment can be orders of magnitude higher than catching and addressing issues at the development stage. Effective risk mitigation is best achieved through front-loading measures such as change management to minimize the probability of a failure risk in production. Fixing issues in production can be significantly more expensive because part of the damage is already done, downstream systems may have been impacted due to error propagation, and to add to that - the challenge of fighting the existing business direction momentum to rectify the root cause.
The Best Data Quality Investment: Establish a robust change management process
We suggest three principles of a reliable change management process:
Version-control all code in your data products starting from event tracking, through warehousing and transformation all the way to BI. Most modern stacks in each of these segments allow easy version-control and it helps to isolate, track, and quickly revert those changes along with an audit (change management) trail.
When changes are made, both the change maker and affected stakeholders should be aware of the full impact of the change. That means being able to confidently answer questions such as:
Such visibility can be achieved with a combination of the following tools:
1. Data Assertions (e.g. great_expectations, dbt test, etc.) that help validate business assumptions about the data
2. Data Diff (e.g. Datafold’s Diff) that works like a “git diff” for data and provides a visual and statistical report on the changes in the underlying data
3. Data Lineage – visualizes column-level dependencies across tables and BI assets
Having the right tools is not enough. The only way to have a reliable change management process is to have a clear step-by-step playbook for implementing every change. That process also needs to be enforced: i.e. no one can deploy a change to production without following it.
By Jason Carey, Data Platform Technical Lead @ Lyft
What is Verity?
Data quality is a big challenge at Lyft’s scale. To help data producers and consumers such as data engineers, data scientists and analysts ensure the quality of their data, Lyft built Verity – a check-based system that monitors the quality of offline data with specific attention given to the semantic, or contextual, correctness of the data. From a high level, Verify works through a 3 step model:
The customer ETL pipeline feeds into the gRPC-based Verity API. Checks can be run in both blocking and non-blocking manner depending on the risk tolerance associated with potential downstream cascading impact. The “brains” is an executor which functions as a transpiler to feed into an execution engine like Presto. Finally, it all ties with all the standard notifiers to notify stakeholders of data irregularities.
By Michelle Ark, Senior Data Engineer @ Shopify
Shopify’s data platform operates at a large scale:
Over the last 8 months, Shopify has introduced a modeling tool – dbt – that enables the Data team to develop simple reporting data pipelines faster, while ensuring high data quality. To democratize data pipeline development at scale, Shopify standardized on dbt as a SQL modeling framework and BigQuery as the primary data warehouse. Since data quality is of critical importance, the team uses a combination of tools and processes to ensure data reliability. As dbt power users, Shopify data developers have been using dbt tests (data assertions) quite extensively. However, the team faced a number of challenges with that approach:
To complement dbt tests, the Shopify team extended the dbt toolkit internally with an in-house framework called Seamster that allows data developers to easily implement SQL unit tests.
A SQL unit test is similar in nature to classic software unit tests and is performed in three simple steps:
Note that unlike dbt tests or great_expectations/deequ assertions, SQL unit tests do not read actual production data!
SQL Unit Testing: Why Do It?
The central object in this framework is a ‘mock’ data model. It enables constructing input MockModels from static data with a schema - initialized from either a Python dictionary, Pandas dataframe, or in-line csv-style string to provide flexibility for the users. The actual MockModel is built from input MockModels using BigQuery. Actual and expected MockModels can then be compared to one another to assert equality or through any of the open-source suite of Great Expectations assertions.
By Caitlin Colgrove, Co-Founder and CTO @ Hex
As data products have evolved from the more static type reports to fully interactive highly complex data over the last few decades, the expectations and requirements around data security have evolved in equal measures.
Hex has been working on data security challenges that may be triggered by the users themselves - intentionally or accidentally:
One of the most common and dangerous attack vectors on systems that use SQL is SQL injections. And while this problem has been largely eradicated in the software world through built-in protections on the framework/database layer, it still poses a great risk to data applications that (a) primarily rely on SQL, and (b) are often built outside of professional software engineering teams.
Pitfalls of SQL Injection
Often, web apps that are designed to allow users to query simple information use SQL injection through a Python interface. A common pitfall of this is that users can intentionally or unintentionally cause data loss by adding in the “DROP TABLE” SQL statement in their query pointing to a valid table in the database. Using a similar technique, an attacker may also be able to change database passwords and obtain full access to the database.
Protecting your data application against SQL injections
1. Disallow arbitrary input: use drop-down instead of free text fields
2. Use PreparedStatements
3. Allowlisting / Whitelisting: checks the input against a specific set of constraints
Data Security Rule# 1: Don't Build it Yourself
Data security is extremely complex and the scope can quickly creep up. Caitlin suggests always looking out for tools and libraries that already support secure querying per your requirements, rather than building solutions in-house. For example: psycopg (query engines), JinjaSQL (templating libraries), Looker/Hex (BI tools).
By Alex Viana, VP of Data @ HealthJoy
The data team at HealthJoy serves a strategic and central role in decision-making. As a result of working with high-level stakeholders, the asks are often quite generic and vague, and require large-scale implementations. As this poses a significant risk of scope creep, mismatched expectations, and wasted resources, Alex and his team designed a process that allows them to reduce the uncertainty in strategic projects by “working backwards with data”.
The approach taken involved... “faking your data”. This meant using a mock dataset to quickly prototype a solution and then working backward from it together with the stakeholders. Such an approach helped draw a box around the deliverable, communicate that to the end-users, and then start actual implementation with real data only once the alignment is achieved. This approach worked well for HealthJoy by minimizing iteration costs and time.
Examples of Implementation
1. Company Annual Recurring Revenue (ARR) Model
2. Data-Driven Feature Development
Panelists (from left to right)
What is your point of view on investing first in reliable change management? Where have you found the highest ROI in your organizations in terms of improving trust and minimizing downtime?
How do you feel about enabling versus restricting and maintaining the right balance when scaling up?
How do you maintain a high speed of development in companies that operate within highly regulated industries such as Healthcare?
What are the things that data quality vendors and developers still have to deliver to the data community in terms of reliable data product building?
 As demonstrated in a research study published in The Lancet