Data Quality Meetup #4

Datafold Team

June 22, 2021

Our fourth Data Quality Meetup took place on May 20th, 2021 with 7 guest speakers and over 120 live participants. If you haven't had a chance to do so yet, check out our digest and recordings of the lightning talks from Data Quality Meetup# 3. You can RSVP for the upcoming session here.

Lightning Talks

The Best Data Quality Investment in 2021

By Gleb Mezhanskiy, Founder & CEO @ Datafold

What Breaks Data?

1. Changes in data that we do not control: These are the external factors that are typically beyond the control of data teams

  • App engineers change event tracking
  • Third-party vendor issues
  • Data infrastructure failures

2. Changes in code that transforms the data: These are typically internal changes executed from within the data team that may go on to impact internal stakeholders

  • Changes in business logic (e.g. SQL in ETL/BI)
  • Changes in database schemas
  • Discontinued tools, features, and functionalities

The optimal approach to improving data quality is preventing its failure in the first place. Comparisons can be drawn to nuclear power generation, whose operation is inherently safer than other forms of power generation including fossil fuels and natural gas[1].

Similar to nuclear power generation, the cost of fixing data failures after deployment can be orders of magnitude higher than catching and addressing issues at the development stage. Effective risk mitigation is best achieved through front-loading measures such as change management to minimize the probability of a failure risk in production. Fixing issues in production can be significantly more expensive because part of the damage is already done, downstream systems may have been impacted due to error propagation, and to add to that - the challenge of fighting the existing business direction momentum to rectify the root cause.

The Best Data Quality Investment: Establish a robust change management process

We suggest three principles of a reliable change management process:

  • Principle 1: Version-control everything

Version-control all code in your data products starting from event tracking, through warehousing and transformation all the way to BI. Most modern stacks in each of these segments allow easy version-control and it helps to isolate, track, and quickly revert those changes along with an audit (change management) trail.

  • Principle 2: Know the impact of every change

When changes are made, both the change maker and affected stakeholders should be aware of the full impact of the change. That means being able to confidently answer questions such as:

  1. How will the dataset/report I am modifying look after the changes?
  2. What other datasets/reports will be affected and how?
  3. Will there be any changes to downstream reporting and ML workflows?
  4. Do those changes require proactive refactoring, model retraining, or, at the very least, notifying stakeholders?

Such visibility can be achieved with a combination of the following tools:

1. Data Assertions (e.g. great_expectations, dbt test, etc.) that help validate business assumptions about the data

2. Data Diff (e.g. Datafold’s Diff) that works like a “git diff” for data and provides a visual and statistical report on the changes in the underlying data

3. Data Lineage – visualizes column-level dependencies across tables and BI assets

  • Principle 3: Design, implement, and enforce the process

Having the right tools is not enough. The only way to have a reliable change management process is to have a clear step-by-step playbook for implementing every change. That process also needs to be enforced: i.e. no one can deploy a change to production without following it.

Verity: Data Quality as a Service

By Jason Carey, Data Platform Technical Lead @ Lyft

Stack:

  • AWS S3 as data lakes
  • Apache Hive for meta storage
  • Spark and Presto as execution engines

Scale:

  • 110k datasets
  • 40 PB of data
  • 1100+ daily active users at Lyft

What is Verity?

Data quality is a big challenge at Lyft’s scale. To help data producers and consumers such as data engineers, data scientists and analysts ensure the quality of their data, Lyft built Verity – a check-based system that monitors the quality of offline data with specific attention given to the semantic, or contextual, correctness of the data. From a high level, Verify works through a 3 step model:

  1. Verity Check: Configure the checks to monitor your data by defining the query (what to count) and its conditions (expected value)
  2. Schedule: Orchestrate checks when data changes
  3. Notification: Get notified when a check fails

Architecture

The customer ETL pipeline feeds into the gRPC-based Verity API. Checks can be run in both blocking and non-blocking manner depending on the risk tolerance associated with potential downstream cascading impact. The “brains” is an executor which functions as a transpiler to feed into an execution engine like Presto. Finally, it all ties with all the standard notifiers to notify stakeholders of data irregularities.

Adoption

  • MVP completed in 2021Q1
  • 10+ internal teams comprising multiple different roles
  • 1,000+ checks currently in production
  • 65% of core data sets coverage e.g. uniqueness guarantees, volume guarantees

Targets

  • 100% data set coverage by 2021Q4
  • Transition to automated passive monitoring; away from human-built checks-based monitoring

Unit Tests for SQL with dbt & Python

By Michelle Ark, Senior Data Engineer @ Shopify

Shopify’s data platform operates at a large scale:

  • 200+ data scientists
  • 60+ data platform engineers
  • 1M+ data consumers (Shopify users)

Over the last 8 months, Shopify has introduced a modeling tool – dbt – that enables the Data team to develop simple reporting data pipelines faster, while ensuring high data quality. To democratize data pipeline development at scale, Shopify standardized on dbt as a SQL modeling framework and BigQuery as the primary data warehouse. Since data quality is of critical importance, the team uses a combination of tools and processes to ensure data reliability. As dbt power users, Shopify data developers have been using dbt tests (data assertions) quite extensively. However, the team faced a number of challenges with that approach:

  • Tests on production data take a lot of time and compute resources, slowing down the iteration cycle and inhibiting high cost
  • Difficult to isolate edge cases
  • Since the production data always changes, tests are prone to generate noise, or, on the opposite, miss important edge cases, if they weren’t present in the production data at the time the tests were written

To complement dbt tests, the Shopify team extended the dbt toolkit internally with an in-house framework called Seamster that allows data developers to easily implement SQL unit tests.

A SQL unit test is similar in nature to classic software unit tests and is performed in three simple steps:

  1. Define a mock dataset with as many failure and edge cases as possible
  2. Run production SQL code against the mock dataset
  3. Compare the output of the run with the expected result

Note that unlike dbt tests or great_expectations/deequ assertions, SQL unit tests do not read actual production data!

SQL Unit Testing: Why Do It?

  1. Static Data allows to test all edge cases and do it very fast, improving iteration speed and reducing compute costs.
  2. Regression Testing increases user confidence in change management: even with frequent updates by multiple collaborators, everyone is confident that changes won’t introduce regressions and break existing behavior

The central object in this framework is a ‘mock’ data model. It enables constructing input MockModels from static data with a schema - initialized from either a Python dictionary, Pandas dataframe, or in-line csv-style string to provide flexibility for the users. The actual MockModel is built from input MockModels using BigQuery. Actual and expected MockModels can then be compared to one another to assert equality or through any of the open-source suite of Great Expectations assertions.

Building Security Conscious Data Apps

By Caitlin Colgrove, Co-Founder and CTO @ Hex

As data products have evolved from the more static type reports to fully interactive highly complex data over the last few decades, the expectations and requirements around data security have evolved in equal measures.

Hex has been working on data security challenges that may be triggered by the users themselves - intentionally or accidentally:

  1. Denial of service: running a query large enough to crash the service
  2. Data loss or corruption
  3. Unauthorized data access

One of the most common and dangerous attack vectors on systems that use SQL is SQL injections. And while this problem has been largely eradicated in the software world through built-in protections on the framework/database layer, it still poses a great risk to data applications that (a) primarily rely on SQL, and (b) are often built outside of professional software engineering teams.

Pitfalls of SQL Injection

Often, web apps that are designed to allow users to query simple information use SQL injection through a Python interface. A common pitfall of this is that users can intentionally or unintentionally cause data loss by adding in the “DROP TABLE” SQL statement in their query pointing to a valid table in the database. Using a similar technique, an attacker may also be able to change database passwords and obtain full access to the database.

Protecting your data application against SQL injections

1. Disallow arbitrary input: use drop-down instead of free text fields

2. Use PreparedStatements

  • Allows you to separate out the query and values
  • Helps prevent strings that may represent SQL queries from being executed if the user puts them in
  • Built into most standard RDBMS (e.g. PostgreSQL)

3. Allowlisting / Whitelisting: checks the input against a specific set of constraints

  • Valid identifier characters ([a-zA-Z_0-9]) as opposed to PreparedStatements, which only work for the value part of the query
  • Is one of a small known set of keywords
  • Is an identifier within the schema
  • Parametrize ones not allowed by PreparedStatement (e.g. column name)

Data Security Rule# 1: Don't Build it Yourself

Data security is extremely complex and the scope can quickly creep up. Caitlin suggests always looking out for tools and libraries that already support secure querying per your requirements, rather than building solutions in-house. For example: psycopg (query engines), JinjaSQL (templating libraries), Looker/Hex (BI tools).

Fake It Till You Make It: A Backward Approach to Data Products

By Alex Viana, VP of Data @ HealthJoy

The data team at HealthJoy serves a strategic and central role in decision-making. As a result of working with high-level stakeholders, the asks are often quite generic and vague, and require large-scale implementations. As this poses a significant risk of scope creep, mismatched expectations, and wasted resources, Alex and his team designed a process that allows them to reduce the uncertainty in strategic projects by “working backwards with data”.

Working Backwards

The approach taken involved... “faking your data”. This meant using a mock dataset to quickly prototype a solution and then working backward from it together with the stakeholders. Such an approach helped draw a box around the deliverable, communicate that to the end-users, and then start actual implementation with real data only once the alignment is achieved. This approach worked well for HealthJoy by minimizing iteration costs and time.

Examples of Implementation

1. Company Annual Recurring Revenue (ARR) Model

  • Ask: Forecast the company’s future cash flow through ARR
  • Challenge: ARR is almost all corner cases
  • Method: Present fake/minimal ARR products to stakeholders to gauge usability and to discover corner and edge cases
  • Value: Identified pitfalls and bugs early on in the process through stakeholder input on the “fake data”, resulting in faster development

2. Data-Driven Feature Development

  • Ask: Help build a high-value feature around a new healthcare cost data set
  • Challenge: Data would not be live for another 18 months
  • Method: Fake the feature (instead of data) to figure out data constraints and wider considerations
  • Value: Focussed product discussions with product managers and owners a year before real data was available

Panel Discussion: Lessons from Shopify, Lyft, HealthJoy & Devoted Health

Panelists (from left to right)

  • Julia Schottenstein, Product Manager @ Fishtown Analytics
  • Alex Viana, VP of Data @ HealthJoy
  • Karla Goodreau, Director of Data Engineering @ Devoted Health
  • Jason Carey, Data Platform Technical Lead @ Lyft
  • Michelle Ark, Senior Data Engineer @ Shopify
  • Chad Sanderson, Head of Product, Data Platform @ Convoy
  • Caitlin Colgrove, Co-Founder and CTO @ Hex
  • Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold

Questions

What is your point of view on investing first in reliable change management? Where have you found the highest ROI in your organizations in terms of improving trust and minimizing downtime?

  • Understanding of the Data: In addition to the two main types of data failures, there may be a third type of problem. This typically arises at the user level due to the difference in the user’s understanding of what the data represents versus what it actually represents in reality. While these ad-hoc challenges are attempted to be addressed through collaboration tools, there may still be differences in understanding among the users themselves during the collaboration process.
  • Disconnect between data producers and consumers: The introduction of change may cause an inadvertent impact on downstream processes. Lyft has been working on enforcing a data culture and identifying and investing in the core data assets and then expanding out from there.
  • Controls: Shopify enables and empowers anyone to build data but restricts others from building on top of such data until/unless it becomes officially verified and endorsed. There is always a trade-off between full empowerment to develop versus having the right controls in place to protect the very models that are being built.

How do you feel about enabling versus restricting and maintaining the right balance when scaling up?

  • Graded Approach: Lyft has begun to implement tiers of data products/assets and taken on a graded approach for controls/restrictions based on the risk and consequences of their failures.
  • Review Process: When everyone has access to do everything, the cost of creating something new becomes lower than leveraging prior obtained data. With the explosion of information and blurred lines between data owners, it becomes a challenge to identify the source of truth. There is value in establishing a process where users can commit their deliverables and communicate their use to the organization and undergo a review process before any dependencies are set up.

How do you maintain a high speed of development in companies that operate within highly regulated industries such as Healthcare?

  • Early Decision Making: As a new startup, Devoted Health uses mono repos for everything and uses generated code (e.g. Protocol Buffers). This helps maintain a single source of truth and the ability to flag items as particularly sensitive.
  • Approvals: Lean into tooling functionalities (e.g. dbt) to ensure there are checks in place per health data control and security protocols such as HIPAA, PII, and PHI. If direct access to the data is required by other teams or stakeholders, HealthJoy is implementing a formal process for approvals and sign-offs required to grant those accesses.

What are the things that data quality vendors and developers still have to deliver to the data community in terms of reliable data product building?

  • Interoperability within niches: There is room for growth in focussing on specialized niche industries and domains. But the challenge arises in terms of interoperability between data tools and still being able to appeal to a larger market to maintain business viability.
  • Integrations with client data stack: Hex is working on delivering in a format that integrates with the maximum number of tools and processes within their client’s workflow without having to dig into every single permutation and combination of integrations with all the different tools out there.

[1] As demonstrated in a research study published in The Lancet