June 22, 2021

Data Quality Best Practices, Events

Data Quality Meetup #4

Lyft vs. Shopify in testing ETL at scale, using fake data to align your stakeholders, and how to avoid nuclear meltdowns in your data platform.

No items found.

Datafold Team

Our fourth Data Quality Meetup took place on May 20th, 2021 with 7 guest speakers and over 120 live participants. If you haven't had a chance to do so yet, check out our digest and recordings of the lightning talks from Data Quality Meetup# 3. You can RSVP for the upcoming session here.

Lightning Talks

The Best Data Quality Investment in 2021

By Gleb Mezhanskiy, Founder & CEO @ Datafold

What Breaks Data?

1. Changes in data that we do not control: These are the external factors that are typically beyond the control of data teams

App engineers change event tracking
Third-party vendor issues
Data infrastructure failures

2. Changes in code that transforms the data: These are typically internal changes executed from within the data team that may go on to impact internal stakeholders

Changes in business logic (e.g. SQL in ETL/BI)
Changes in database schemas
Discontinued tools, features, and functionalities

The optimal approach to improving data quality is preventing its failure in the first place. Comparisons can be drawn to nuclear power generation, whose operation is inherently safer than other forms of power generation including fossil fuels and natural gas[1].

Similar to nuclear power generation, the cost of fixing data failures after deployment can be orders of magnitude higher than catching and addressing issues at the development stage. Effective risk mitigation is best achieved through front-loading measures such as change management to minimize the probability of a failure risk in production. Fixing issues in production can be significantly more expensive because part of the damage is already done, downstream systems may have been impacted due to error propagation, and to add to that - the challenge of fighting the existing business direction momentum to rectify the root cause.

The Best Data Quality Investment: Establish a robust change management process

We suggest three principles of a reliable change management process:

Principle 1: Version-control everything

‍

Version-control all code in your data products starting from event tracking, through warehousing and transformation all the way to BI. Most modern stacks in each of these segments allow easy version-control and it helps to isolate, track, and quickly revert those changes along with an audit (change management) trail.

Principle 2: Know the impact of every change

When changes are made, both the change maker and affected stakeholders should be aware of the full impact of the change. That means being able to confidently answer questions such as:

How will the dataset/report I am modifying look after the changes?
What other datasets/reports will be affected and how?
Will there be any changes to downstream reporting and ML workflows?
Do those changes require proactive refactoring, model retraining, or, at the very least, notifying stakeholders?

Such visibility can be achieved with a combination of the following tools:

1. Data Assertions (e.g. great_expectations, dbt test, etc.) that help validate business assumptions about the data

2. Data Diff (e.g. Datafold’s Diff) that works like a “git diff” for data and provides a visual and statistical report on the changes in the underlying data

3. Data Lineage – visualizes column-level dependencies across tables and BI assets

Principle 3: Design, implement, and enforce the process

Having the right tools is not enough. The only way to have a reliable change management process is to have a clear step-by-step playbook for implementing every change. That process also needs to be enforced: i.e. no one can deploy a change to production without following it.

Verity: Data Quality as a Service

By Jason Carey, Data Platform Technical Lead @ Lyft

Stack:

AWS S3 as data lakes
Apache Hive for meta storage
Spark and Presto as execution engines

Scale:

110k datasets
40 PB of data
1100+ daily active users at Lyft

What is Verity?

Data quality is a big challenge at Lyft’s scale. To help data producers and consumers such as data engineers, data scientists and analysts ensure the quality of their data, Lyft built Verity – a check-based system that monitors the quality of offline data with specific attention given to the semantic, or contextual, correctness of the data. From a high level, Verify works through a 3 step model:

Verity Check: Configure the checks to monitor your data by defining the query (what to count) and its conditions (expected value)
Schedule: Orchestrate checks when data changes
Notification: Get notified when a check fails

‍

‍

Architecture

The customer ETL pipeline feeds into the gRPC-based Verity API. Checks can be run in both blocking and non-blocking manner depending on the risk tolerance associated with potential downstream cascading impact. The “brains” is an executor which functions as a transpiler to feed into an execution engine like Presto. Finally, it all ties with all the standard notifiers to notify stakeholders of data irregularities.

‍

Adoption

MVP completed in 2021Q1
10+ internal teams comprising multiple different roles
1,000+ checks currently in production
65% of core data sets coverage e.g. uniqueness guarantees, volume guarantees

Targets

100% data set coverage by 2021Q4
Transition to automated passive monitoring; away from human-built checks-based monitoring

Unit Tests for SQL with dbt & Python

By Michelle Ark, Senior Data Engineer @ Shopify

Shopify’s data platform operates at a large scale:

200+ data scientists
60+ data platform engineers
1M+ data consumers (Shopify users)

Over the last 8 months, Shopify has introduced a modeling tool – dbt – that enables the Data team to develop simple reporting data pipelines faster, while ensuring high data quality. To democratize data pipeline development at scale, Shopify standardized on dbt as a SQL modeling framework and BigQuery as the primary data warehouse. Since data quality is of critical importance, the team uses a combination of tools and processes to ensure data reliability. As dbt power users, Shopify data developers have been using dbt tests (data assertions) quite extensively. However, the team faced a number of challenges with that approach:

Tests on production data take a lot of time and compute resources, slowing down the iteration cycle and inhibiting high cost
Difficult to isolate edge cases
Since the production data always changes, tests are prone to generate noise, or, on the opposite, miss important edge cases, if they weren’t present in the production data at the time the tests were written

To complement dbt tests, the Shopify team extended the dbt toolkit internally with an in-house framework called Seamster that allows data developers to easily implement SQL unit tests.

A SQL unit test is similar in nature to classic software unit tests and is performed in three simple steps:

Define a mock dataset with as many failure and edge cases as possible
Run production SQL code against the mock dataset
Compare the output of the run with the expected result

Note that unlike dbt tests or great_expectations/deequ assertions, SQL unit tests do not read actual production data!

SQL Unit Testing: Why Do It?

Static Data allows to test all edge cases and do it very fast, improving iteration speed and reducing compute costs.
Regression Testing increases user confidence in change management: even with frequent updates by multiple collaborators, everyone is confident that changes won’t introduce regressions and break existing behaviour

The central object in this framework is a ‘mock’ data model. It enables constructing input MockModels from static data with a schema - initialized from either a Python dictionary, Pandas dataframe, or in-line csv-style string to provide flexibility for the users. The actual MockModel is built from input MockModels using BigQuery. Actual and expected MockModels can then be compared to one another to assert equality or through any of the open-source suite of Great Expectations assertions.

Building Security Conscious Data Apps

By Caitlin Colgrove, Co-Founder and CTO @ Hex

As data products have evolved from the more static type reports to fully interactive highly complex data over the last few decades, the expectations and requirements around data security have evolved in equal measures.

Hex has been working on data security challenges that may be triggered by the users themselves - intentionally or accidentally:

Denial of service: running a query large enough to crash the service
Data loss or corruption
Unauthorized data access

One of the most common and dangerous attack vectors on systems that use SQL is SQL injections. And while this problem has been largely eradicated in the software world through built-in protections on the framework/database layer, it still poses a great risk to data applications that (a) primarily rely on SQL, and (b) are often built outside of professional software engineering teams.

Pitfalls of SQL Injection

Often, web apps that are designed to allow users to query simple information use SQL injection through a Python interface. A common pitfall of this is that users can intentionally or unintentionally cause data loss by adding in the “DROP TABLE” SQL statement in their query pointing to a valid table in the database. Using a similar technique, an attacker may also be able to change database passwords and obtain full access to the database.

Protecting your data application against SQL injections

1. Disallow arbitrary input: use drop-down instead of free text fields

2. Use PreparedStatements

Allows you to separate out the query and values
Helps prevent strings that may represent SQL queries from being executed if the user puts them in
Built into most standard RDBMS (e.g. PostgreSQL)

3. Allowlisting / Whitelisting: checks the input against a specific set of constraints

Valid identifier characters ([a-zA-Z_0-9]) as opposed to PreparedStatements, which only work for the value part of the query
Is one of a small known set of keywords
Is an identifier within the schema
Parametrize ones not allowed by PreparedStatement (e.g. column name)

Data Security Rule# 1: Don't Build it Yourself

Data security is extremely complex and the scope can quickly creep up. Caitlin suggests always looking out for tools and libraries that already support secure querying per your requirements, rather than building solutions in-house. For example: psycopg (query engines), JinjaSQL (templating libraries), Looker/Hex (BI tools).

Fake It Till You Make It: A Backward Approach to Data Products

By Alex Viana, VP of Data @ HealthJoy

The data team at HealthJoy serves a strategic and central role in decision-making. As a result of working with high-level stakeholders, the asks are often quite generic and vague, and require large-scale implementations. As this poses a significant risk of scope creep, mismatched expectations, and wasted resources, Alex and his team designed a process that allows them to reduce the uncertainty in strategic projects by “working backwards with data”.

Working Backwards

The approach taken involved... “faking your data”. This meant using a mock dataset to quickly prototype a solution and then working backward from it together with the stakeholders. Such an approach helped draw a box around the deliverable, communicate that to the end-users, and then start actual implementation with real data only once the alignment is achieved. This approach worked well for HealthJoy by minimizing iteration costs and time.

Examples of Implementation

1. Company Annual Recurring Revenue (ARR) Model

Ask: Forecast the company’s future cash flow through ARR
Challenge: ARR is almost all corner cases
Method: Present fake/minimal ARR products to stakeholders to gauge usability and to discover corner and edge cases
Value: Identified pitfalls and bugs early on in the process through stakeholder input on the “fake data”, resulting in faster development

2. Data-Driven Feature Development

Ask: Help build a high-value feature around a new healthcare cost data set
Challenge: Data would not be live for another 18 months
Method: Fake the feature (instead of data) to figure out data constraints and wider considerations
Value: Focussed product discussions with product managers and owners a year before real data was available

Panel Discussion: Lessons from Shopify, Lyft, HealthJoy & Devoted Health

Panelists (from left to right)

Julia Schottenstein, Product Manager @ Fishtown Analytics
Alex Viana, VP of Data @ HealthJoy
Karla Goodreau, Director of Data Engineering @ Devoted Health
Jason Carey, Data Platform Technical Lead @ Lyft
Michelle Ark, Senior Data Engineer @ Shopify
Chad Sanderson, Head of Product, Data Platform @ Convoy
Caitlin Colgrove, Co-Founder and CTO @ Hex
Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold

Questions

What is your point of view on investing first in reliable change management? Where have you found the highest ROI in your organizations in terms of improving trust and minimizing downtime?

Understanding of the Data: In addition to the two main types of data failures, there may be a third type of problem. This typically arises at the user level due to the difference in the user’s understanding of what the data represents versus what it actually represents in reality. While these ad-hoc challenges are attempted to be addressed through collaboration tools, there may still be differences in understanding among the users themselves during the collaboration process.
Disconnect between data producers and consumers: The introduction of change may cause an inadvertent impact on downstream processes. Lyft has been working on enforcing a data culture and identifying and investing in the core data assets and then expanding out from there.
Controls: Shopify enables and empowers anyone to build data but restricts others from building on top of such data until/unless it becomes officially verified and endorsed. There is always a trade-off between full empowerment to develop versus having the right controls in place to protect the very models that are being built.

How do you feel about enabling versus restricting and maintaining the right balance when scaling up?

Graded Approach: Lyft has begun to implement tiers of data products/assets and taken on a graded approach for controls/restrictions based on the risk and consequences of their failures.
Review Process: When everyone has access to do everything, the cost of creating something new becomes lower than leveraging prior obtained data. With the explosion of information and blurred lines between data owners, it becomes a challenge to identify the source of truth. There is value in establishing a process where users can commit their deliverables and communicate their use to the organization and undergo a review process before any dependencies are set up.

How do you maintain a high speed of development in companies that operate within highly regulated industries such as Healthcare?

Early Decision Making: As a new startup, Devoted Health uses mono repos for everything and uses generated code (e.g. Protocol Buffers). This helps maintain a single source of truth and the ability to flag items as particularly sensitive.
Approvals: Lean into tooling functionalities (e.g. dbt) to ensure there are checks in place per health data control and security protocols such as HIPAA, PII, and PHI. If direct access to the data is required by other teams or stakeholders, HealthJoy is implementing a formal process for approvals and sign-offs required to grant those accesses.

What are the things that data quality vendors and developers still have to deliver to the data community in terms of reliable data product building?

Interoperability within niches: There is room for growth in focussing on specialized niche industries and domains. But the challenge arises in terms of interoperability between data tools and still being able to appeal to a larger market to maintain business viability.
Integrations with client data stack: Hex is working on delivering in a format that integrates with the maximum number of tools and processes within their client’s workflow without having to dig into every single permutation and combination of integrations with all the different tools out there.

[1] As demonstrated in a research study published in The Lancet

‍