How the Linux Foundation improved data quality with Datafold to accelerate open source community building
Introduction
The Linux Foundation occupies a unique role as a hub for collaboration and innovation across a diverse range of projects in the open-source community. Their extensive network of events, chapters, and community engagements serves as a catalyst for the advancement of open source initiatives. By integrating Datafold’s data diffing capabilities into their continuous integration (CI) process and dbt project, the Foundation enhanced data quality, streamlined workflows, and made informed, data-driven decisions that bolstered their community outreach efforts and supported their mission of fostering open source collaboration.
Datafold was key to understanding the scope of changes made and gave the reviewers visibility into the downstream impact.
Linux Foundation
The Linux Foundation is a non-profit organization dedicated to the growth and development of Linux and other open-source technologies. It provides a neutral collaboration platform, enables enterprise adoption, and cultivates diverse and inclusive communities. The Foundation promotes and supports Linux development, offers standardization services, and provides legal protection, while also serving as a hub for open-source projects that facilitates collaboration, community building, and ecosystem curation.
The challenge: Modernizing their data quality platform
1. Challenges with a data lake architecture
Prior to implementing Datafold, the Linux Foundation grappled with significant pain points around the lack of controls and visibility within their data lake. This not only slowed pull request (PR) reviews for work built on the existing data lake, but also prevented the team from effectively managing and governing their data. Because the Foundation could not see how code and data changes would affect their downstream assets, their previous data lake initiative ultimately failed.
2. Low quality metrics
Unlike many organizations that focus solely on their internal operations, the Linux Foundation serves as a central hub supporting a multitude of open source projects, which operate like independent entities with their own stakeholders. One key way that the Foundation supports these projects is through the generation of metrics and data insights that allow project stakeholders to assess the health and performance of their projects.
These metrics included key performance indicators such as sales, memberships, event attendance, code contributions, and training participation. The Foundation faced challenges in effectively managing and governing the data necessary to generate these metrics. Inadequate visibility into data changes, inconsistencies in data quality, and inefficiencies in data review processes hindered their ability to provide accurate and reliable metrics to project stakeholders.
The solution: Shifting data testing left with Datafold
The Linux Foundation wanted a solution with a "shift-left" approach that moved data quality practices closer to the "point of origin": the code written by the developer who authors a PR. They also wanted an approach that aligned with their broader organizational objectives around data governance and security, one that proactively managed risk and detected issues early. This led them to implement Datafold for their data quality needs.
Reviewing any downstream impact of changes with Data Diffs
Datafold provided the Linux Foundation with comprehensive visibility into the changes introduced through pull requests within their data modeling processes. By leveraging Datafold's capabilities, the Foundation's team gained deeper insights into the scope and impact of proposed changes, allowing for more informed decision-making during the review process.
Whenever the team opened a new PR with code changes, they could quickly view Data Diff’s summary statistics and value-level diffs to assess potential downstream issues. Datafold also offered a way to view changes in greater detail. For example, the Data Explorer’s column-level lineage, built from the query logs in their new Snowflake data warehouse, helped the Foundation identify the downstream tables and dashboards that would be affected if the code in the PR were deployed to production.
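To make the idea of summary statistics and value-level diffs concrete, here is a minimal sketch in plain Python. It is not Datafold's implementation or API. The `diff_tables` function, the `DiffSummary` class, and the event-attendance rows are all hypothetical, and the sketch assumes both versions of the table fit in memory and share a single primary key column.

```python
from dataclasses import dataclass, field


@dataclass
class DiffSummary:
    """Hypothetical container for the results of comparing two table versions."""
    added_keys: list = field(default_factory=list)      # keys only in the PR build
    removed_keys: list = field(default_factory=list)    # keys only in production
    changed_values: list = field(default_factory=list)  # (key, column, old, new)
    unchanged_rows: int = 0


def diff_tables(prod_rows, pr_rows, key):
    """Compare two lists of row dicts that share the primary key column `key`."""
    prod = {row[key]: row for row in prod_rows}
    pr = {row[key]: row for row in pr_rows}

    summary = DiffSummary()
    summary.added_keys = sorted(pr.keys() - prod.keys())
    summary.removed_keys = sorted(prod.keys() - pr.keys())

    # Value-level comparison for rows present in both versions.
    for k in sorted(prod.keys() & pr.keys()):
        old, new = prod[k], pr[k]
        row_changes = [
            (k, col, old.get(col), new.get(col))
            for col in sorted(old.keys() | new.keys())
            if col != key and old.get(col) != new.get(col)
        ]
        if row_changes:
            summary.changed_values.extend(row_changes)
        else:
            summary.unchanged_rows += 1
    return summary


# Hypothetical event-attendance data: production vs. the PR's build of the model.
prod_rows = [
    {"event_id": 1, "attendees": 120},
    {"event_id": 2, "attendees": 85},
    {"event_id": 3, "attendees": 40},
]
pr_rows = [
    {"event_id": 1, "attendees": 120},
    {"event_id": 2, "attendees": 90},
    {"event_id": 4, "attendees": 15},
]

summary = diff_tables(prod_rows, pr_rows, key="event_id")
print(summary)
```

In this toy example a reviewer would see one added row, one removed row, and one changed value in `attendees`, which is the kind of signal that prompts a closer look before merging. A warehouse-scale tool would push this comparison down into SQL rather than load rows into memory.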
Streamlining review processes
Datafold’s intuitive UI and the bot comments posted on every PR made it easy for reviewers to analyze the downstream impact of changes, which shortened review time and unlocked greater development velocity.
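The downstream impact analysis that reviewers relied on can be pictured as a walk over a lineage graph. The sketch below is an illustration only: it works at table level rather than column level, the `downstream_assets` function is hypothetical, and the model and dashboard names are invented for the example rather than taken from the Foundation's dbt project.

```python
from collections import deque


def downstream_assets(lineage, changed):
    """Return every asset reachable from `changed` in a lineage graph.

    `lineage` maps each asset name to the assets that read from it directly.
    """
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # The changed asset itself is not "downstream", so exclude it.
    return sorted(seen - {changed})


# Hypothetical lineage: a staging model feeds a fact table, which feeds
# a metrics model and two dashboards.
lineage = {
    "stg_events": ["fct_event_attendance"],
    "fct_event_attendance": ["project_health_metrics", "events_dashboard"],
    "project_health_metrics": ["community_dashboard"],
}

impacted = downstream_assets(lineage, "stg_events")
print(impacted)
```

A breadth-first traversal is used here because it visits each asset once and tolerates cycles. Surfacing the resulting list in a PR comment is one way a reviewer could see at a glance which tables and dashboards a change might touch.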
The outcome: A "slam dunk" for greater data transparency
The Linux Foundation recognized Datafold as a “slam dunk” solution for addressing their needs and advancing their data quality initiatives.
Improved data quality
By providing comprehensive visibility and impact analysis capabilities, Datafold improved the quality of data within the Linux Foundation's infrastructure. This ensured that the data used for community outreach initiatives, such as event management and social media engagement, was accurate and reliable, empowering project stakeholders to make informed decisions and take proactive measures to address any challenges or red flags that may arise.
Supercharged the Foundation’s community outreach
Datafold streamlined the process of managing events by facilitating more efficient review and approval of pull requests related to event data. This allowed the Linux Foundation to quickly and accurately update event information, register attendees, and track engagement metrics, ultimately enhancing the success of their community events and outreach efforts.
Datafold also fostered greater collaboration among team members involved in community outreach initiatives by providing a centralized platform for reviewing and approving data changes. This improved communication and coordination, ensuring that all stakeholders were aligned on data-related decisions and enabling the Foundation to more effectively engage with its community members.
Cultivating a data-driven culture
Because Datafold enabled the Foundation’s data team to view the downstream impact of any code change, developers were empowered to manage and govern data assets effectively. Over time, the Foundation began incorporating Datafold into the onboarding and training of new developers as part of their modernized data quality culture. This fostered data-driven decision-making within the organization, aligning with broader objectives related to modernization and data quality improvement.
Better data quality is a slam dunk for community-building
As a leader in the open source Linux ecosystem, the Linux Foundation is committed to supporting community-building, technical innovation, and growth through its collaborative events across the world. Datafold’s capability to automate value-level data diffs with every pull request not only ensures that the Foundation supports its downstream partners with high-quality metrics, but also shifts the team’s focus away from resolving data discrepancies and towards fostering greater collaboration and innovation within their communities.