Not All Column-Level Lineage Is Created Equal
Companies are handling more data than ever before, and more people are handling that data than ever before. All this data and all these users have made data lineage a pressing issue. There are various methods of building data lineage – flow-based lineage, business process-based lineage, source-code analysis, and others. By far the most common approach today is metadata-based lineage, though the techniques for extracting that metadata differ in how granular and accurate they can get.
Column-level lineage (CLL) has significant advantages over table-level lineage: it offers a more granular view of the data by exposing the links between individual columns in different tables. This kind of intricate lineage can be a challenge to visualize, and since building column-level lineage by hand is a laborious, time-consuming process, a number of tools have been developed over the years to automate it.
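To make the distinction concrete, here is a minimal sketch of the two granularities. Table-level lineage records only table-to-table edges, while column-level lineage records edges between (table, column) pairs; all table and column names below are hypothetical examples, not taken from any particular tool.

```python
# Table-level lineage: we only know that one table feeds another.
table_lineage = {("raw_orders", "analytics_orders")}

# Column-level lineage: each edge links a specific source column
# to a specific target column (hypothetical names).
column_lineage = {
    (("raw_orders", "order_id"), ("analytics_orders", "order_id")),
    (("raw_orders", "amount_usd"), ("analytics_orders", "revenue")),
}

def upstream_columns(target, edges):
    """Return the source columns that feed a given (table, column) pair."""
    return {src for src, dst in edges if dst == target}

# Column-level lineage can answer: which column produced "revenue"?
print(upstream_columns(("analytics_orders", "revenue"), column_lineage))
```

With only the table-level edge, the best a tool could say is that `analytics_orders` depends on `raw_orders`; the column-level edges pinpoint exactly which column an anomaly flows through.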
Yet not all of these column-level lineage tools are the same. Some are better than others at accurately visualizing the relationships between columns at scale. Here, we aim to provide a suggested framework for evaluating the merits of any given column-level lineage solution.
What Characterizes a Powerful Column-Level Lineage Tool?
Finding the relationships between columns, identifying the original data source, understanding what changes have been made, and surfacing dependencies: these are just a few of the jobs that data engineers are required to do. According to some estimates, data analysts spend significant time sifting through data and finding connections between data points. Done manually, it can sometimes take weeks to trace the source of a data anomaly or connection.
As companies increase the amount of data they handle, and more employees are required to interact with that data, they naturally start to think about automating this process. Enter column-level lineage tools. These tools provide a map of the data’s path as it travels between upstream source systems (such as data lakes and warehouses) and downstream dependencies (such as BI dashboards and reports).
With column-level lineage tools, data engineers can easily identify the source-to-target of the anomaly and uncover what downstream dependencies may have been affected as a result. Ultimately, this record of how data moves supports the greater data observability goals that companies are increasingly prioritizing.
The best column-level lineage solutions fully visualize the data in one place.
According to a Gartner report, “the platform’s user interface can provide a single vantage point for any data movement that runs through that platform.” The report goes on to claim that the tool should account for “input parameters, runtime information and default values, and [parse] all of these elements to automate end-to-end data lineage delivery.” [source: Gartner, 2022]
Of course, some column-level lineage tools do a better job at mapping and parsing data at scale. In the next section, we’ll drill down into the specific features to look out for when considering an automated solution.
What Features Should a Column-Level Lineage Tool Include?
Companies today “are working hard to extract some business value from data lineage,” according to a McKinsey report. For example, businesses are “using it as a basis to simplify their data architecture or to spot unauthorized data-access points, or even to identify inconsistencies among data in different reports.” [Source: McKinsey & Company 2020, Optimizing data controls in banking]
There are a number of features and capabilities that can increase the probability of extracting value from column-level data lineage. Companies considering a data lineage tool should check if the solution includes the following:
● Automated lineage capture: The tool should map out the current lineage automatically, at least once a day, so it reflects the situation in near real time
● Metadata sources: The tool should expose its metadata sources, providing granular information about datasets' dependencies
● Verification of metadata sources: Systems that handle data, such as ETL software and database management tools, all create metadata. Bringing metadata sources into data lineage is critical to gaining visibility into how data is altered throughout its lifecycle.
● Accurate parsing: An essential part of the data lineage process is parsing, which refers to taking a query language and breaking it down into its components. The end-to-end data lineage needs to be parsed across systems, both upstream and downstream. This includes CRMs, ETLs, CDPs, data warehouses, data lakes, BI tools, and more. Only then can the data source and the links between systems be revealed.
● Extract metadata and lineage: Column-level lineage involves using forward and/or backward data flows throughout the system. This simplifies the process of mapping connections, relationships, and dependencies among systems and within data. For example, one can start from the CRM system and work forwards, or start from the analytics dashboards and work backwards.
● Real-time visibility: Users should be able to understand the changes made to the data. A dashboard that shows previous versions of tables and columns, or provides alerts when important changes are made, allows users to stay up-to-date. A real-time component also makes data cataloging and metadata management easier.
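The forward and backward flows described above amount to traversing a lineage graph in two directions. Here is a minimal sketch, assuming a toy graph of (system, column) pairs; the edge data and names are invented for illustration and do not come from any specific product.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: each maps a source (system, column)
# to a downstream (system, column).
EDGES = [
    (("crm", "contact_email"), ("warehouse", "dim_customer.email")),
    (("warehouse", "dim_customer.email"), ("bi", "churn_dashboard.email")),
    (("warehouse", "fct_orders.amount"), ("bi", "revenue_report.total")),
]

forward = defaultdict(set)   # node -> its downstream dependencies
backward = defaultdict(set)  # node -> its upstream sources
for src, dst in EDGES:
    forward[src].add(dst)
    backward[dst].add(src)

def traverse(start, graph):
    """Breadth-first search collecting every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Forward: start at the CRM and find every downstream dependency.
print(traverse(("crm", "contact_email"), forward))
# Backward: start at a dashboard column and find its original sources.
print(traverse(("bi", "churn_dashboard.email"), backward))
```

The same traversal answers both questions a lineage tool is asked most often: "what breaks downstream if I change this column?" (forward) and "where did this dashboard number come from?" (backward).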
Next-Generation Column-Level Lineage
Clearly, automating column-level lineage is a significant step up over time-consuming manual processes. It is also a step up from table-level lineage tools, which are less equipped to identify anomalies between systems.
At the same time, not all column-level lineage tools are created equal. Next-generation column-level lineage solutions are built for scale. They are able to support multiple users and frequent data changes without degradation in accuracy. Once a user connects to such a solution, every column is checked at regular intervals by an automated system.
In fact, data is mapped every time it moves from one system to another. The most sophisticated solutions, such as Aggua, rescan the data at least every 24 hours, and as often as every five minutes if the user configures it that way. All of this happens behind the scenes, without disrupting users’ activities.
Next-generation column-level lineage understands the dependencies between systems, and allows users to update the system and notify other users who are impacted by the changes. That way the organization stays informed on modifications in the data lifecycle.
Critically, it should have a top-notch parsing engine that makes fewer and fewer mistakes over time. Good lineage is more accurate than rudimentary solutions because it supports different SQL dialects, and it continuously improves through machine learning, handling more edge cases over time. An adequate lineage solution should serve as the single source of truth for all data, and that source of truth should keep getting better.
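To show what "parsing" means in this context, here is a deliberately toy sketch that maps each output column of a simple SELECT statement back to its source table and column. Real lineage engines use full dialect-aware SQL parsers rather than regular expressions; this handles only the simplest `SELECT col [AS alias], ... FROM table` shape, with hypothetical column names.

```python
import re

def parse_select(sql):
    """Toy parser: map each output column to its (table, source_column)."""
    m = re.match(r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)", sql,
                 re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported statement")
    table = m.group(2)
    mapping = {}
    for expr in m.group(1).split(","):
        # Split "source AS alias"; if there is no alias, reuse the source name.
        parts = re.split(r"\s+AS\s+", expr.strip(), flags=re.IGNORECASE)
        source, alias = parts[0], parts[-1]
        mapping[alias] = (table, source)
    return mapping

print(parse_select("SELECT amount_usd AS revenue, order_id FROM raw_orders"))
```

Even this tiny example shows why dialect support matters: expressions, subqueries, CTEs, and vendor-specific functions all break naive approaches, which is where a production-grade parsing engine earns its keep.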
Next-generation column-level lineage is also ultra-secure and meets data governance compliance standards. It deals only with metadata, not the data itself, ensuring the protection of personally identifiable information (PII). This guarantees that sensitive information is never compromised, without sacrificing visibility where it matters.
The Bottom Line: Choosing the Right Column-Level Lineage Tool is Key
To ensure robust lineage, companies will eventually want a solution at the column level, not merely the table level. Yet even within the broader category of column-level tools, some solutions are more powerful than others. Aggua, for example, is one such next-generation solution, providing a bird’s-eye view of data with some of the highest parsing accuracy in the industry.
Aggua integrates CRMs, ETLs, CDP, data warehouses, data lakes, BI tools, and many more technologies one would expect to find in a common data tech stack. With Aggua, you get end-to-end lineage for all these solutions –– not just a standalone dashboard. This holistic view of the data flow allows users to get much more out of the tool.
With Aggua’s automated column-level lineage solution, data teams can instantly access upstream and downstream dependencies. This helps teams break silos and solve data issues quickly.
Aggua has developed a proprietary data engine built specifically for Google BigQuery (ZetaSQL) and Snowflake (ANSI SQL) to provide precise lineage mapping. What sets it apart is that Aggua supports not only vanilla SQL but also each platform’s specific “dialect,” meaning it supports all the functionality provided by Snowflake and BigQuery. To learn more about Aggua’s data lineage tool and many other related capabilities, visit Aggua.io.