Data Catalog – Build or Buy: Open Source or Closed Source?
Introduction
Constantly searching for data assets is a time-consuming and arduous task. That's why companies that maintain significant data assets will eventually want to catalog them. At some point, companies will have to determine how they go about this.
There are three ways to build a data catalog: in-house code development, implementation of an open-source solution, and buying a closed-source, all-in-one platform.
This guide focuses on the differences between implementing open-source and closed-source software. As you will see, a business must have a number of complex technical and organizational requirements in place if it wants to successfully launch an open-source data catalog.
Section 1: Requirements for Building a Data Catalog Solution from Open-Source
Introducing new digital initiatives inevitably comes with both foreseen and unforeseen organizational challenges. It’s important for businesses to accurately assess their capacity to manage such challenges given budgetary, time, and talent constraints. These are a few of the organizational requirements needed to maintain a successful open-source solution:
1. A Dedicated Team
The right people are required to maintain a robust data catalog using open source. This is easier said than done. Taking scripts, running them, and making sure all the components of the open-source solution work together can be a full-time job. Debugging is an arduous process, and employees will need to be highly skilled at it to keep the solution reliable. Even after deployment, several months of work lie ahead to make sure everything works, prolonging time to value.
Companies need to employ dedicated teams to maintain different environments and providers (e.g., data lakes and warehouses, BI dashboards and reports). After the metadata is collected, you need to document the existing data. If these teams are responsible for other in-house platforms as well, it can be a tricky balancing act.
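To give a sense of what "collecting the metadata" involves for just one source, here is a minimal sketch that harvests table and column metadata and lists the columns that still need documentation. It assumes a SQLite database purely so the example is self-contained; the function names and the shape of the catalog entries are hypothetical, and each real warehouse or BI tool would need its own connector.

```python
import sqlite3


def harvest_columns(conn):
    """Return one catalog entry per column of every user table in a SQLite database."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        for _cid, name, col_type, _notnull, _default, _pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            entries.append(
                {"table": table, "column": name, "type": col_type, "description": None}
            )
    return entries


def undocumented(entries):
    """Entries that still need a human-written description."""
    return [entry for entry in entries if not entry["description"]]


# Illustrative in-memory database; the table and columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
catalog = harvest_columns(conn)
todo = undocumented(catalog)
```

Every entry starts without a description, which is the point: automated harvesting only produces the skeleton, and the documentation work described above still has to be done by people.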
Companies should also consider the potential impact of pulling resources away from activities that drive the company’s main mission. For some companies, maintaining a data catalog fully aligns with their core business. But for many others, it would be a distraction from other pressing activities.
2. A Compliance Team With Bandwidth For Monitoring Data Tools
Organizations that choose to implement an open-source solution will need a compliance officer or team to ensure that PII is identified at the column level and that the organization stays in compliance with regulations such as GDPR and CCPA. Teams will need to research best practices and the latest regulatory requirements while conducting regular risk assessments. Finally, organizations may have to compete for resources internally to make changes to the system in order to stay compliant.
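As a rough illustration of column-level PII identification, the sketch below flags columns whose names suggest personal data. The patterns and category labels are hypothetical and far from complete. Name matching is only a first pass: a real compliance process would also inspect the data itself and have a person review the results.

```python
import re

# Hypothetical name patterns; a real deployment would need a reviewed, much longer list.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
    "national_id": re.compile(r"ssn|passport|national[-_]?id", re.IGNORECASE),
}


def flag_pii_columns(columns):
    """Map each (table, column) pair to the PII categories its name suggests."""
    flagged = {}
    for table, column in columns:
        hits = [label for label, pattern in PII_PATTERNS.items() if pattern.search(column)]
        if hits:
            flagged[(table, column)] = hits
    return flagged


# Illustrative column inventory; the table and column names are made up.
columns = [
    ("customers", "id"),
    ("customers", "Email_Address"),
    ("customers", "last_name"),
    ("orders", "total"),
]
flagged = flag_pii_columns(columns)
```

Here `flagged` contains the `Email_Address` and `last_name` columns and nothing else. Keeping such a pattern list current as regulations and schemas change is part of the ongoing workload described above.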
3. Time & Money To Configure The Solution According To The Required Standards
According to McKinsey, “two out of three large programs regularly exceed initial budgets, miss schedule estimates, and underdeliver against business objectives and benefits, often by significant margins.” In addition, the consulting firm found that “25 to 40 percent of programs exceed their budget or schedules by more than 50 percent.”
In other words, implementing an open-source data catalog is likely to consume more time and money than was initially planned for. And because organizations have competing priorities and little prior experience implementing such a solution, initial time and cost estimates are likely to be off.
4. Ability To Continuously Innovate
Staying innovative is essential to staying future-proof. But because the open-source data catalog is not the company’s only software or objective, it’s challenging to truly be on the cutting edge. There will always be competing initiatives for time, budget, and resources. Even the most innovative companies will struggle to reach the highest levels of innovation amid everything else going on. In contrast, a third-party vendor has the luxury of remaining fully focused on its solution while keeping its finger on the pulse of technological innovations and upgrading its software accordingly.
5. Organizational Commitment
Organizational commitment to maintaining the solution is essential, both from a DevOps and a user perspective. The department leading the open-source data catalog initiative will need buy-in from organizational heads as well as data engineers and analysts. The whole business division is more likely to embrace the new technology if there is evidence (whether through case studies or industry KPIs) that the new data catalog will help teams quickly and easily find the information they need. The problem is that open-source software often lacks documented proof of impact. This increases the organization’s perception of risk and may lead to skepticism among DevOps and users alike.
Section 2: The Advantages of Buying a Data Catalog Solution
As the previous section shows, the total cost of ownership of an open-source data catalog is significant, both technically and organizationally. After realistically assessing the requirements involved, many businesses prefer to leverage a ready-made solution.
In addition to eliminating the need to fulfill the aforementioned requirements, here are additional reasons industry leaders have cited for buying a closed-source, all-in-one solution:
1. Ability To Gain Access To Related Capabilities From The Same Platform
When organizations adopt an out-of-the-box data catalog solution, they usually get access to related data management tools, such as data lineage and data observability. Having one unified solution for data management allows organizations to handle different types of data problems, eliminating the need to manage separate in-house solutions. For example, you won’t have to extend the open-source project by developing a separate lineage solution.
With a unified solution, businesses can adopt the features they need to solve today’s challenges — and down the line, adopt additional features to meet tomorrow’s challenges. When they do, all capabilities will be integrated and available for use. This ensures that there is no scrambling to implement related features as they become needed.
2. Access To Enterprise-Grade Integrations Out Of The Box
Companies that buy a ready-made, closed-source solution instantly benefit from a set of stable, enterprise-grade integrations ready for use. For example, several vendors already have built-in integrations for data warehouses such as Snowflake, BigQuery, Redshift, and Azure. Beyond that, you can find dozens of trustworthy integrations with BI, ETL/ELT, and other data software tools, saving your team significant effort.
3. Dedicated Support Team Of Product Experts
Many open-source projects are implemented to an acceptable standard. But competing demands from other initiatives mean that continual improvement and innovation are unlikely. On the other hand, partnering with a certified vendor that specializes in perfecting a single product means benefiting from software that won’t get rusty over time. Having a dedicated customer success manager (CSM) to support and innovate has proven valuable for businesses at scale that need to stay competitive in a dynamic business and regulatory environment. If there’s an issue, there’s always someone to turn to for assistance. No debugging or constant adjustments are required from the development team over time.
4. Unlikely To Disappear
Technology vendors may close down, but it’s unlikely to be a sudden event with no prior warning. In contrast, open-source solutions often go dormant. If supported by a commercial company, the company’s priorities may change, or they may simply go out of business. If supported by individuals, they may tire of investing in the open-source tool and walk away. An established vendor with a closed-source, all-in-one solution is far more reliable and motivated to stay the course.
Section 3: Build or Buy? A Checklist
Ultimately, the decision whether to implement an open-source solution or buy a closed-source solution depends on the unique circumstances of the organization. This checklist can help assess whether your organization would benefit more from an open-source or a closed-source data catalog.
Ask yourself: Does your organization have –
Fewer checks suggest that buying a closed-source data catalog solution would provide a bigger ROI than implementing an open-source solution, which is inevitably fraught with risk.
Conclusion
Open-source or closed-source? It’s a perennial question. Maintaining open-source technology in-house may be preferable if there is no suitable solution on the market. In addition, there may be specific cases where data is very privacy-sensitive and everything is on-premises. In those cases, it might be advisable to use open source. However, in general, the data catalog solutions available today are secure, intuitive to use, and easy for each department to customize for its own purposes.
Even if it's possible to develop and implement an open-source solution, time to value is usually more important. There is much more to the equation than the upfront cost. An enterprise solution can get you started quickly rather than requiring you to spend time getting an open-source solution ready for production.
In today’s business climate, resources are scarce, and vast quantities of data threaten to deplete them. Investing in a comprehensive, extensible solution can shorten time to ROI more than any open-source solution could. It also provides a specialized solution that can be continuously enhanced by a dedicated partner, which in turn supports your data team in what they do best rather than distracting them from core business tasks.
Aggua has invested years in developing robust cloud data management capabilities that are secure, compliant, and easy for users to engage with. Aggua applies its deep domain expertise in the data space in general, and in data catalogs in particular, to creating a bird's-eye view of your data assets that cannot easily be replicated by even the most agile organizations.
About Aggua
Aggua is a data fabric platform that gives data and business teams access to their data, creating trust and delivering practical data insights.
Aggua developed the leading collaborative data management solution, built specifically for the modern data stack and for data teams with Snowflake, BigQuery, or Redshift warehouses at the heart of their architecture.
Our automated data catalog gives you a bird's-eye view of your data, along with automated column-level lineage from source to target. With our robust modeling and testing tools, you can confidently map out your data journey and ensure that your data is always accurate and up to date. And our usage oversight capabilities will help you keep costs under control, optimize performance, and enforce governance policies. If you're looking for a better way to manage your data, Aggua is the vendor of choice.