
5 Guidelines for Building a Successful Data Catalog


Thought + planning = a worry-free environment. Once you've thought about what data catalog you want to use, you can start to focus on making it work for your business.


At times, the search for a perfect data catalog can feel like hunting for a needle in a haystack. Each stakeholder has an equally demanding and disparate set of requirements for success. Where business analysts want a slick, refined, and easily navigated UI with simple export capabilities, data scientists might refuse to accept anything that does not allow custom-tailored queries, connections to their favorite notebook, and unfettered access to all of the data that has ever existed in the data lake. Meanwhile, the security group wants none of this: exposing the data at all is a non-starter.

This leaves you — the tech visionary who has a stable of cutting-edge vendors at the ready and a five-year rollout plan to go with them — stuck in neutral.

Before you resort to breaking out the floppy disks in protest, here are five guidelines for building a successful data catalog that can help your business succeed without compromising your stakeholders.

1. Reduce Overhead Through Open Access

Open access is the foundation of any successful data catalog. Demand for a data catalog most commonly emerges from the desire for a more intuitive, less burdensome way to access available data. Any solution that does not address this fundamental need is bound to have difficulty gaining business support.

For users, open access provides the value that is sorely lacking from traditional data lakes: efficient, accurate, and personalized access to data, no matter where that data may originate.

For administrators, open access can dramatically reduce the overhead that comes from routine requests and maintenance of audit histories and ticketing systems. A good catalog should automate and manage these functions.
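To make the automation point concrete, here is a minimal sketch of how a catalog might record an audit trail as a side effect of every access, rather than through manual tickets. The names `DataCatalog`, `audited`, and `audit_log` are illustrative inventions, not a real catalog API, and the in-memory list stands in for a durable audit store.

```python
# Hypothetical sketch: an access decorator that produces audit records
# automatically, removing the routine-request overhead from admins.
import functools
import time

audit_log = []  # a real deployment would write to a durable store

def audited(fn):
    """Record who accessed which dataset, and when."""
    @functools.wraps(fn)
    def wrapper(self, user, dataset, *args, **kwargs):
        audit_log.append({"user": user, "dataset": dataset,
                          "action": fn.__name__, "ts": time.time()})
        return fn(self, user, dataset, *args, **kwargs)
    return wrapper

class DataCatalog:
    def __init__(self):
        self._datasets = {"sales_2018": "/data/lake/sales_2018"}

    @audited
    def read(self, user, dataset):
        # Open access: any registered dataset resolves for any user,
        # and the audit record is captured without extra process.
        return self._datasets[dataset]

catalog = DataCatalog()
path = catalog.read("analyst1", "sales_2018")
```

With this pattern, the audit history is a by-product of normal use, so administrators review logs instead of fielding tickets.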

Value: Open access to any data owned or used by consumers of the catalog creates fast time-to-value for users and greater efficiency for administrators.

2. Protect Data Through Governance and Security

Security in the data catalog often creates a significant conundrum. While providing open, transparent access to data is paramount to success, that access can also create security threats that will quickly shut the project down. Although security groups at your organization may not have an active role in using the data catalog, they certainly have a vested interest in keeping the organization protected from both external and internal threats. How can you balance these two seemingly opposing forces?

Fortunately, modern Hadoop technologies provide a breadth of options. Projects such as Ranger, Knox, and Sentry provide a new level of protection for at-rest data on the cluster, while tools such as NiFi and Kafka either support external security protocols (e.g., Kerberos) or include built-in security layers. Assuming we focus on Hadoop-based data catalogs, these tools will cover most facets of security within the data lake. Any remaining holes will come from external systems, and the burden of closing them should fall on those systems.
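As one concrete anchor point, Hadoop's switch from "simple" to Kerberos authentication is controlled in `core-site.xml`. The excerpt below is a hedged illustration of the two standard properties involved; a real rollout also requires keytabs, principals, and per-service configuration beyond this fragment.

```xml
<!-- Illustrative excerpt from core-site.xml: enabling Kerberos
     authentication and service-level authorization cluster-wide. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```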

Value: Security, whether applied by the catalog or by underlying systems, is integral to continued operation at an enterprise level. Hadoop has several components to address this requirement.

3. Use Hadoop Components Specific to the Data Catalog

To leverage Hadoop as a core component of the data catalog, the catalog itself must have a foundation in Hadoop. This means that the agility, security, and flexibility available through the various components of Hadoop must be explicitly designed for in the catalog itself.

Many catalogs provide connectivity or extensibility to Hadoop, but these bolt-on tools inevitably introduce security holes, functionality limitations, and additional maintenance points. Even if such a catalog works today, Hadoop evolves so rapidly that there is no guarantee it will work tomorrow, when an API call to HBase or Sqoop changes its signature. In most cases, it is much easier to bring external systems into Hadoop than vice versa.

Value: By natively leveraging Hadoop, data catalogs can take advantage of the power and agility of the ecosystem without introducing cumbersome, unstable, or complex workarounds.

4. Plan for Connecting Hadoop to External Systems

Although it is important to consider Hadoop compatibility when choosing a data catalog vendor, it is just as important to consider how well the solution will play with external systems. As much as any Hadoop advocate may hate to admit it, any enterprise-size infrastructure will include many more systems than Hive, HBase, HDFS, and other Hadoop components.

NoSQL, RDBMS, and traditional NFS-style systems abound in real-world infrastructures. Without connectivity to these systems, a data catalog will only ever capture a fraction of its potential audience and will miss out on the insights a consolidated view can generate. Fortunately, as long as the catalog is Hadoop-based, a plethora of tools exists to cover ingestion and export from sources ranging from field sensors to relational databases. Tools like Sqoop, NiFi, Flume, and Kafka are well-established and evolving rapidly to keep pace with source systems.
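The harvesting step these tools perform can be sketched in a few lines. Below, `sqlite3` stands in for a production RDBMS purely for illustration; in practice a tool like Sqoop or NiFi would handle the connection and transfer, and the `harvest_tables` helper is a hypothetical name, not part of any of those tools.

```python
# Hedged sketch: pulling table metadata from an external RDBMS into
# simple catalog entries, so the catalog sees beyond Hadoop itself.
import sqlite3

def harvest_tables(conn):
    """Return a catalog entry (source, name, columns) per table."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    entries = []
    for (table,) in cur.fetchall():
        cur.execute(f"PRAGMA table_info({table})")
        cols = [row[1] for row in cur.fetchall()]  # row[1] = column name
        entries.append({"source": "rdbms", "name": table, "columns": cols})
    return entries

# Stand-in external system with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
entries = harvest_tables(conn)
```

The point is the shape of the result: a uniform entry per external table, ready to sit beside Hive or HBase entries in the consolidated view.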

Value: Hadoop will never hold 100% of your data assets. To create a complete picture, and therefore have the most effective and valuable catalog, external systems should be easily connected.

5. Be Ready for Change

If there is one constant in Big Data, it is change. Hadoop truly changes day to day, and that is before considering all of the peripheral technologies that make up enterprise infrastructures. To be successful, a measure of future-proofing is necessary. While no software platform can predict the future (although certain AI tools may be getting close), a solid framework of stable, public-facing APIs can go a long way toward achieving this goal.
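The stable-API idea can be illustrated with a small sketch: clients program against an abstract contract, so the backing store (HBase today, something else tomorrow) can be swapped without breaking them. All class and function names here (`CatalogAPI`, `HBaseBackedCatalog`, `resolve`) are hypothetical, and a plain dict stands in for the real store.

```python
# Illustrative future-proofing: a stable public-facing interface
# decoupling catalog consumers from the storage backend.
from abc import ABC, abstractmethod

class CatalogAPI(ABC):
    """The stable contract exposed to users and integrations."""
    @abstractmethod
    def lookup(self, name: str) -> str: ...

class HBaseBackedCatalog(CatalogAPI):
    """One interchangeable backend; a dict stands in for HBase."""
    def __init__(self, table):
        self._table = table
    def lookup(self, name):
        return self._table[name]

def resolve(catalog: CatalogAPI, name: str) -> str:
    # Client code depends only on the abstract contract,
    # so swapping backends never touches this call site.
    return catalog.lookup(name)

loc = resolve(HBaseBackedCatalog({"events": "hdfs:///lake/events"}), "events")
```

When the backend inevitably changes, only a new `CatalogAPI` subclass is needed; every caller keeps working unmodified.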

Technology isn’t the only thing that changes, and as your business grows, expands, and refines its needs and processes, the data catalog needs to be able to scale out organically as well. This means flexibility in usage and functionality is key.

Value: A catalog that is ready for changes in technology and scope will mean more ROI and less headache as your business inevitably changes.

Building a data catalog can be a long, complex, difficult process, and the five guidelines above are just a starting point. Although new products, vendors, and services arise daily, following these guidelines can help you navigate the muddy waters of data lakes and Hadoop and increase your chances of a successful launch.

In the end, the choice of a product often comes down to what fits your organization. As long as thought and planning have gone into that choice, you can stop worrying about the many ways a data catalog can go wrong and start focusing on how to make it work for your business.

