5 Guidelines for Building a Successful Data Catalog

Thought + planning = a worry-free environment. Once you've thought about what data catalog you want to use, you can start to focus on making it work for your business.

By Greg Wood · Feb. 22, 17 · Opinion


At times, the search for a perfect data catalog can seem like finding a needle in a haystack. Each stakeholder has equally demanding and disparate requirements for success. Where business analysts want a slick, refined, easily navigated UI with simple export capabilities, data scientists might refuse to accept anything that does not allow custom-tailored queries, connections to their favorite notebook, and unburdened access to all of the data that has ever existed in the data lake. Meanwhile, the security group wants none of this! Exposing the data at all is a non-starter.

This leaves you — the tech visionary who has a stable of cutting-edge vendors at the ready and a five-year rollout plan to go with them — stuck in neutral.

Before you resort to breaking out the floppy disks in protest, here are five guidelines for building a successful data catalog that can help your business succeed without compromising your stakeholders.

1. Reduce Overhead Through Open Access

Open access is the foundation of any successful data catalog. The demand for a catalog commonly emerges from the desire for a more intuitive, less burdensome way to access available data. Any solution that does not address this fundamental need is bound to run into difficulties in gaining business support.

For users, open access provides the value that is sorely lacking from traditional data lakes: efficient, accurate, and personalized access to data, no matter where that data may originate.

For administrators, open access can dramatically reduce the overhead that comes from routine requests and maintenance of audit histories and ticketing systems. A good catalog should automate and manage these functions.
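The idea of folding auditing into the access path itself can be sketched in a few lines. The class and dataset names below are hypothetical, and an in-memory dict stands in for real storage; the point is only that every read produces an audit record automatically, so no separate ticketing or logging workflow is needed:

```python
import datetime

class CatalogAccessLayer:
    """Illustrative sketch: every read is served and audited in one step."""

    def __init__(self, datasets):
        self.datasets = datasets   # name -> data (stand-in for real storage)
        self.audit_log = []        # replaces a manually maintained audit trail

    def read(self, user, dataset):
        # Record the access automatically -- no ticket, no manual audit entry.
        self.audit_log.append({
            "user": user,
            "dataset": dataset,
            "timestamp": datetime.datetime.utcnow().isoformat(),
        })
        return self.datasets[dataset]

catalog = CatalogAccessLayer({"sales_2016": [101, 205, 317]})
rows = catalog.read("analyst_jane", "sales_2016")
```

Because the audit entry is written inside `read()`, administrators get a complete access history for free rather than reconstructing it from tickets after the fact.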

Value: Open access to any data owned or used by consumers of the catalog creates fast time-to-value for users and greater efficiency for administrators.

2. Protect Data Through Governance and Security

Security in the data catalog often creates a significant conundrum. While providing open, transparent access to data is essential to success, this access can also create security threats that will quickly shut the project down. Although security groups at your organization may not have an active role in using the data catalog, they certainly have a vested interest in keeping the organization protected from both external and internal threats. How can you balance these two seemingly opposing forces?

Fortunately, modern Hadoop technologies provide a breadth of options. Projects such as Ranger, Knox, and Sentry provide a new level of protection for at-rest data on the cluster, while tools such as NiFi and Kafka either support external security protocols (e.g., Kerberos) or provide built-in security layers. Assuming we focus on Hadoop-based data catalogs, these tools will take care of most facets of security within the data lake. Any remaining holes will come from external systems, and the burden of closing them should fall on those systems.
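As a concrete illustration of the policy-based model tools like Ranger use, the function below builds a read-only policy payload modeled loosely on Ranger's policy JSON. The field names follow Ranger's general shape but may differ across versions, and the service and path values are invented for the example; treat this as a sketch, not the exact schema:

```python
def build_hdfs_read_policy(service, path, users):
    """Build a payload modeled loosely on Apache Ranger's policy JSON.

    Field names are illustrative; consult the Ranger REST API docs for
    the exact schema of your Ranger version.
    """
    return {
        "service": service,
        "name": "read-only: " + path,
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [{
            "users": users,
            "accesses": [{"type": "read", "isAllowed": True}],
        }],
    }

# Hypothetical service and path names for illustration only.
policy = build_hdfs_read_policy("hadoopdev", "/data/catalog/raw", ["analyst_jane"])
```

Expressing access as declarative policies like this is what lets the catalog stay open for approved users while the security team retains centralized control.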

Value: Security, whether applied by the catalog or by underlying systems, is integral to continued operation at an enterprise level. Hadoop has several components to address this requirement.

3. Use Hadoop Components Specific to the Data Catalog

To leverage Hadoop as a core component of the data catalog, the catalog itself must have a foundation in Hadoop. This means that the agility, security, and flexibility available through the various components of Hadoop must be explicitly designed for in the catalog itself.

Many catalogs provide connectivity or extensibility to Hadoop, but these tools inevitably introduce security holes, functionality limitations, and additional maintenance points. Even if such a catalog does work today, Hadoop evolves so rapidly that there is no guarantee that it will work tomorrow when an API call to HBase or Sqoop changes its syntax. In most cases, it is much easier to bring external systems into Hadoop rather than vice versa.
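One common way to contain the API churn described above is a thin adapter layer: catalog code depends on a stable internal interface, and only the adapter changes when an underlying client's syntax does. The classes below are toy stand-ins (no real HBase client is used); the renamed `fetch` method is an invented example of the kind of change an adapter absorbs:

```python
class HBaseAdapter:
    """Thin wrapper: the catalog codes against get_record(); only this
    class changes if the underlying client API changes syntax."""

    def __init__(self, client):
        self.client = client

    def get_record(self, table, key):
        # Suppose an older client exposed get(table, key) and a newer
        # release renamed it to fetch(table, row=key). The adapter
        # absorbs the change so catalog code is untouched.
        if hasattr(self.client, "fetch"):
            return self.client.fetch(table, row=key)
        return self.client.get(table, key)

class OldClient:
    def get(self, table, key):
        return {"table": table, "key": key, "via": "get"}

class NewClient:
    def fetch(self, table, row):
        return {"table": table, "key": row, "via": "fetch"}

old_result = HBaseAdapter(OldClient()).get_record("assets", "a1")
new_result = HBaseAdapter(NewClient()).get_record("assets", "a1")
```

A catalog built natively on Hadoop can keep such adapters small and few; a catalog bolted on from outside tends to accumulate them everywhere.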

Value: By natively leveraging Hadoop, data catalogs can take advantage of the power and agility of the ecosystem without introducing cumbersome, unstable, or complex workarounds.

4. Plan for Connecting Hadoop to External Systems

Although it is important to consider Hadoop compatibility when choosing a data catalog vendor, it is just as important to consider how well the solution will play with external systems. As much as any Hadoop advocate may hate to admit it, any enterprise-size infrastructure will include many more systems than Hive, HBase, HDFS, and other Hadoop components.

NoSQL, RDBMS, and traditional NFS-style systems abound in real-world infrastructure. Without connectivity to these types of systems, a data catalog will only ever capture a fraction of its potential audience and will always miss some of the insights a consolidated view could generate. Fortunately, as long as the catalog is Hadoop-based, a plethora of tools exists to cover ingestion and export from sources ranging from field sensors to relational databases. Tools like Sqoop, NiFi, Flume, and Kafka are well-established and rapidly evolving to keep up with source systems.
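The consolidated view argued for above can be sketched as a small connector registry: each external system contributes a listing function, and the catalog merges the results into one view. All system and dataset names here are hypothetical, and the lambdas stand in for real connectors such as Sqoop or NiFi flows:

```python
class ConnectorRegistry:
    """Toy registry: each source system registers a function that lists
    its datasets, and the catalog merges them into a single view."""

    def __init__(self):
        self.connectors = {}

    def register(self, system, lister):
        self.connectors[system] = lister

    def consolidated_view(self):
        # Merge every system's datasets into one flat, queryable list.
        view = []
        for system, lister in self.connectors.items():
            for name in lister():
                view.append({"system": system, "dataset": name})
        return view

registry = ConnectorRegistry()
registry.register("hive", lambda: ["web_logs", "clickstream"])
registry.register("oracle", lambda: ["customers"])
registry.register("nfs", lambda: ["sensor_dumps"])
view = registry.consolidated_view()
```

The value of the registry pattern is that adding a new source system is one `register` call, not a redesign of the catalog.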

Value: Hadoop will never hold 100% of your data assets. To create a complete picture, and therefore have the most effective and valuable catalog, external systems should be easily connected.

5. Be Ready for Change

If there is one constant in Big Data, it is change. Hadoop truly changes day to day — and that isn’t even considering all the peripheral technologies that make up enterprise infrastructures. In order to be successful, a measure of future-proofing is necessary. While no software platform can predict the future (although certain AI tools may be getting close), a solid framework of stable, public-facing APIs can go a long way towards achieving this goal.
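The stable-public-API idea can be made concrete with a tiny facade: callers depend on a versioned entry point, while the implementation behind it is free to change. The function names and the substring-matching search are invented for illustration:

```python
def search_v1(catalog, term):
    """Stable public entry point: callers depend on this signature,
    while the internal implementation can change underneath it."""
    return _search_impl(catalog, term)

def _search_impl(catalog, term):
    # Today: a simple case-insensitive substring match. Tomorrow this
    # could be swapped for an index-backed query (e.g., Solr or
    # Elasticsearch) without breaking any search_v1 callers.
    return [name for name in catalog if term.lower() in name.lower()]

catalog = ["Sales_2016", "clickstream_raw", "customer_master"]
results = search_v1(catalog, "sales")
```

Keeping the versioned facade thin is what buys future-proofing: the contract survives even as everything behind it evolves.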

Technology isn’t the only thing that changes, and as your business grows, expands, and refines its needs and processes, the data catalog needs to be able to scale out organically as well. This means flexibility in usage and functionality is key.

Value: A catalog that is ready for changes in technology and scope will mean more ROI and less headache as your business inevitably changes.

Building a data catalog can be a long, complex, difficult process, and the five guidelines above are just a starting point. Although new products, vendors, and services arise daily, following these guidelines can help you navigate the muddy waters of data lakes and Hadoop and increase your chances of a successful launch.

In the end, the choice of a product often comes down to a personal choice for your organization. As long as thought and planning have gone into that choice, you can stop worrying about the many ways that a data catalog can go wrong and start focusing on how to make it work for your business.

Tags: data science, Hadoop, guidelines, security

Published at DZone with permission of Greg Wood, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
