What are the Biggest Hadoop Challenges?

Hadoop has become much more popular, but its increased use has brought many misconceptions with it. Don't just use Hadoop, use it right. Check out these common Hadoop challenges.

by Sergey Tryuber · Jul. 27, 15 · Analysis

Many companies are adopting Hadoop in their IT infrastructure. For Big Data veterans with a strong engineering team, it is usually not a big issue to design the target system, choose a technology stack, and start implementation. Even those with a lot of experience can sometimes struggle with all the complexity, but Hadoop beginners face a myriad of challenges just to get started. Below are the most commonly seen Hadoop challenges that Grid Dynamics resolves for its clients.

Diversity of Vendors. Which to choose?

The common first reaction is to use the original Hadoop binaries from the Apache website, but it soon becomes clear why only a few companies use them “as-is” in a production environment. There are a lot of good arguments against doing so. Then panic sets in with the realization of just how many Hadoop distributions are available, from the free ones by Hortonworks, Cloudera, and MapR to large commercial offerings such as IBM InfoSphere BigInsights and Oracle Big Data Appliance. Oracle even includes hardware! Things become even more tangled after a few introductory calls with the vendors. Selecting the right distribution is not an easy task, even for experienced staff, since each one bundles different Hadoop components (like Cloudera Impala in CDH), a different configuration manager (Ambari, Cloudera Manager, etc.), and its own overall vision of Hadoop's mission.

SQL on Hadoop. Very popular, but not clear...

Hadoop stores a lot of data. Apart from processing it according to predefined pipelines, businesses want to get more value by giving data scientists and business analysts interactive access. Marketing buzz on the Internet even pushes them to do this, implying, but not clearly saying, that Hadoop is competitive with Enterprise Data Warehouses. The situation here is similar to the diversity of vendors, since there are too many frameworks that provide “interactive SQL on top of Hadoop,” but the challenge is not in selecting the best one. The challenge is understanding that, for now, none of them is an equal replacement for traditional OLAP databases. Alongside many obvious strategic advantages, there are debatable shortcomings in performance, SQL compliance, and ease of support. This is a different world, and you should either play by its rules or not consider it a replacement for traditional approaches.

Big Data Engineers. Are there any?

A good engineering staff is a major part of any IT organization, but it is truly critical in Big Data. Relying on good general-purpose Java/Python/C++/etc. engineers to design and implement good-quality data processing flows in most cases means wasting millions of dollars. After two years of development you could end up with unstable, unsupportable, and over-engineered scripts and jars accompanied by a zoo of frameworks. The situation becomes desperate if key developers leave the company. As in any other programming area, experienced Big Data developers spend most of their time thinking about how to keep things simple and how the system will evolve in the future. But experience with the Big Data technology stack is the key factor, so the challenge is in finding such developers.

Secured Hadoop Environment. A source of headaches.

More and more companies are storing sensitive data in Hadoop. Hopefully not credit card numbers, but at least data that falls under security regulations with their respective requirements. This challenge is purely technical, but it often causes issues. Things are simple if only HDFS and MapReduce are used. Encryption is available for both data in motion and data at rest, file system permissions are enough for authorization, and Kerberos is used for authentication. Just add perimeter and host-level security with explicit edge nodes and stay calm. But once you decide to use other frameworks, especially if they execute requests under their own system user, you are heading for trouble. First, not all of them support a Kerberized environment. Second, they might not have their own authorization features. Third, encryption of data in motion is frequently absent. And, finally, there is a lot of trouble if requests are supposed to be submitted from outside the cluster.
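
As a rough illustration of the HDFS/MapReduce baseline described above, the sketch below checks a few settings that cover Kerberos authentication, authorization, and encryption of data in motion. The property names are standard Hadoop configuration keys from core-site.xml and hdfs-site.xml; the helper `audit_security`, the `EXPECTED` table, and the sample dictionary are hypothetical and exist only for this example. It is a simplification: at-rest encryption is set up through encryption zones and a key management server rather than a single flag, so it is not covered here, and some Hadoop versions accept more than one value for `hadoop.rpc.protection`.

```python
# Expected values for a Kerberized cluster with wire encryption.
# The keys are standard Hadoop properties; this table and the helper
# below are illustrative assumptions, not an official checklist.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",  # authentication via Kerberos
    "hadoop.security.authorization": "true",       # service-level authorization
    "hadoop.rpc.protection": "privacy",            # encrypt RPC traffic
    "dfs.encrypt.data.transfer": "true",           # encrypt block data transfer
}


def audit_security(conf):
    """Return findings for settings that are missing or differ from EXPECTED."""
    findings = []
    for key, wanted in EXPECTED.items():
        actual = conf.get(key)
        if actual is None:
            findings.append(f"{key} is not set (expected {wanted})")
        elif actual.strip().lower() != wanted:
            findings.append(f"{key} = {actual} (expected {wanted})")
    return findings


if __name__ == "__main__":
    # Hypothetical settings, as they might be read from the cluster's XML files.
    sample = {
        "hadoop.security.authentication": "simple",
        "hadoop.security.authorization": "true",
    }
    for finding in audit_security(sample):
        print(finding)
```

A check like this only covers the core Hadoop services. Each additional framework that runs under its own system user would need its own review of Kerberos support, authorization, and wire encryption.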

Conclusion

We have pointed out a few topical challenges as we see them. Of course, the list above is far from complete, and one could be scared off by it and decide not to use Hadoop at all or to postpone its adoption. That would not be wise. There is a whole list of advantages that Hadoop brings to organizations with skillful hands. In cooperation with other Big Data frameworks and techniques, it can move the capabilities of a data-oriented business to an entirely new level of performance.

hadoop Big data Data science

Opinions expressed by DZone contributors are their own.