Fluentd + Hadoop: Instant Big Data Collection

Sadayuki Furuhashi · Nov. 26, 2012

Fluentd is a JSON-based, open-source log collector originally written at Treasure Data. It is specifically designed to solve the big data collection problem.

Many companies choose the Hadoop Distributed File System (HDFS) for big data storage. [1] Until recently, however, the only programmatic interface to HDFS was the Java API. This changed with the new WebHDFS interface, which allows users to interact with HDFS over HTTP. [2]

This post shows you how to set up Fluentd to receive data over HTTP and upload it to HDFS via WebHDFS.

Mechanism

At a high level, clients post JSON records to Fluentd over HTTP, Fluentd buffers them, and the WebHDFS output plugin periodically appends the buffered data to files on HDFS through the WebHDFS REST API.

Install

For simplicity, this post shows the one-node configuration. You should have the following software installed on the same node.

  • Fluentd with WebHDFS Plugin
  • HDFS

Fluentd’s most recent deb/rpm packages (v1.1.10 or later) include the WebHDFS plugin. If you want to install the plugin with RubyGems, gem install fluent-plugin-webhdfs does the job (see the sketch after the package list below).

  • Debian Package
  • RPM Package
  • For CDH, please refer to the downloads page (CDH3u5 and CDH4 or later)
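
If you are not using the deb/rpm packages, a minimal RubyGems-based install sketch looks like the following (it assumes Ruby and RubyGems are already present; the gem names are the standard fluentd and fluent-plugin-webhdfs gems):

$ gem install fluentd
$ gem install fluent-plugin-webhdfs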

Fluentd Configuration

Let’s configure Fluentd. If you use the deb/rpm package, Fluentd’s config file is located at /etc/td-agent/td-agent.conf. Otherwise, it is located at /etc/fluentd/fluentd.conf.

HTTP Input

For input, let’s set up Fluentd to accept data from HTTP. This is what the Fluentd configuration looks like.

<source>
  type http
  port 8080
</source>

WebHDFS Output

The output configuration should look like this:

<match hdfs.access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/%Y%m%d_%H/access.log.${hostname}
  flush_interval 10s
</match>

The match section specifies a tag pattern; here, hdfs.access.** matches any event whose tag starts with hdfs.access. If an event's tag matches, the configuration inside the section is used. With the HTTP input above, the tag is taken from the request path, so a record posted to /hdfs.access.test gets the tag hdfs.access.test.

flush_interval indicates how often data is written to HDFS. The append operation is used to add the incoming data to the file specified by the path parameter.

For the value of path, you can use placeholders for time and hostname (notice how %Y%m%d_%H and ${hostname} are used above). This prevents multiple Fluentd instances from appending data to the same file, which must be avoided when the append operation is used.

The other two options, host and port, specify the HDFS NameNode's hostname and its WebHDFS (HTTP) port, respectively.
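
For example, with the configuration above, an event received at 14:00 on October 22, 2012 by a Fluentd instance running on a host named dev would be appended to the following file (illustrative values; the actual name depends on your clock and hostname):

/log/20121022_14/access.log.dev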

HDFS Configuration

WebHDFS and the append operation are disabled by default. Please put the following settings into your hdfs-site.xml and restart the whole cluster.

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

<property>
  <name>dfs.support.broken.append</name>
  <value>true</value>
</property>

Also, please make sure that the path specified in Fluentd’s WebHDFS output is writable by the hdfs user.
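
A minimal sketch of doing that for the /log directory used above (quick-test permissions only; for a real deployment, tighten ownership to whatever user your WebHDFS requests are made as):

$ sudo -u hdfs hadoop fs -mkdir /log
$ sudo -u hdfs hadoop fs -chmod 777 /log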

Test

To test the setup, just post a JSON record to Fluentd. This example uses the curl command to do so.

$ curl -X POST -d 'json={"action":"login","user":2}' \
  http://localhost:8080/hdfs.access.test

Then, let’s access HDFS and see the stored data.

$ sudo -u hdfs hadoop fs -lsr /log/
drwxr-xr-x   - 1 supergroup          0 2012-10-22 09:40 /log/20121022_14/access.log.dev
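
To look at the stored records themselves, cat the file shown in the listing (the exact file name depends on the date/hour and hostname placeholders, and each line's layout depends on the plugin's output format settings):

$ sudo -u hdfs hadoop fs -cat /log/20121022_14/access.log.dev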

Success!
