Fluentd + Hadoop: Instant Big Data Collection
Fluentd is a JSON-based, open-source log collector originally written at Treasure Data. Fluentd is specifically designed for solving the big data collection problem.
Many companies choose the Hadoop Distributed File System (HDFS) for big data storage. [1] Until recently, however, the only programmatic interface was the Java API. This changed with the new WebHDFS interface, which allows users to interact with HDFS over HTTP. [2]
This post shows you how to set up Fluentd to receive data over HTTP and upload it to HDFS via WebHDFS.
Mechanism
The high-level architecture is straightforward: Fluentd receives log data over HTTP and writes it to HDFS through the WebHDFS interface.
Install
For simplicity, this post shows a one-node configuration. You should have the following software installed on the same node:
- Fluentd with WebHDFS Plugin
- HDFS
The most recent deb/rpm packages of Fluentd (v1.1.10 or later) already include the WebHDFS plugin. If you installed Fluentd with RubyGems instead, gem install fluent-plugin-webhdfs does the job (a quick install-and-check sketch follows the package list below).
- Debian Package
- RPM Package
- For CDH, please refer to the downloads page (CDH3u5 and CDH4 or later)
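A minimal install-and-check sketch, assuming Fluentd was installed as a plain Ruby gem rather than from the deb/rpm packages (which already bundle the plugin):

# Install the WebHDFS output plugin via RubyGems
$ gem install fluent-plugin-webhdfs
# Confirm the plugin is visible to the Ruby that runs Fluentd
$ gem list | grep fluent-plugin-webhdfs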
Fluentd Configuration
Let’s configure Fluentd. If you use the deb/rpm package, Fluentd’s config file is located at /etc/td-agent/td-agent.conf. Otherwise, it is located at /etc/fluentd/fluentd.conf.
HTTP Input
For input, let’s set up Fluentd to accept data from HTTP. This is what the Fluentd configuration looks like.
<source>
  type http
  port 8080
</source>
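With this input, the path portion of the request URL becomes the event tag. For example, a hypothetical request like the following would produce an event tagged hdfs.access.dev:

# Hypothetical example: the URL path becomes the Fluentd tag
$ curl -X POST -d 'json={"level":"debug"}' http://localhost:8080/hdfs.access.dev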
WebHDFS Output
The output configuration should look like this:
<match hdfs.access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/%Y%m%d_%H/access.log.${hostname}
  flush_interval 10s
</match>
The match section specifies the pattern used to match event tags. If a tag matches, the configuration inside the section is used.
flush_interval indicates how often data is written to HDFS. The append operation is used to add the incoming data to the file specified by the path parameter.
For the value of path, you can use placeholders for time and hostname (notice how %Y%m%d_%H and ${hostname} are used above). This prevents multiple Fluentd instances from appending data to the same file, which must be avoided when using the append operation.
The other two options, host and port, specify the HDFS NameNode's host and port, respectively.
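As a concrete example (using the hostname and hour that appear in the Test section below), the path above would expand roughly like this:

/log/%Y%m%d_%H/access.log.${hostname}
  -> /log/20121022_14/access.log.dev   (hostname "dev", flushed during the 14:00 hour on 2012-10-22)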
HDFS Configuration
Appending to files is disabled in HDFS by default. Please put the following configuration into your hdfs-site.xml and restart the whole cluster:
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
<property>
  <name>dfs.support.broken.append</name>
  <value>true</value>
</property>
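Once the cluster is back up, a quick way to confirm that WebHDFS is answering is to hit its REST endpoint directly. This is just a sanity check, reusing the NameNode host and port from the Fluentd configuration above:

# List the HDFS root directory through the WebHDFS REST API
$ curl -i "http://namenode.your.cluster.local:50070/webhdfs/v1/?op=LISTSTATUS"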
Also, please make sure that the path specified in Fluentd's WebHDFS output is writable by the hdfs user.
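A simple way to ensure this, assuming the /log path used in the configuration above, is to create the directory as the hdfs user before starting Fluentd:

# Create the destination directory owned by the hdfs user
$ sudo -u hdfs hadoop fs -mkdir /log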
Test
To test the setup, just post a JSON record to Fluentd. This example uses the curl command to do so.
$ curl -X POST -d 'json={"action":"login","user":2}' \
    http://localhost:8080/hdfs.access.test
Then, let’s access HDFS and see the stored data.
$ sudo -u hdfs hadoop fs -lsr /log/
drwxr-xr-x - 1 supergroup 0 2012-10-22 09:40 /log/20121022_14/access.log.dev
Success!
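If you want to inspect the stored records themselves, you can also cat the file. The exact line format depends on the output plugin's format settings, so only the command is shown here:

$ sudo -u hdfs hadoop fs -cat /log/20121022_14/access.log.dev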
Published at DZone with permission of Sadayuki Furuhashi, DZone MVB.