Accessing Hadoop HDFS Data Using Node.js and the WebHDFS REST API

HDFS files are a popular means of storing data. Learn how to use Node.js and the WebHDFS RESTful API to manipulate HDFS data stored in Hadoop.

Apache Hadoop exposes services for accessing and manipulating HDFS content through the WebHDFS REST API. The official documentation is available at https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.
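
Each operation is invoked as a plain HTTP request against a URL of the following general form (the angle-bracket placeholders stand for your NameNode host, its HTTP port, the HDFS path, the operation name, and the HDFS user):

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=<OPERATION>&user.name=<USER>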

Available Services

Below is the set of available services; a minimal example of invoking one of these operations over HTTP follows the list:

1) File and Directory Operations

    1.1 Create and Write to a File: CREATE (HTTP PUT)
    1.2 Append to a File: APPEND (HTTP POST)
    1.3 Open and Read a File: OPEN (HTTP GET)
    1.4 Make a Directory: MKDIRS (HTTP PUT)
    1.5 Rename a File/Directory: RENAME (HTTP PUT)
    1.6 Delete a File/Directory: DELETE (HTTP DELETE)
    1.7 Status of a File/Directory: GETFILESTATUS (HTTP GET)
    1.8 List a Directory: LISTSTATUS (HTTP GET)

2) Other File System Operations

    2.1 Get Content Summary of a Directory: GETCONTENTSUMMARY (HTTP GET)
    2.2 Get File Checksum: GETFILECHECKSUM (HTTP GET)
    2.3 Get Home Directory: GETHOMEDIRECTORY (HTTP GET)
    2.4 Set Permission: SETPERMISSION (HTTP PUT)
    2.5 Set Owner: SETOWNER (HTTP PUT)
    2.6 Set Replication Factor: SETREPLICATION (HTTP PUT)
    2.7 Set Access or Modification Time: SETTIMES (HTTP PUT)
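
The remainder of this article demonstrates the read-oriented operations from Node.js. As a minimal sketch of a write operation, here is how MKDIRS (an HTTP PUT) could be invoked with the request npm module that is introduced below; the host, port, folder path, and user are placeholders, not values from a real cluster:

const request = require("request");

// Hypothetical NameNode host/port and target folder; replace with your own values.
let mkdirUrl = "http://<<your hdfs host name here>>:50070/webhdfs/v1/<<new folder path>>?op=MKDIRS&user.name=hdfs";

request.put(mkdirUrl, function(error, response, body) {
    if (!error && response.statusCode == 200) {
        console.log("..directory created..", body); // WebHDFS responds with {"boolean": true}
    } else {
        console.log("..error occurred!..", error);
    }
});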

Enabling the WebHDFS API

Make sure the config parameter dfs.webhdfs.enabled is set to true in the hdfs-site.xml file (this config file can be found inside {your_hadoop_home_dir}/etc/hadoop):

<configuration>
    <property>
        .....
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

Connecting to WebHDFS From Node.js

I am hoping you are familiar with Node.js and package installations; if not, please get comfortable with the Node.js basics first. There is an npm module, webhdfs, that provides a wrapper around the Hadoop WebHDFS API. The examples below also use the request module to call the REST API directly. You can install both packages using npm:

npm install webhdfs request

After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.

Import Dependent Modules

Below are the external modules to be imported:

const WebHDFS = require("webhdfs"); // wrapper client for the WebHDFS API
const request = require("request"); // plain HTTP client, used to call the REST API directly

Prepare Connection URL

Let us prepare the connection URL:

let url = "http://<<your hdfs host name here>>";

let port = 50070; //change here if you are using different port

let dir_path = "<<path to hdfs folder>>"; 

let path = "/webhdfs/v1/" + dir_path + "?op=LISTSTATUS&user.name=hdfs";

let full_url = url + ":" + port + path;
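
With a hypothetical host and directory filled in, full_url would look something like this (the host name and folder are made up for illustration):

http://namenode-host:50070/webhdfs/v1/user/hdfs/mydir?op=LISTSTATUS&user.name=hdfs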

List a Directory

Access the API and get the results:

request(full_url, function(error, response, body) {
    if (!error && response.statusCode == 200) {
        console.log(".. response body..", body);
        let jsonStr = JSON.parse(body);
        let myObj = jsonStr.FileStatuses.FileStatus;
        let objLength = Object.entries(myObj).length;
        console.log("..Number of files in the folder: ", objLength);
    } else {
        console.log("..error occurred!..", error);
    }
});

Here is the sample request and response of the LISTSTATUS API:

 https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#LISTSTATUS 
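
For reference, the response body has roughly the following shape (trimmed, with illustrative values), which is why the code above reads jsonStr.FileStatuses.FileStatus:

{
  "FileStatuses": {
    "FileStatus": [
      { "pathSuffix": "file1.txt", "type": "FILE", "length": 24930, "owner": "hdfs", ... },
      { "pathSuffix": "subdir1", "type": "DIRECTORY", "length": 0, "owner": "hdfs", ... }
    ]
  }
}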

Get and Display Content of an HDFS File

Assign the HDFS file name with a path:

let hdfs_file_name = '<<HDFS file path>>' ; 

The code below connects to HDFS using the WebHDFS client instead of the request module used in the section above:

let hdfs = WebHDFS.createClient({
    user: "<<user>>",
    host: "<<host/IP>>",
    port: 50070, // change here if you are using a different port
    path: "/webhdfs/v1"
});

The code below reads and displays the contents of the HDFS file:

let remoteFileStream = hdfs.createReadStream(hdfs_file_name);

remoteFileStream.on("error", function onError(err) { // handles errors during the read
    console.log("...error: ", err);
});

let dataStream = [];

remoteFileStream.on("data", function onChunk(chunk) { // called for each data chunk read
    dataStream.push(chunk);
    console.log('..chunk..', chunk);
});

remoteFileStream.on("finish", function onFinish() { // called once the read is complete
    console.log('..on finish..');
    // The chunks are Buffers, so concatenate them and convert to a string to display the file content.
    console.log('..file data..', Buffer.concat(dataStream).toString());
});

Here is the sample request and response of the OPEN API:

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#OPEN 

How to Read All Files in a Directory

There is no direct method for this, but we can achieve it by combining the two operations above: read the directory listing, then read the files in that directory one by one, as in the sketch below.
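
Here is a minimal sketch of that idea, reusing the full_url and the hdfs client defined earlier; the path handling (dir_path joined with each pathSuffix) is an assumption you may need to adapt to your setup:

request(full_url, function(error, response, body) {
    if (error || response.statusCode != 200) {
        return console.log("..error listing directory..", error);
    }

    let files = JSON.parse(body).FileStatuses.FileStatus;

    files.forEach(function(entry) {
        if (entry.type !== "FILE") { return; } // skip sub-directories

        // Build the HDFS path of the file and stream its content.
        let filePath = dir_path + "/" + entry.pathSuffix;
        let stream = hdfs.createReadStream(filePath);
        let chunks = [];

        stream.on("error", function(err) { console.log("..error reading ", filePath, err); });
        stream.on("data", function(chunk) { chunks.push(chunk); });
        stream.on("finish", function() {
            console.log("..content of ", filePath, ":", Buffer.concat(chunks).toString());
        });
    });
});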

Conclusion

I hope this gave you an idea of how to connect to HDFS and perform basic operations using Node.js and the webhdfs module. All the best!


