
Saving H2O Models From R/Python API in Hadoop Environment


Check out an example of why you might get a No Such File Or Directory error and learn how you can get around this issue.


When you are using H2O in a clustered environment such as Hadoop, h2o.saveModel() may be writing the model on a different machine from the one running your client. That is why you see the error "No such file or directory." If you pass a local path such as /tmp and then log into the machine hosting the H2O node that your R client connected to, you will find the model stored there.

Here is an example that illustrates this behavior.

1. Start Hadoop Driver in EC2 Environment 

[ec2-user@ip-10-0-104-179 ~]$ hadoop jar h2o-3.10.4.8-hdp2.6/h2odriver.jar -nodes 2 -mapperXmx 2g -output /usr/ec2-user/005
....
....
....
Open H2O Flow in your web browser: http://10.0.65.248:54323  <=== H2O is started.

2. Connect R Client With H2O

> h2o.init(ip = "10.0.65.248", port = 54323, strict_version_check = FALSE)

Note: I used the IP address shown above to connect to the existing H2O cluster. However, the machine where I am running the R client is a different one; its IP address is 34.208.200.16.
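The same connection can be made from the h2o Python package. This is a sketch assuming the same cluster IP and port as above; like the R client, the Python client only attaches to the already-running cluster rather than starting a new one:

```python
import h2o

# Attach to the existing H2O cluster that h2odriver started on Hadoop.
# Model files saved later are written by the cluster node, not by this client.
h2o.init(ip="10.0.65.248", port=54323, strict_version_check=False)
```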

3. Saving H2O Model

h2o.saveModel(my.glm, path = "/tmp", force = TRUE)

The model is saved on 10.0.65.248 even though the R client is running on 34.208.200.16.

[ec2-user@ip-10-0-65-248 ~]$ ll /tmp/GLM*
-rw-r--r-- 1 yarn hadoop 90391 Jun 2 20:02 /tmp/GLM_model_R_1496447892009_1

You need to make sure you have write access to the target folder on the machine where the H2O service is running. Alternatively, you can save the model to HDFS, which every node in the cluster can reach, with something like the following:

h2o.saveModel(my.glm, path = "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", force = TRUE)
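For completeness, the Python API uses h2o.save_model() for the same operation. This is a sketch assuming my_glm is a hypothetical, already-trained model handle on the same cluster:

```python
import h2o

# my_glm is a hypothetical trained model handle; train or load one first.
# Saving to HDFS sidesteps the local-filesystem mismatch entirely, because
# the path resolves identically from every node in the cluster.
model_path = h2o.save_model(
    my_glm,
    path="hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan",
    force=True,
)
print(model_path)  # full HDFS path of the saved model
```

The returned value is the full path of the saved model, which you can pass back to h2o.load_model() later.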



Published at DZone with permission of
