Write a Data Pipeline with Apache Falcon


In the last two posts (post 1, post 2), I provided a basic introduction to Apache Falcon. In this post, I will describe how to write a basic Falcon data pipeline.

The Falcon process described here is triggered when two conditions are met:

  1. The process start time (15:00 UTC) is reached.
  2. A trigger folder named ${YEAR}-${MONTH}-${DAY} is created under /tmp/feed-01/.

Once the Falcon process is triggered, it invokes an Oozie workflow, which runs an SSH action. The SSH script simply prints the two input parameters to the /tmp/demo.out file on the local file system of the SSH box.
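The second condition can be satisfied by creating the dated trigger folder on HDFS. A minimal sketch of how that folder name is built for today's instance (the hadoop command is shown commented out, since it only makes sense on the cluster):

```shell
# build the trigger folder path for today's instance, matching the
# ${YEAR}-${MONTH}-${DAY} pattern used by the feed definition
trigger_dir="/tmp/feed-01/$(date -u +%Y-%m-%d)"
echo "$trigger_dir"

# on the cluster, the actual trigger would be created with:
# hadoop fs -mkdir -p "$trigger_dir"
```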

The code for Falcon cluster (test-primary-cluster) is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="test-primary-cluster" description="test-primary-cluster" colo="TEST DEV PRIMARY CLUSTER" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://localhost:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://localhost:8020" version="2.2.0"/>
        <interface type="execute" endpoint="localhost:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/test-primary-cluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/test-primary-cluster/working"/>
    </locations>
    <ACL owner="rishav" group="hdpuser" permission="0770"/>
</cluster>

One important thing to note here is that you need to create the staging and working directories on HDFS with the proper permissions and ownership. The following permissions and ownership are needed on a Hortonworks cluster:

hadoop fs -mkdir -p /apps/falcon/test-primary-cluster/staging/
hadoop fs -chmod 777 /apps/falcon/test-primary-cluster/staging/
hadoop fs -mkdir -p /apps/falcon/test-primary-cluster/working/
hadoop fs -chmod 755 /apps/falcon/test-primary-cluster/working/
hadoop fs -chown -R falcon:hadoop /apps/falcon/test-primary-cluster

The code for Falcon feed (feed-01-trigger) is:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="feed-01-trigger" name="feed-01-trigger" xmlns="uri:falcon:feed:0.1">
    <frequency>days(1)</frequency>
    <late-arrival cut-off="hours(20)"/>
    <clusters>
        <cluster name="test-primary-cluster" type="source">
            <validity start="2015-09-07T14:00Z" end="2099-03-09T12:00Z"/>
            <retention limit="months(9999)" action="archive"/>
            <locations>
                <location type="data" path="/tmp/feed-01/${YEAR}-${MONTH}-${DAY}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/tmp/feed-01/${YEAR}-${MONTH}-${DAY}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="rishav" group="hdpuser" permission="0770"/>
    <schema location="/none" provider="none"/>
</feed>

For this feed -

  • The retention limit is set to 9999 months.
  • The late arrival cut-off is set to 20 hours.
  • The frequency is set to daily.

The code for Falcon process (process-01) is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="process-01" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="test-primary-cluster">
            <validity start="2015-09-08T15:00Z" end="2099-03-10T23:00Z"/>
        </cluster>
    </clusters>
    <!-- parallel and order are required by the process schema; values assumed here -->
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>days(1)</frequency>
    <inputs>
        <input name="feed-01-trigger" feed="feed-01-trigger" start="today(0,0)" end="today(1,0)"/>
    </inputs>
    <properties>
        <property name="workflowName" value="workflow-01"/>
        <property name="input1" value="variable1"/>
        <property name="input2" value="${formatTime(dateOffset(instanceTime(), -1, 'DAY'),'yyyy-MM-dd')}"/>
    </properties>
    <workflow name="workflow-01" version="2.0.0" engine="oozie" path="/tmp/oozie_workflow"/>
    <retry policy="periodic" delay="minutes(15)" attempts="2"/>
    <ACL owner="rishav" group="hdpuser" permission="0770"/>
</process>

For this process -

  • The start time is set to 15:00 UTC.
  • The dependency is set to the input feed feed-01-trigger.
  • The retry policy is set to 2 attempts with a gap of 15 minutes.
  • The process also uses an EL expression to set the input2 variable to yesterday's date.
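The EL expression for input2 can be emulated on the command line. A sketch of the equivalent computation (GNU date assumed):

```shell
# yesterday's date in yyyy-MM-dd — the same value Falcon's EL expression
# ${formatTime(dateOffset(instanceTime(), -1, 'DAY'),'yyyy-MM-dd')}
# yields for an instance whose nominal time is today
input2=$(date -u -d "yesterday" +%Y-%m-%d)
echo "input2 = $input2"
```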

The Oozie workflow with the SSH action is defined below:

<workflow-app name="${workflowName}" xmlns="uri:oozie:workflow:0.1">
    <start to="demo_script"/>
    <action name="demo_script">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <!-- remote user and script path assumed; only the host name (poc001) is given in the text -->
            <host>rishav@poc001</host>
            <command>demo.bash</command>
            <args>${input1}</args>
            <args>${input2}</args>
        </ssh>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

This Oozie workflow -

  • Gets the input1, input2, and workflowName variables from the Falcon process-01 process.
  • Invokes a shell script on the poc001 box with input1 and input2 as parameters.

The demo.bash script called by the Oozie SSH action is given below:

#!/bin/bash
cd ~
echo "$(date)" >> /tmp/demo.out
echo "input1 $1" >> /tmp/demo.out
echo "input2 $2" >> /tmp/demo.out

demo.bash is a simple script that echoes the current date and the input1 and input2 variables to the /tmp/demo.out file.
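Run by hand, the script behaves as shown below — a local sketch that recreates demo.bash under /tmp instead of on the poc001 box, passing sample values for input1 and input2:

```shell
# recreate demo.bash locally (same body as above) — paths are assumed
cat > /tmp/demo.bash <<'EOF'
echo "$(date)" >> /tmp/demo.out
echo "input1 $1" >> /tmp/demo.out
echo "input2 $2" >> /tmp/demo.out
EOF

# invoke it the way the SSH action would, with sample parameter values
bash /tmp/demo.bash variable1 2015-09-07

# show the three lines just appended
tail -3 /tmp/demo.out
```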

In my next post I will explain how we can submit and schedule these Falcon processes.



Published at DZone with permission of
