Reading from HBase
Job Configuration
The following is an example of using HBase as a MapReduce source in a read-only manner:
Configuration config = HBaseConfiguration.create();
config.set(
  "mapred.map.tasks.speculative.execution",  // speculative execution will
  "false");                                  // decrease performance or
                                             // damage the data

Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);  // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName,       // input HBase table name
  scan,            // Scan instance to control CF and attribute selection
  MyMapper.class,  // mapper
  null,            // mapper output key
  null,            // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);  // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
The mapper instance would extend TableMapper, like this:
public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}
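For instance, since the job above uses NullOutputFormat and the mapper emits nothing, a minimal map() body might do its per-row processing and track progress with a job counter. This is only a sketch; the Counters enum is introduced here purely for illustration:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;

public static class MyMapper extends TableMapper<Text, Text> {

  // Hypothetical counter for this sketch.
  public static enum Counters { ROWS }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // Read cells from the Result here, then record that the row was seen.
    context.getCounter(Counters.ROWS).increment(1);
  }
}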
Number of Map Tasks
When TableInputFormat is used to read an HBase table as input to a MapReduce job (it is set by default by TableMapReduceUtil.initTableMapperJob(...)), its splitter makes one map task per region of the table. Thus, if there are 100 regions in the table, there will be 100 map tasks for the job, regardless of how many column families the Scan selects. To implement different behavior (a custom splitter), see the getSplits method in TableInputFormatBase; you can either override it in a custom-splitter class or use it as an example, as sketched below.
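As a rough, hedged illustration of the override approach: the class name MyTableInputFormat is hypothetical, and the body just delegates to the default region-per-split logic where the custom code would go.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class MyTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Default behavior from TableInputFormatBase: one split, and therefore
    // one map task, per region.
    List<InputSplit> splits = super.getSplits(context);

    // A real custom splitter would merge or subdivide these splits here;
    // this sketch returns them unchanged.
    return splits;
  }
}

The job would then have to use this input format instead of the default one, for example via job.setInputFormatClass(MyTableInputFormat.class) after initTableMapperJob(...) has run.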
Writing to HBase
Job Configuration
The following is an example of using HBase both as a source and as a sink with MapReduce:
// Configuring reading from an HBase table is the same as in the
// read-only example above.
Configuration config = ...;
Job job = ...;
Scan scan = ...;
TableMapReduceUtil.initTableMapperJob(...);
TableMapReduceUtil.initTableReducerJob(
  targetTable,           // output table
  MyTableReducer.class,  // reducer class
  job);
job.setNumReduceTasks(1);  // at least one, adjust as required

boolean b = job.waitForCompletion(true);
And the reducer instance would extend TableReducer, as shown here:
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    ...
    Put put = ...;  // data to be written
    context.write(null, put);
    ...
  }
}
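To make the elided pieces concrete, the following is a minimal sketch of a summing reducer, assuming a hypothetical cf:count column; Put.add(...) is the older API matching the examples above (newer HBase versions use Put.addColumn(...)):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  // Hypothetical column family and qualifier used only for this sketch.
  private static final byte[] CF = Bytes.toBytes("cf");
  private static final byte[] COUNT = Bytes.toBytes("count");

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    // Use the reduce key as the row key; TableOutputFormat ignores the
    // key passed to context.write, so null is fine there.
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COUNT, Bytes.toBytes(sum));
    context.write(null, put);
  }
}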