Avro Secondary Sorting

August 18, 2019

In this post we will look at secondary sorting with the Avro file format using the AvroJob API. This will be a two-part series:

MapReduce V1 API (org.apache.hadoop.mapred)
MapReduce V2 API (org.apache.hadoop.mapreduce)

This post covers the example with the MapReduce V1 API.

So let's start. Here's the Avro input schema that we are going to use in this example.

The sample data that we will read as input is shown here in JSON format for viewing purposes; the MapReduce program reads it in Avro format.

Secondary sorting means that the values sent to the Reducer arrive sorted by some criterion. You can read more about secondary sorting here.

In this example, the values in the Reducer's iterator should be sorted by the TimeStamp field of the sample data above, while the Reducer input keys should be grouped by its Id field.

For secondary sorting, the map output key schema must be a composite key: the natural key plus the secondary key. Here the natural key is Id and the secondary key is TimeStamp. Refer to the MapOutputKey schema here.
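To make the idea of the composite key concrete, here is a minimal sketch in plain Java, without Hadoop or Avro on the classpath. The `Key` class and the `sortAndGroup` method are hypothetical names invented for this illustration, not the schema or code used in the actual job. It only mimics what the shuffle does: sort by Id and then TimeStamp, and hand the values to the reducer grouped by Id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {
    // Hypothetical composite key: natural key (id) plus secondary key (timestamp).
    static final class Key {
        final String id;
        final long timestamp;

        Key(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // Sort order applied to the composite key: id first, then timestamp.
    static final Comparator<Key> SORT_ORDER =
            Comparator.comparing((Key k) -> k.id).thenComparingLong(k -> k.timestamp);

    // Sorts the keys the way the shuffle would, then groups them by id only.
    static Map<String, List<Long>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        for (Key k : sorted) {
            groups.computeIfAbsent(k.id, ignored -> new ArrayList<>()).add(k.timestamp);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("b", 30L),
                new Key("a", 20L),
                new Key("b", 10L),
                new Key("a", 5L));
        // Prints {a=[5, 20], b=[10, 30]}
        System.out.println(sortAndGroup(keys));
    }
}
```

Each Id appears once, and its timestamps come out in ascending order, which is the behaviour we want from the Reducer's iterator.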
The Driver, Mapper, and Reducer code used in this example is shown below.

Driver Class: holds the configuration needed to run the MapReduce job.

Mapper Class:
Reducer Class:
GroupingComparator Class: on the Reducer side, data should be grouped by the natural key (Id) only, not by the full composite key (Id + TimeStamp).
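The difference between the sort order and the grouping order can be sketched with two plain Java comparators. This is only an illustration under assumed names (`Key`, `SORT`, `GROUP`); it does not use the Hadoop comparator classes from the real job.

```java
import java.util.Comparator;

final class GroupingSketch {
    // Hypothetical composite key: natural key (id) plus secondary key (timestamp).
    static final class Key {
        final String id;
        final long timestamp;

        Key(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator: looks at the whole composite key (id, then timestamp).
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.id).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: looks at the natural key (id) only.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.id);

    public static void main(String[] args) {
        Key first = new Key("a", 5L);
        Key second = new Key("a", 20L);
        // The sort comparator orders the two keys by timestamp.
        System.out.println(SORT.compare(first, second) < 0);   // prints true
        // The grouping comparator treats them as the same group.
        System.out.println(GROUP.compare(first, second) == 0); // prints true
    }
}
```

Because the grouping comparator reports the two keys as equal, both records land in the same reduce call, while the sort comparator has already decided the order in which they arrive.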

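Partitioner Class: determines which reducer each key is sent to. For secondary sorting it must partition on the natural key (Id) only, so that all records with the same Id reach the same reducer.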
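A rough sketch of that partitioning rule in plain Java is shown below. The `partitionFor` method is a hypothetical helper written for this illustration; it follows the same hash-and-modulo idea that Hadoop's default HashPartitioner uses, but applies it to the Id alone and ignores the TimeStamp.

```java
final class PartitionSketch {
    // Picks a partition from the natural key only; the timestamp is deliberately ignored.
    static int partitionFor(String id, long timestamp, int numPartitions) {
        // Masking with Integer.MAX_VALUE keeps the hash non-negative before the modulo.
        return (id.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int early = partitionFor("a", 5L, 4);
        int late = partitionFor("a", 20L, 4);
        // Same id, different timestamps: both go to the same partition.
        System.out.println(early == late); // prints true
    }
}
```

If the partitioner hashed the whole composite key instead, records with the same Id but different timestamps could be spread over several reducers, and the grouping step would never see them together.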

The output of the MapReduce job is shown below. The job writes an Avro file; for viewing purposes we converted it to JSON using the Avro Tools jar. The output is grouped by Id and sorted by TimeStamp.
You can find the full source code for this example here.
Thanks for reading this post. Ciao until the next one.

For Further Reading:
Secondary Sort
Avro Job
