
Reduce-side joins in Java map-reduce

1.0. About reduce side joins


Joins of datasets done in the reduce phase are called reduce-side joins. Reduce-side joins are easier to
implement because their requirements are less stringent than those of map-side joins, which need the data
to be sorted and partitioned the same way. They are, however, less efficient than map-side joins because
the datasets have to go through the sort and shuffle phase.
What's involved:
1. The key of the map output, for the datasets being joined, has to be the join key - so that matching
records reach the same reducer.
2. Each dataset has to be tagged with its identity in the mapper - to help differentiate between the
datasets in the reducer, so they can be processed accordingly.
3. In each reducer, the data values from both datasets, for the keys assigned to that reducer, are available
to be processed as required.
4. A secondary sort needs to be done to ensure the ordering of the values sent to the reducer.
5. If the input files are of different formats, we would need separate mappers, and we would need to use
the MultipleInputs class in the driver to add the inputs and associate each input with its specific mapper
(a sketch follows this list).
[MultipleInputs.addInputPath( job, (input path n), (inputformat class), (mapper class n));]
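For illustration only, here is a minimal sketch of how MultipleInputs might be wired up in such a driver.
The input formats and mapper class names (EmployeeMapper, SalaryMapper) are hypothetical and are not part
of the sample program in this post, which uses a single mapper and a single input format:

// Assumed imports:
// import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
// import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Each call registers an input path together with its own input format and mapper;
// do not also call FileInputFormat.setInputPaths() for these paths.
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, EmployeeMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        SequenceFileInputFormat.class, SalaryMapper.class);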
Note: The join between the datasets (employee and current salary - cardinality of 1..1) in the sample
program below has also been demonstrated in my blog on map-side joins of large datasets. I have
used the same datasets here, as the purpose of this blog is to demonstrate the concept. Whenever
possible, reduce-side joins should be avoided.
[Update - 10/15/2013]
I have added a pig equivalent in the final section.

2.0. Sample datasets used in this gist


The datasets used are employees and salaries. For salary data, there are two files - one with the current
salary (1..1), and one with historical salary data (1..many). Then there is the department data, a small
reference dataset that we will add to the distributed cache and look up in the reducer.

3.0. Implementation of a reduce-side join


The sample code is common to both the 1..1 and the 1..many join for the sample datasets.
The mapper is common to both datasets, as the format is the same.

3.0.1. Components/steps/tasks:
1. Map output key
The key will be the empNo, as it is the join key for the employee and salary datasets
[Implementation: in the mapper]
2. Tagging the data with the dataset identity
Add an attribute called srcIndex to tag the identity of the data (1=employee, 2=salary, 3=salary history)
[Implementation: in the mapper]
3. Discarding unwanted attributes
[Implementation: in the mapper]

4. Composite key
Make the map output key a composite of empNo and srcIndex
[Implementation: create a custom writable]
5. Partitioner
Partition the data on the natural key of empNo
[Implementation: create a custom partitioner class]
6. Sorting
Sort the data on empNo first, and then on source index
[Implementation: create a custom sort comparator class]
7. Grouping
Group the data based on the natural key
[Implementation: create a custom grouping comparator class]
8. Joining
Iterate through the values for a key and complete the join of employee and salary data; perform a lookup
of the department to include the department name in the output
[Implementation: in the reducer]

3.0.2a. Data pipeline for cardinality of 1..1 between employee and salary data:

3.0.2b. Data pipeline for cardinality of 1..many between employee and salary data:

3.0.3. The Composite key


The composite key is a combination of the joinKey empNo and the source index (1=employee file,
2=current salary file, 3=historical salary file).
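The composite key class itself is not listed in this post, so here is a minimal sketch of what a
CompositeKeyWritableRSJ consistent with the rest of the code could look like (it must expose
setjoinKey/setsourceIndex and getjoinKey/getsourceIndex, and be usable as a map output key). The exact
field layout and serialization choices below are assumptions, not necessarily the author's original code.

package khanolkar.mapreduce.join.samples.reducesidejoin;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;

// Composite key: natural join key (empNo) + source index
// (1=employee, 2=current salary, 3=salary history)
public class CompositeKeyWritableRSJ implements Writable,
        WritableComparable<CompositeKeyWritableRSJ> {

    private String joinKey = "";
    private int sourceIndex = 1;

    public CompositeKeyWritableRSJ() {
    }

    public CompositeKeyWritableRSJ(String joinKey, int sourceIndex) {
        this.joinKey = joinKey;
        this.sourceIndex = sourceIndex;
    }

    public String getjoinKey() {
        return joinKey;
    }

    public void setjoinKey(String joinKey) {
        this.joinKey = joinKey;
    }

    public int getsourceIndex() {
        return sourceIndex;
    }

    public void setsourceIndex(int sourceIndex) {
        this.sourceIndex = sourceIndex;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        joinKey = WritableUtils.readString(in);
        sourceIndex = WritableUtils.readVInt(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeString(out, joinKey);
        WritableUtils.writeVInt(out, sourceIndex);
    }

    @Override
    public int compareTo(CompositeKeyWritableRSJ other) {
        // Natural ordering: joinKey first, then sourceIndex
        int result = joinKey.compareTo(other.getjoinKey());
        if (result == 0) {
            result = Integer.compare(sourceIndex, other.getsourceIndex());
        }
        return result;
    }

    @Override
    public String toString() {
        return joinKey + "," + sourceIndex;
    }
}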

3.0.4. The mapper


In the setup method of the mapper:
1. Get the filename from the input split and cross-reference it against the configuration (set in the
driver) to derive the source index. [Driver code: add configuration entries [key=filename of employee
dataset, value=1], [key=filename of current salary dataset, value=2], [key=filename of historical salary
dataset, value=3]]
2. Build a list of attributes we want to emit as map output for each data entity.
The setup method is called only once, at the beginning of a map task, so it is the logical place to
identify the source index.
In the map method of the mapper:
3. Build the map output based on the attributes required, as specified in the list from #2.
Note: For salary data, we are including the "effective till" date, even though it is not required in the
final output, because this is common code for a 1..1 as well as a 1..many join to salary data. If the
salary data is historical, we want the current salary only, that is, records with effective-till date =
9999-01-01.
//********************************************************************************
//Class:    MapperRSJ
//Purpose:  Mapper
//Author:   Anagha Khanolkar
//********************************************************************************

package khanolkar.mapreduce.join.samples.reducesidejoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MapperRSJ extends
        Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {

    CompositeKeyWritableRSJ ckwKey = new CompositeKeyWritableRSJ();
    Text txtValue = new Text("");
    int intSrcIndex = 0;
    StringBuilder strMapValueBuilder = new StringBuilder("");
    List<Integer> lstRequiredAttribList = new ArrayList<Integer>();

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {

        // {{
        // Get the source index (1=employee, 2=current salary, 3=historical salary)
        // Added as configuration in driver, keyed by filename
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        intSrcIndex = Integer.parseInt(context.getConfiguration().get(
                fsFileSplit.getPath().getName()));
        // }}

        // {{
        // Initialize the list of fields to emit as output based on
        // intSrcIndex (1=employee, 2=current salary, 3=historical salary)
        if (intSrcIndex == 1) // employee
        {
            lstRequiredAttribList.add(2); // FName
            lstRequiredAttribList.add(3); // LName
            lstRequiredAttribList.add(4); // Gender
            lstRequiredAttribList.add(6); // DeptNo
        } else // salary
        {
            lstRequiredAttribList.add(1); // Salary
            lstRequiredAttribList.add(3); // Effective-to-date (value of
                                          // 9999-01-01 indicates current salary)
        }
        // }}
    }

    private String buildMapValue(String arrEntityAttributesList[]) {
        // This method returns a csv list of values to emit based on data entity

        strMapValueBuilder.setLength(0); // Initialize

        // Build list of attributes to output based on source - employee/salary
        for (int i = 1; i < arrEntityAttributesList.length; i++) {
            // If the field is in the list of required output
            // append to stringbuilder
            if (lstRequiredAttribList.contains(i)) {
                strMapValueBuilder.append(arrEntityAttributesList[i]).append(
                        ",");
            }
        }
        if (strMapValueBuilder.length() > 0) {
            // Drop last comma
            strMapValueBuilder.setLength(strMapValueBuilder.length() - 1);
        }

        return strMapValueBuilder.toString();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        if (value.toString().length() > 0) {
            String arrEntityAttributes[] = value.toString().split(",");

            ckwKey.setjoinKey(arrEntityAttributes[0].toString());
            ckwKey.setsourceIndex(intSrcIndex);
            txtValue.set(buildMapValue(arrEntityAttributes));

            context.write(ckwKey, txtValue);
        }
    }
}


3.0.5. The partitioner


Even though the map output key is composite, we want to partition by the natural join key of empNo,
so a custom partitioner is in order.
//********************************************************************************
//Class:    PartitionerRSJ
//Purpose:  Custom partitioner
//Author:   Anagha Khanolkar
//********************************************************************************

package khanolkar.mapreduce.join.samples.reducesidejoin;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionerRSJ extends Partitioner<CompositeKeyWritableRSJ, Text> {

    @Override
    public int getPartition(CompositeKeyWritableRSJ key, Text value,
            int numReduceTasks) {
        // Partitions on joinKey (EmployeeID);
        // mask the sign bit so a negative hashCode cannot yield a negative partition
        return (key.getjoinKey().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

3.0.6. The sort comparator


To ensure that the input to the reducer is sorted on empNo first and then on sourceIndex, we need a sort
comparator. This guarantees that, for a given key, the employee data is the first set in the values list,
followed by the salary data.
package khanolkar.mapreduce.join.samples.reducesidejoin;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//********************************************************************************
//Class:    SortingComparatorRSJ
//Purpose:  Sorting comparator
//Author:   Anagha Khanolkar
//********************************************************************************

public class SortingComparatorRSJ extends WritableComparator {

    protected SortingComparatorRSJ() {
        super(CompositeKeyWritableRSJ.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // Sort on all attributes of composite key
        CompositeKeyWritableRSJ key1 = (CompositeKeyWritableRSJ) w1;
        CompositeKeyWritableRSJ key2 = (CompositeKeyWritableRSJ) w2;

        int cmpResult = key1.getjoinKey().compareTo(key2.getjoinKey());
        if (cmpResult == 0) // same joinKey
        {
            return Double.compare(key1.getsourceIndex(),
                    key2.getsourceIndex());
        }
        return cmpResult;
    }
}

3.0.7. The grouping comparator


This class is needed to indicate the group-by attribute - the natural join key of empNo.
package khanolkar.mapreduce.join.samples.reducesidejoin;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//********************************************************************************
//Class:    GroupingComparatorRSJ
//Purpose:  For use as grouping comparator
//Author:   Anagha Khanolkar
//********************************************************************************

public class GroupingComparatorRSJ extends WritableComparator {

    protected GroupingComparatorRSJ() {
        super(CompositeKeyWritableRSJ.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // The grouping comparator groups on the joinKey (Employee ID)
        CompositeKeyWritableRSJ key1 = (CompositeKeyWritableRSJ) w1;
        CompositeKeyWritableRSJ key2 = (CompositeKeyWritableRSJ) w2;
        return key1.getjoinKey().compareTo(key2.getjoinKey());
    }
}

3.0.8. The reducer


In the setup method of the reducer (called only once for the task), we check whether the side data - a map
file with department data - is in the distributed cache and, if found, initialize the map file reader.
In the reduce method, while iterating through the value list:
1. If the data is employee data (sourceIndex=1), we look up the department name in the map file using the
deptNo, which is the last attribute in the employee data, and append the department name to the employee
data.
2. If the data is historical salary data, we emit the salary only where the last attribute is '9999-01-01'.
Key point: we have set the sort comparator to sort on empNo and sourceIndex, and the sourceIndex of
employee data is lower than that of salary data, as set in the driver. Therefore, we are assured that the
employee data always comes first, followed by the salary data. So for each distinct empNo, we iterate
through the values, appending them, and emit the result as output.
//********************************************************************************
//Class:    ReducerRSJ
//Purpose:  Reducer
//Author:   Anagha Khanolkar
//********************************************************************************

package khanolkar.mapreduce.join.samples.reducesidejoin;

import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducerRSJ extends
        Reducer<CompositeKeyWritableRSJ, Text, NullWritable, Text> {

    StringBuilder reduceValueBuilder = new StringBuilder("");
    NullWritable nullWritableKey = NullWritable.get();
    Text reduceOutputValue = new Text("");
    String strSeparator = ",";
    private MapFile.Reader deptMapReader = null;
    Text txtMapFileLookupKey = new Text("");
    Text txtMapFileLookupValue = new Text("");

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {

        // {{
        // Get side data from the distributed cache
        Path[] cacheFilesLocal = DistributedCache.getLocalCacheArchives(context
                .getConfiguration());
        for (Path eachPath : cacheFilesLocal) {
            if (eachPath.getName().toString().trim()
                    .equals("departments_map.tar.gz")) {
                URI uriUncompressedFile = new File(eachPath.toString()
                        + "/departments_map").toURI();
                initializeDepartmentsMap(uriUncompressedFile, context);
            }
        }
        // }}
    }

    @SuppressWarnings("deprecation")
    private void initializeDepartmentsMap(URI uriUncompressedFile, Context context)
            throws IOException {
        // {{
        // Initialize the reader of the map file (side data)
        FileSystem dfs = FileSystem.get(context.getConfiguration());
        try {
            deptMapReader = new MapFile.Reader(dfs,
                    uriUncompressedFile.toString(), context.getConfiguration());
        } catch (Exception e) {
            e.printStackTrace();
        }
        // }}
    }

    private StringBuilder buildOutputValue(CompositeKeyWritableRSJ key,
            StringBuilder reduceValueBuilder, Text value) {

        if (key.getsourceIndex() == 1) {
            // Employee data
            // {{
            // Get the department name from the MapFile in distributedCache
            // Insert the joinKey (empNo) to beginning of the stringBuilder
            reduceValueBuilder.append(key.getjoinKey()).append(strSeparator);
            String arrEmpAttributes[] = value.toString().split(",");
            txtMapFileLookupKey.set(arrEmpAttributes[3].toString());
            try {
                deptMapReader.get(txtMapFileLookupKey, txtMapFileLookupValue);
            } catch (Exception e) {
                txtMapFileLookupValue.set("");
            } finally {
                txtMapFileLookupValue
                        .set((txtMapFileLookupValue.equals(null) || txtMapFileLookupValue
                                .equals("")) ? "NOT-FOUND"
                                : txtMapFileLookupValue.toString());
            }
            // }}

            // {{
            // Append the department name to the map values to form a complete
            // CSV of employee attributes
            reduceValueBuilder.append(value.toString()).append(strSeparator)
                    .append(txtMapFileLookupValue.toString())
                    .append(strSeparator);
            // }}
        } else if (key.getsourceIndex() == 2) {
            // Current salary data (1..1 on join key)
            // Just append the salary, drop the effective-to-date
            String arrSalAttributes[] = value.toString().split(",");
            reduceValueBuilder.append(arrSalAttributes[0].toString()).append(
                    strSeparator);
        } else // key.getsourceIndex() == 3; Historical salary data
        {
            // {{
            // Get the salary data but extract only current salary
            // (to_date='9999-01-01')
            String arrSalAttributes[] = value.toString().split(",");
            if (arrSalAttributes[1].toString().equals("9999-01-01")) {
                // Salary data; Just append
                reduceValueBuilder.append(arrSalAttributes[0].toString())
                        .append(strSeparator);
            }
            // }}
        }

        // {{
        // Reset
        txtMapFileLookupKey.set("");
        txtMapFileLookupValue.set("");
        // }}

        return reduceValueBuilder;
    }

    @Override
    public void reduce(CompositeKeyWritableRSJ key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {

        // Iterate through values; first set is csv of employee data,
        // second set is salary data; the data is already ordered
        // by virtue of secondary sort; append each value
        for (Text value : values) {
            buildOutputValue(key, reduceValueBuilder, value);
        }

        // Drop last comma, set value, and emit output
        if (reduceValueBuilder.length() > 1) {
            reduceValueBuilder.setLength(reduceValueBuilder.length() - 1);
            // Emit output
            reduceOutputValue.set(reduceValueBuilder.toString());
            context.write(nullWritableKey, reduceOutputValue);
        } else {
            System.out.println("Key=" + key.getjoinKey() + "src="
                    + key.getsourceIndex());
        }

        // Reset variables
        reduceValueBuilder.setLength(0);
        reduceOutputValue.set("");
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        deptMapReader.close();
    }
}
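The reducer expects the department lookup data to already exist as a MapFile in HDFS (archived as
departments_map.tar.gz). For completeness, here is a minimal, hypothetical sketch of how such a MapFile
could be built from a department CSV (deptNo,deptName); the class name, input layout, and use of the older
MapFile.Writer constructor (matching the reader used above) are assumptions, not part of the original post.

package khanolkar.mapreduce.join.samples.reducesidejoin;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Hypothetical utility: builds the departments MapFile (side data) from a CSV of deptNo,deptName.
// Keys must be appended in sorted order - MapFile.Writer enforces this.
public class DepartmentsMapFileBuilder {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path csvPath = new Path(args[0]);    // department CSV, sorted by deptNo
        Path mapFileDir = new Path(args[1]); // output MapFile directory, e.g. .../departments_map

        Text key = new Text();
        Text value = new Text();

        // Older (Hadoop 1.x style) constructor, matching the reader used in ReducerRSJ
        MapFile.Writer writer = new MapFile.Writer(conf, fs,
                mapFileDir.toString(), Text.class, Text.class);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                fs.open(csvPath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                key.set(fields[0]);   // deptNo
                value.set(fields[1]); // deptName
                writer.append(key, value);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}

The resulting MapFile directory would then be archived (tar/gzip) and uploaded to HDFS before the driver
adds it to the distributed cache.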

3.0.9. The driver


Besides the usual driver code, we are:
1. Adding the side data (department lookup data, in map file format, in HDFS) to the distributed cache.
2. Adding key-value pairs to the configuration, each key-value pair being filename and source index.
This is used by the mapper to tag data with its sourceIndex.
3. And lastly, associating all the various classes we created with the job.
//********************************************************************************
//Class:    DriverRSJ
//Purpose:  Driver for Reduce Side Join of two datasets
//          with a 1..1 or 1..many cardinality on join key
//Author:   Anagha Khanolkar
//********************************************************************************

package khanolkar.mapreduce.join.samples.reducesidejoin;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DriverRSJ extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        // {{
        // Exit job if required arguments have not been provided
        if (args.length != 3) {
            System.out
                    .printf("Three parameters are required for DriverRSJ- <input dir1> <input dir2> <output dir>\n");
            return -1;
        }
        // }}

        // {{
        // Job instantiation
        Job job = new Job(getConf());
        Configuration conf = job.getConfiguration();
        job.setJarByClass(DriverRSJ.class);
        job.setJobName("ReduceSideJoin");
        // }}

        // {{
        // Add side data to distributed cache
        DistributedCache
                .addCacheArchive(
                        new URI(
                                "/user/akhanolk/joinProject/data/departments_map.tar.gz"),
                        conf);
        // }}

        // {{
        // Set sourceIndex for input files;
        // sourceIndex is an attribute of the compositeKey,
        // to drive order, and reference source
        // Can be done dynamically; Hard-coded file names for simplicity
        conf.setInt("part-e", 1);  // Set Employee file to 1
        conf.setInt("part-sc", 2); // Set Current salary file to 2
        conf.setInt("part-sh", 3); // Set Historical salary file to 3
        // }}

        // {{
        // Build csv list of input files
        StringBuilder inputPaths = new StringBuilder();
        inputPaths.append(args[0].toString()).append(",")
                .append(args[1].toString());
        // }}

        // {{
        // Configure remaining aspects of the job
        FileInputFormat.setInputPaths(job, inputPaths.toString());
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        job.setMapperClass(MapperRSJ.class);
        job.setMapOutputKeyClass(CompositeKeyWritableRSJ.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(PartitionerRSJ.class);
        job.setSortComparatorClass(SortingComparatorRSJ.class);
        job.setGroupingComparatorClass(GroupingComparatorRSJ.class);

        job.setNumReduceTasks(4);
        job.setReducerClass(ReducerRSJ.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // }}

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new DriverRSJ(),
                args);
        System.exit(exitCode);
    }
}

4.0. The pig equivalent


Pig script-version 1:
/*************************************
Joining datasets in Pig
Employee..Salary = 1..many
Displaying most recent salary
Without using any join optimizations
**************************************/

rawEmpDS = load '/user/akhanolk/joinProject/data/employees_active/part-e' using PigStorage(',') as (empNo:chararray,dOB:chararray,lName:chararray,fName:chararray,gender:chararray,hireDate:chararray,deptNo:chararray);

empDS = foreach rawEmpDS generate empNo,fName,lName,gender,deptNo;

rawSalDS = load '/user/akhanolk/joinProject/data/salaries_history/part-sh' using PigStorage(',') as (empNo:chararray,salary:long,fromDate:chararray,toDate:chararray);

filteredSalDS = filter rawSalDS by toDate == '9999-01-01';

salDS = foreach filteredSalDS generate empNo, salary;

joinedDS = join empDS by empNo, salDS by empNo;

finalDS = foreach joinedDS generate empDS::empNo,empDS::fName,empDS::lName,empDS::gender,empDS::deptNo,salDS::salary;

store finalDS into '/user/akhanolk/joinProject/output/pig-RSJ';

Pig script-version 2 - eliminating the reduce-side join:


In this script, we filter on the most recent salary first, and then use Pig's merge join optimization (map-side), which can be leveraged when the inputs to the join are sorted.
rawEmpDS = load '/user/akhanolk/joinProject/data/employees_active/part-e' using PigStorage(',') as (empNo:chararray,dOB:chararray,lName:chararray,fName:chararray,gender:chararray,hireDate:chararray,deptNo:chararray);

empDS = foreach rawEmpDS generate empNo,fName,lName,gender,deptNo;

sortedEmpDS = ORDER empDS by empNo;

rawSalDS = load '/user/akhanolk/joinProject/data/salaries_history/part-sh' using PigStorage(',') as (empNo:chararray,salary:long,fromDate:chararray,toDate:chararray);

filteredSalDS = filter rawSalDS by toDate == '9999-01-01';

salDS = foreach filteredSalDS generate empNo, salary;

sortedSalDS = ORDER salDS by empNo;

joinedDS = join sortedEmpDS by empNo, sortedSalDS by empNo using 'merge';

finalDS = foreach joinedDS generate sortedEmpDS::empNo,sortedEmpDS::fName,sortedEmpDS::lName,sortedEmpDS::gender,sortedEmpDS::deptNo,sortedSalDS::salary;

store finalDS into '/user/akhanolk/joinProject/output/pig-RSJ';

Output:
**********************
Output of pig script
**********************
$ hadoop fs -cat joinProject/output/pig-RSJ/part* | less
10001   Facello   Georgi      M   d005   88958
10002   Simmel    Bezalel     F   d007   72527
10003   Bamford   Parto       M   d004   43311
10004   Koblick   Chirstian   M   d004   74057
.........

