Cloudwick Technologies
Copyright 2012
Counters are a useful channel for gathering statistics about a job, whether for quality control or for
application-level statistics.
They are also useful for problem diagnosis.
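As a sketch of the quality-control use, consider counting malformed records. In a real mapper you would increment a user-defined (enum-based) counter through the task context; the standalone Java below mimics that bookkeeping with a plain EnumMap so the idea can be shown without a cluster. The enum and delimiter check are illustrative, not part of the Hadoop API.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class CounterSketch {
    // In Hadoop, this enum would name a user-defined counter group.
    enum RecordQuality { VALID, MALFORMED }

    // Stand-in for counter increments done inside a map() method.
    static Map<RecordQuality, Long> countRecords(List<String> records) {
        Map<RecordQuality, Long> counters = new EnumMap<>(RecordQuality.class);
        counters.put(RecordQuality.VALID, 0L);
        counters.put(RecordQuality.MALFORMED, 0L);
        for (String r : records) {
            // Treat a record as malformed if it lacks the expected tab delimiter.
            RecordQuality q = r.contains("\t") ? RecordQuality.VALID
                                               : RecordQuality.MALFORMED;
            counters.merge(q, 1L, Long::sum);
        }
        return counters;
    }
}
```

In a real job the framework aggregates these per-task counts into job-level totals visible in the job history and web UI.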
Built-in Counters:
Hadoop maintains a set of built-in counters for every job, reporting metrics such as the number of map input records and the amount of data read and written. Users can also define their own counters.
Although sorting happens during the shuffle and sort phase, there are several ways to achieve and control
it.
Partial Sort:
The default MapReduce job sorts the input records by key. If there are 30 reducers, 30 sorted
files will be generated. These files cannot simply be concatenated to produce a globally sorted file.
Total Sort:
Use only one reducer. This works, but it is very inefficient for large files.
Alternatively, use a Partitioner that respects the total order of the output. For example, if we had four partitions,
we could put keys for temperatures less than -10°C in the first partition, those between -10°C and
0°C in the second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.
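The four-partition scheme above can be sketched as a standalone partition function. The boundaries (-10, 0, 10) are the example's; in a real job this logic would live in a custom Partitioner (or you would use TotalOrderPartitioner with sampled split points), so the class below is an illustration, not Hadoop code.

```java
class TemperaturePartitioner {
    // Partition boundaries chosen so that concatenating the reducer
    // outputs in partition order yields a globally sorted file.
    static int getPartition(int temperatureCelsius) {
        if (temperatureCelsius < -10) return 0; // less than -10°C
        if (temperatureCelsius < 0)  return 1;  // -10°C up to 0°C
        if (temperatureCelsius < 10) return 2;  // 0°C up to 10°C
        return 3;                               // 10°C and over
    }
}
```

Note that fixed boundaries like these only balance the reducers if the keys are evenly spread; TotalOrderPartitioner samples the input to pick boundaries that are.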
Secondary Sort:
For any particular key, the values are not sorted.
Use a composite key made of the key and the value, and use a Partitioner that partitions by the key part of the composite key.
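A minimal sketch of that composite key, in plain Java rather than Hadoop Writables (the class name and fields are illustrative): sorting compares the natural key first and then the value, while partitioning looks only at the natural key, so all values for a key still reach the same reducer but arrive sorted.

```java
class CompositeKey implements Comparable<CompositeKey> {
    final String naturalKey; // the real grouping key
    final int value;         // the value folded into the key for sorting

    CompositeKey(String naturalKey, int value) {
        this.naturalKey = naturalKey;
        this.value = value;
    }

    // Sort order: natural key first, then value.
    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : Integer.compare(value, other.value);
    }

    // Partition on the natural key only, so all values for a key
    // land in the same partition regardless of the value part.
    static int getPartition(CompositeKey k, int numPartitions) {
        return (k.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In a real Hadoop job you would also set a grouping comparator on the natural key so one reduce() call sees all values for that key.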
MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is
fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level
framework such as Pig, Hive, or Cascading, in which join operations are a core part of the
implementation.
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the
reducer it is called a reduce-side join.
A map-side join between large inputs works by performing the join before the data reaches the map
function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular
way. Each input dataset must be divided into the same number of partitions, and it must be sorted by
the same key (the join key) in each source. All the records for a particular key must reside in the same
partition.
Use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to run a map-side
join.
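Because each pair of partitions is already sorted by the join key, the join itself reduces to a single linear merge pass, which is what CompositeInputFormat performs before records reach the map function. The standalone Java below sketches that merge step as an inner join; it assumes unique join keys per side (duplicate keys would need a nested advance) and its names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class MergeJoin {
    // Each record is a {joinKey, payload} pair; both lists must be
    // sorted by joinKey, mirroring the sorted partitions a map-side
    // join requires. Assumes unique keys per side, for brevity.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;           // left key has no match yet
            else if (cmp > 0) j++;      // right key has no match yet
            else {                      // keys match: emit joined record
                out.add(left.get(i)[0] + "\t" + left.get(i)[1]
                        + "\t" + right.get(j)[1]);
                i++; j++;
            }
        }
        return out;
    }
}
```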
A reduce-side join is less efficient than a map-side join because both datasets have to go through the MapReduce shuffle. The basic
idea is that the mapper tags each record with its source and uses the join key as the map output key, so
that records with the same key are brought together in the reducer.
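The tag-and-group idea can be sketched without a cluster: a grouped map simulates the shuffle bringing tagged records together by join key, and the "reducer" step pairs the two sources. The "U:"/"O:" tags, dataset names, and the one-record-per-key-per-source assumption are all illustrative simplifications, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoin {
    // users and orders play the two input datasets, keyed by join key.
    // Assumes at most one record per key per source, for brevity.
    static Map<String, String[]> join(Map<String, String> users,
                                      Map<String, String> orders) {
        // "Map" phase + shuffle simulation: tag each record with its
        // source and group the tagged values by join key.
        Map<String, List<String>> grouped = new TreeMap<>();
        users.forEach((k, v) ->
            grouped.computeIfAbsent(k, x -> new ArrayList<>()).add("U:" + v));
        orders.forEach((k, v) ->
            grouped.computeIfAbsent(k, x -> new ArrayList<>()).add("O:" + v));

        // "Reduce" phase: emit a joined record only when both sources
        // contributed a value for the key (an inner join).
        Map<String, String[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String user = null, order = null;
            for (String tagged : e.getValue()) {
                if (tagged.startsWith("U:")) user = tagged.substring(2);
                else order = tagged.substring(2);
            }
            if (user != null && order != null)
                out.put(e.getKey(), new String[]{user, order});
        }
        return out;
    }
}
```

In a real job the source tag typically travels in the value (or in a composite key combined with secondary sort, so one source's records arrive first and need not be buffered).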
Side data can be defined as extra read-only data needed by a job (map or reduce tasks) to
process the main dataset. The challenge is to make side data available to all the map or
reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
Example of side data:
Lookup tables
Dictionaries
Standard configuration values
It is possible to cache side data in memory in a static field, so that tasks of the same job that
run in succession on the same tasktracker can share the data.
You can set arbitrary key-value pairs in the job configuration using the various setter methods
on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass
a small piece of metadata to your tasks, but it does not scale to larger side data.
Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop's distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when they
run. To save network bandwidth, files are normally copied to any particular node once
per job.
Transfer happens behind the scenes before any task is executed
Note: DistributedCache is read-only
Files in the DistributedCache are automatically deleted from slave nodes when the job
finishes
Implementation:
Place the files into HDFS
Configure the DistributedCache in your driver code
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/tmp/lookup.txt"), job);
DistributedCache.addFileToClassPath(new Path("/tmp/abc.jar"), job);
DistributedCache.addCacheArchive(new URI("/tmp/xyz.zip"), job);
or
$ hadoop jar myjar.jar MyDriver -files file1,file2,file3,...