Quora
www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public [http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public] -- a very good collection of links
Datamob
http://datamob.org/datasets
KDNuggets
http://www.kdnuggets.com/datasets/
Research Pipeline
http://www.researchpipeline.com/mediawiki/index.php?title=Main_Page
Delicious 1
https://delicious.com/pskomoroch/dataset
Delicious 2
https://delicious.com/judell/publicdata?networkaddconfirm=judell
15.2. Generic Repositories
InfoChimps
InfoChimps has a data marketplace with a wide variety of data sets.
InfoChimps marketplace [http://www.infochimps.com/marketplace]
Open Flights
Crowd-sourced flight data http://openflights.org/
Wikipedia data
wikipedia data [http://en.wikipedia.org/wiki/Wikipedia:Database_download]
OpenStreetMap.org
OpenStreetMap is a free worldwide map created by its users. The geo and map data are available
for download.
planet.openstreetmap.org [http://planet.openstreetmap.org/]
Geocomm
http://data.geocomm.com/drg/index.html
Geonames data
http://www.geonames.org/
US GIS Data
Available from http://libremap.org/
15.4. Web data
Wikipedia data
wikipedia data [http://en.wikipedia.org/wiki/Wikipedia:Database_download]
Freebase data
A variety of data is available from http://www.freebase.com/
US government data
data.gov [http://www.data.gov/]
UK government data
data.gov.uk [http://data.gov.uk/]
Aid information
http://www.aidinfo.org/data
UN Data
http://data.un.org/Explorer.aspx
Chapter 16. Big Data News and Links
16.1. News sites
datasciencecentral.com [http://www.datasciencecentral.com/]
www.scoop.it/t/hadoop-and-mahout [http://www.scoop.it/t/hadoop-and-mahout]
www.infoworld.com/t/big-data [http://www.infoworld.com/t/big-data]
highscalability.com [http://highscalability.com/]
gigaom.com/tag/big-data/ [http://gigaom.com/tag/big-data/]
techcrunch.com/tag/big-data/ [http://techcrunch.com/tag/big-data/]
16.2. Blogs from Hadoop vendors
blog.cloudera.com/blog [http://blog.cloudera.com/blog/]
hortonworks.com/blog [http://hortonworks.com/blog/]
www.greenplum.com/blog/tag/pivotal-hd [http://www.greenplum.com/blog/tag/pivotal-hd]
blogs.wandisco.com/category/big-data/ [http://blogs.wandisco.com/category/big-data/]
www.mapr.com/blog [http://www.mapr.com/blog]
Part III. Hadoop In Depth
This section will take a closer look at Hadoop internals.
Coming Soon:
'Hello world' in Map Reduce
Configurations
Counters
Distributed Cache
Chapter 17. HDFS NameNode
NameNode is the coordinator in HDFS. In this chapter we will look at the functions and internals of NameNode.
17.1. Functions of NameNode
NameNode maintains the file system meta data
NameNode contains all meta data about the file system. For example, the file system hierarchy, file permissions, etc. This information is vital for navigating the file system and accessing data.
Note
NameNode does not store the actual data of the file system. All file data is stored by Data Nodes.
NameNode has the location information for files in the cluster
As we know, in HDFS files are spread across the cluster. So how does one know which machine holds which parts of a file? This information is maintained by NameNode and is called 'block location data', because files are chopped into blocks.
This 'block map' is not persisted by NameNode. It dynamically builds the map from 'block reports' sent by Data Nodes. Each Data Node reports which blocks it has when it joins the cluster, and Data Nodes also periodically send updated block reports to NameNode. So NameNode always has accurate, up-to-date block information for the cluster.
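You can actually see this block location data from a client program. Below is a minimal sketch using the standard FileSystem API; the class name and the HDFS path are made up for illustration.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path path = new Path("/data/sample.txt");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(path);

        // NameNode answers this call from its in-memory block map
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + Arrays.toString(block.getHosts()));
        }
    }
}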
NameNode is a client's first point of contact
When a client wants to access a file, it contacts NameNode first. NameNode provides the details of the file
and redirects the client to the appropriate Data Nodes. Then the client contacts the Data Node to access
the file.
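Purely meta data operations, such as listing a directory, are answered by NameNode alone and never touch a Data Node. A small snippet-style sketch (the directory path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());

// served entirely from NameNode's in-memory meta data
for (FileStatus status : fs.listStatus(new Path("/data"))) {   // placeholder directory
    System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
}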
NameNode does not get involved in the data path
The NameNode serves as a reference point for files. When a client needs to read a file, it actually contacts the Data Nodes and streams data directly from them. The NameNode does not get involved in the data handling path. The reason is obvious: we don't want the NameNode to become a bottleneck.
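To make this concrete, here is a sketch of a client read. The only part that involves NameNode is the meta data lookup behind open(); the bytes themselves are streamed from the Data Nodes (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());

// open() asks NameNode where the blocks live (meta data only)
FSDataInputStream in = fs.open(new Path("/data/sample.txt"));   // placeholder path

// read() pulls the actual bytes straight from the Data Nodes
byte[] buffer = new byte[4096];
int n;
while ((n = in.read(buffer)) > 0) {
    System.out.write(buffer, 0, n);
}
in.close();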
17.2. NameNode Architecture
What data does the NameNode store?
We said the NameNode contains file system meta data. What exactly is it? There are two parts to it. First, it keeps information about the file system itself (the meta data), such as the file system hierarchy, permissions, etc. Second, it keeps the location information of where the files are.
So where is this information stored? NameNode keeps this data in memory so it can be accessed very quickly. But how does this information survive a NameNode restart? NameNode persists it on disk as well.
As we saw, NameNode is critical for cluster operation, so we cannot risk losing NameNode data.
We are already storing the data to disk. What if that disk fails?
We will store this data on two disks.
What if the machine that NameNode is running on catches fire and takes down both disks? We can write to remote storage as well. It could be an external filer that can be mounted by NFS. So a copy of the data is available even if the NameNode machine melts down!
So on production systems, NameNode is configured to save the data at multiple locations:
a disk on the NameNode host
another disk on the NameNode host
a remote filer (usually mounted by NFS)
This way, we can be sure the precious NameNode data survives disk or host failures.
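These locations are listed, comma separated, in the dfs.name.dir property of hdfs-site.xml (renamed dfs.namenode.name.dir in Hadoop 2). As a small illustrative sketch, the snippet below just prints what a given configuration resolves to; the config file location is an assumption, adjust it for your install.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));   // assumed location

// Hadoop 1 property name, falling back to the Hadoop 2 name
String nameDirs = conf.get("dfs.name.dir", conf.get("dfs.namenode.name.dir"));
System.out.println("NameNode meta data directories: " + nameDirs);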
So let's see how NameNode updates its data. The following diagram shows how file system data is modified.
1. a client makes a request (say, to create a file)
2. NameNode first updates disk images
Writes to disk 1
Writes to disk 2
Writes to remote disk / filer (via an NFS mount)
3. it then updates its in-memory data structures
4. after all this is done, it returns a SUCCESS code to the client
Why is the data persisted to disk before the in-memory data structures are updated? This way, even if the NameNode goes down mid-update, the meta data will be in a consistent state.
How is the meta data stored?
We learned NameNode saves meta data on disk. On a busy cluster there could be thousands of file system
operations (create, delete, move) per minute and, on a big cluster, the size of the meta data could approach
gigabytes. Updating such a large file would be slow and unwieldy. So how does the NameNode solve this?
The approach consists of Snapshots and EditLogs. The process is like this:
1. When NameNode boots up, it loads the latest Snapshot of the meta data.
2. As it handles file system changes, it writes them to a separate file called the EditLog. When the EditLog exceeds a certain size, a new EditLog file is created.
3. Next time NameNode boots, it loads the latest snapshot and replays the EditLogs until it is up to date
on file system status.
4. Once all EditLogs are replayed, then NameNode is ready to accept file system operations. They will
be recorded in EditLogs as well.
5. This process repeats many times during a NameNode's life cycle.
As you can imagine, the EditLogs can add up in size and replaying them each time NameNode
boots up can take longer and longer. It makes sense to create new snapshots out of EditLogs.
This is where the Secondary NameNode comes into play.
Secondary NameNode
A Secondary NameNode is another daemon that is responsible for merging EditLogs into Snapshots. The name 'secondary' can mislead people into thinking it is a 'backup' NameNode that can take over when NameNode goes down. That is not what Secondary NameNode does. It is actually a 'sidekick' for NameNode.
Here is what Secondary NameNode does:
1. Contacts NameNode and gets the latest Snapshot. It uses HTTP to communicate.
2. Also gets the EditLogs from NameNode.
3. Loads both the Snapshot and the EditLogs into memory, applies the EditLogs to the Snapshot, and creates a new Snapshot.
4. Uploads the newly minted Snapshot back to NameNode.
5. NameNode accepts the new Snapshot and replaces its current one.
6. This process repeats.
So why make Secondary NameNode another daemon instead of part of NameNode?
This is for two reasons:
To merge the Snapshot and EditLogs into a new Snapshot, they have to be loaded into memory. On a large cluster the image files can be quite large (gigabytes), and loading them can consume a large amount of memory. By moving this work to another host, we minimize the memory needs of NameNode.
Also, on a large, busy cluster the EditLogs can be large, and merging them can be very resource consuming. By offloading the merge to another host, NameNode is not overloaded.
On a large cluster (more than 25 nodes) it is advisable to run Secondary NameNode on a separate host. For smaller clusters, Secondary NameNode can run on the same node as NameNode.
Chapter 18. How MapReduce Works -- The Internals
This chapter looks 'under the hood' of Map Reduce and explains how things work.
18.1. Job Submission Process
We already described a map reduce job. Now let's look "under the hood" at what happens when we run a
job. There is actually quite a bit going on behind the scenes. Let's start with an overall submission process
diagram. Note that this covers classic Map Reduce, also known as MRv1.
Figure 18.1. Map Reduce Job Submission Process - Hadoop 1
Whoa, that is quite a lot; let's take it in stride.
Client Side : Step 1) Job Submission
Map Reduce jobs are submitted by a client.
Note
You don't need to log in to a Hadoop cluster to submit a job. Actually, in most cases Hadoop
admins won't allow users to log in to the cluster. Instead, gateway machines are set up that have
access to the Hadoop cluster. Users log in to the gateway machines and submit their jobs from there.
Step 2) Getting JobID
Each Map Reduce job gets assigned a unique ID. Clients contact the Job Tracker to grab a new jobID.
Step 3) Copying job resources to HDFS
Before running Map Reduce jobs on cluster nodes, we need to distribute a bunch of things to the nodes
themselves (remember the 'code to data' concept?). The following are some of the artifacts that get distributed:
The Map Reduce program jar file, which contains the compiled Java code.
Any other dependent jar files that are used by the Map Reduce job.
Any other files.
Step 4) Submit the job
The client submits the job to the Job Tracker.
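To make the client side concrete, here is roughly what a minimal classic (MRv1) driver looks like. The job name and the input/output paths are placeholders, and the identity mapper/reducer simply pass records through; runJob() performs steps 2-4 (gets a job ID, copies the job resources to HDFS, and submits the job to the Job Tracker) and then polls for progress.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("example-job");                          // placeholder name
        conf.setMapperClass(IdentityMapper.class);               // pass-through mapper
        conf.setReducerClass(IdentityReducer.class);             // pass-through reducer
        FileInputFormat.setInputPaths(conf, new Path("/input"));       // placeholder paths
        FileOutputFormat.setOutputPath(conf, new Path("/output"));

        // submits the job to the Job Tracker and waits for it to finish
        JobClient.runJob(conf);
    }
}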
Job Tracker side : Getting the Job going
Step 6) Job Splits
One important concept of Map Reduce is parallelism. The input is divided up and processed across the
cluster. Inputs are chopped up into what are called 'input splits'. Job Tracker next calculates input splits.
Step 7) Job is kicked off
Job Tracker starts the job and Task Trackers will start executing the job.
Tasktracker side : Executing the Job
Job Tracker doesn't run any jobs. Task Trackers actually execute the jobs.
Step 8) Getting Job Resources locally
Remember the job resources that were copied to HDFS by the client? The Tasktracker nodes now need to
copy these resources locally. The files are copied from HDFS to local storage, a 'local run directory' is
created, and the contents are copied there and un-jarred.
Step 9) Spinning up the map or reduce job
Now that all job resources are locally available, the actual job is ready to be run. Tasktracker spins up a
separate Java Virtual Machine (JVM) and runs map or reduce programs. So why is map/reduce run in a
separate VM? This way any errors in map/reduce code don't kill the Tasktracker.
Step 10) Heartbeats -- reporting back to Job Tracker
Tasktracker periodically sends updates to Job Tracker. This is called the 'heartbeat'. The Heartbeat contains
information like job status, resource usage, etc. Job Tracker collects all reports from Task Trackers and
figures out how the job is progressing.
18.2. Measuring Progress of Tasks
Map Reduce tasks report their progress. Tasks that don't report progress for a while can be marked as
'hung' and killed off, so it is important for a task to keep updating its progress. Progress is reported by
the following activities (see the sketch after this list for the manual calls):
Reading an input record (progress updated automatically, no user intervention required).
Writing an output record (automatic).
Incrementing counters (manual, user has to do this explicitly in map reduce code).
Setting status message (manual).
Explicitly setting progress status (manual).
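Here is a sketch of the manual calls in an old-API (MRv1) mapper; the counter group/name and the status text are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ProgressMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // counter: incremented explicitly by the map code
        reporter.incrCounter("myapp", "records.seen", 1);        // made-up counter names

        // status message: shows up in the Job Tracker UI
        reporter.setStatus("processing record at offset " + key.get());

        // explicit progress ping; useful inside long-running loops
        reporter.progress();

        output.collect(new Text("lines"), new LongWritable(1));
    }
}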
18.3. Dealing With Task Failures
While running tasks on a cluster of commodity machines, some things are bound to fail. Machines fail, disks
go bad, etc. Luckily Map Reduce is built to handle failures.
How is a task failure detected?
An exception thrown by a map or reduce program. The child JVM notifies Tasktracker of this failure.
The exception is then propagated up the chain to Job Tracker.
The child JVM exits abnormally. Say it runs out of memory, etc.
Hanging tasks - at times map or reduce tasks become unresponsive. The task could be a legitimately long-running one
(crawling web sites) or stuck in a loop. If a task hasn't updated its progress in a while (10 minutes), then
that task is marked as non-responsive and noted as failed.
This timeout interval can be set by the mapred.task.timeout configuration parameter. If it is set to zero,
timeouts are disabled (in other words, Map Reduce never marks hanging tasks as failed).
So what happens when a task fails? The Tasktracker detects the failure and sends the failure up the chain
to Job Tracker. Job Tracker will retry the task. The retry is attempted on a node other than the one where
the task just failed. The reasoning is that the node may be failing (hardware failure, etc.), so Job Tracker
tries the task somewhere else.
By default a failed task is retried 3 times. If it keeps failing, then the entire job is marked as failed. A coding error (e.g.
a NullPointerException) can make a task fail consistently, eventually failing the job.
The maximum number of attempts is controlled by 'mapred.map.max.attempts' and
'mapred.reduce.max.attempts'.
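As a sketch, these knobs can also be set per job through JobConf; the values shown below are the stock defaults.

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();

// hung-task timeout in milliseconds; 0 disables the timeout entirely
conf.setInt("mapred.task.timeout", 600000);   // 10 minutes

// how many attempts a task gets before the whole job is failed
conf.setMaxMapAttempts(4);      // mapred.map.max.attempts
conf.setMaxReduceAttempts(4);   // mapred.reduce.max.attempts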
18.4. Job Scheduling
A large cluster may be shared by many users. Users submit jobs, and many of these jobs could be running on
the cluster at the same time, competing for resources, open map/reduce slots, etc. Map Reduce
schedulers manage how this is orchestrated. There are three schedulers that can be used.
FIFO (First In First Out) Scheduler: This is the default scheduler for Hadoop 1. The concept is pretty simple: jobs from all users are slotted into a queue, and they are executed in the order they arrive.
FIFO has a serious problem in a multi-user environment. Imagine that a job that will run for 10 hours is currently running on the cluster. Then another user submits a quick job that would be done in 10 minutes. If we stick to 'first come first served', the short job will be stuck behind the long-running job for a very long time.
Fair Scheduler: The Fair Scheduler was developed by Facebook. It is designed to effectively share the cluster across various users. This scheduler allocates cluster time 'fairly' between multiple users, so in our previous example the short job will make progress along with the long-running job.
The Fair Scheduler manages this by creating a 'work pool' for each user. Each user's tasks are queued in their own pool, and the scheduler makes sure each pool receives enough execution time. If there is excess capacity (open map or reduce slots), it is shared among the pools.
Capacity Scheduler: The Capacity Scheduler was developed by Yahoo. It shares a lot of qualities with the Fair Scheduler. However, it provisions multiple users across the cluster using 'queues' instead of the pools used by the Fair Scheduler.
Chapter 19. Hadoop Versions
Let's take a moment to explore the different versions of Hadoop. The following are (or were) the best-known
versions.
0.20 -- A widely used release series. This branch eventually became the Hadoop 1.0 release.
0.23 - Feb 2012 -- A branch created to add new features. This branch eventually became Hadoop 2.0
1.0 -- Current production version of Hadoop
2.0 -- Current development version of Hadoop
19.1. Hadoop version 1.0
This is currently the production version of Hadoop. It has been in wide use for a while and has been proven
in the field. The following distributions are based on Hadoop 1.0:
Cloudera's CDH 3 (Cloudera's Distribution of Hadoop) series
Hadoop Versions
76
HortonWorks's HDP 1 (HortonWorks Data Platform) series
19.2. Hadoop version 2.0
This is the development branch of Hadoop. It has been under development for a while and brings significant
new enhancements. Hadoop 2 has the following new features:
HDFS failover
Federated NameNode
Map Reduce version 2 (MRV2) also known as YARN
The following distributions currently bundle Hadoop 2:
Cloudera's CDH4 (Cloudera's Distribution of Hadoop) series.
HortonWorks HDP2 (Hortonworks Data Platform) series.
19.3. History of Hadoop
Chapter 20. How Hadoop version 2 'looks' different from version 1
This chapter details how Hadoop version 2 'looks and feels' different from version 1. To learn about the
functional differences between version 1 and version 2, see ???.
20.1. Start/Stop Scripts
These are the files used to start/stop Hadoop daemons: start-dfs.sh, stop-dfs.sh, start-mapred.sh, stop-mapred.sh, start-all.sh, stop-all.sh, hadoop-daemon.sh, hadoop-daemons.sh.
In Hadoop v1 the start/stop scripts are in the HADOOP_HOME/bin dir. In Hadoop v2 they are in the HADOOP_HOME/sbin dir.
Note: in the following tables "HH" stands for the Hadoop install directory.
Tar Package Version
Table 20.1. Start / Stop scripts for tar package version
to start HDFS
  Hadoop 1: $ HH/bin/start-dfs.sh (or per daemon: $ HH/bin/hadoop-daemon.sh start namenode)
  Hadoop 2: $ HH/sbin/start-dfs.sh (or per daemon: $ HH/sbin/hadoop-daemon.sh start namenode)
to start Map Reduce
  Hadoop 1: $ HH/bin/start-mapred.sh
  Hadoop 2: $ HH/sbin/start-yarn.sh
to start everything
  Hadoop 1: $ HH/bin/start-all.sh
  Hadoop 2: $ HH/sbin/start-all.sh
RPM Version
Table 20.2. Start / Stop scripts for rpm package version
to start HDFS
  Hadoop 1, on the NameNode: sudo service hadoop-0.20-namenode start
  Hadoop 1, on each Data Node: sudo service hadoop-0.20-datanode start
  Hadoop 2, on the NameNode: sudo service hadoop-hdfs-namenode start
  Hadoop 2, on each Data Node: sudo service hadoop-hdfs-datanode start
to start Map Reduce
  Hadoop 1, on the Job Tracker: sudo service hadoop-0.20-jobtracker start
  Hadoop 1, on each Task Tracker: sudo service hadoop-0.20-tasktracker start
  Hadoop 2, on the Job Tracker node:
    sudo service hadoop-yarn-resourcemanager restart
    sudo service hadoop-mapreduce-historyserver start
    sudo service hadoop-yarn-proxyserver start
  Hadoop 2, on each worker node: sudo service hadoop-yarn-nodemanager start
20.2. Hadoop Command Split
In v1 a single bin/hadoop executable is used for file system operations, administration, and map reduce
operations. In v2 these are handled by separate binaries.
Table 20.3. Hadoop Command Split
File system operations
  Hadoop 1: $ HH/bin/hadoop dfs -ls
  Hadoop 2: $ HH/bin/hdfs dfs -ls
NameNode operations
  Hadoop 1: $ HH/bin/hadoop namenode -format
  Hadoop 2: $ HH/bin/hdfs namenode -format
File system admin commands
  Hadoop 1: $ HH/bin/hadoop dfsadmin -refreshNodes
  Hadoop 2: $ HH/bin/hdfs dfsadmin -refreshNodes
20.3. Config Files
Sample files: core-site.xml, hdfs-site.xml, mapred-site.xml
In Hadoop v1 the config files are in the HADOOP_HOME/conf directory. In Hadoop v2 they are in the
HADOOP_HOME/etc/hadoop directory. The files are all intact at their new location.
20.4. Lib Dir Split
In Hadoop v1 all dependency jars are in the lib dir. In v2 they are broken out into sub directories.
HH/share/hadoop/common
HH/share/hadoop/hdfs
HH/share/hadoop/mapreduce
HH/share/hadoop/tools
HH/share/hadoop/yarn
Each of these directories also has a lib sub folder for its own dependencies. This means there may be
duplicate copies of some jars. For example,
find hadoop-2 -name "log4j*.jar" yields:
hadoop-2/share/hadoop/common/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/hdfs/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/yarn/lib/log4j-1.2.17.jar
This is okay, as now each component (hdfs, mapreduce, yarn) can have its dependencies in a clean way.
The split of jar files means that we have to change our run scripts as well. Here is an example of running a Java program
that connects to HDFS:
java -cp my.jar:$hadoop_home/share/hadoop/hdfs/*:$hadoop_home/share/hadoop/hdfs/lib/*:\
$hadoop_home/share/hadoop/common/*:$hadoop_home/share/hadoop/common/lib/* \
MyClass
20.5. Running TestDFSIO
in v1
$ hadoop jar HH/hadoop-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
in v2
$ hadoop jar HH/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 1GB
Part IV. Hadoop Administration
Coming Soon:
Part V. Hadoop Cookbook
This section has specific guides and recipes.
Chapter 21. Handy Classes
Note
The original article was published by Sujee Maniyam at his website: here [http://sujee.net/tech/
articles/hadoop/hadoop-useful-classes/]
These are some handy 'utility' classes bundled with Hadoop.
21.1. Executing Shell Commands
class : org.apache.hadoop.util.Shell / org.apache.hadoop.util.Shell.ShellCommandExecutor
jar : hadoop1 : HADOOP_HOME/lib/hadoop-core.jar
jar : hadoop 2 : HADOOP_HOME/share/hadoop/common/hadoop-common.jar
Why would we want to execute an external executable from within a Hadoop client or a map reduce job? Perhaps
we have a program that converts images into another format (jpg --> png). Rather than rewriting it in
Java, we can use this class to invoke the command:
import org.apache.hadoop.util.Shell.ShellCommandExecutor;
String[] cmd = { "ls", "/usr" };
ShellCommandExecutor shell = new ShellCommandExecutor(cmd);
shell.execute();
System.out.println("* shell exit code : " + shell.getExitCode());
System.out.println("* shell output: \n" + shell.getOutput());
/* output:
* shell exit code : 0
* shell output:
X11
X11R6
bin
etc
include
lib
libexec
llvm-gcc-4.2
local
sbin
share
stand alone
*/
21.2. StringUtils
class : org.apache.hadoop.util.StringUtils
jar : hadoop 1 : HADOOP_HOME/lib/hadoop-core.jar
jar : hadoop 2 : HADOOP_HOME/share/hadoop/client/lib/hadoop-common.jar
StringUtils.byteDesc() : User-friendly / human-readable
byte lengths
How many megabytes is 1000000 bytes? This will tell you.
import org.apache.hadoop.util.StringUtils;
// --- human readable byte lengths -----
System.out.println ("1024 bytes = " + StringUtils.byteDesc(1024));
System.out.println ("67108864 bytes = " + StringUtils.byteDesc(67108864));
System.out.println ("1000000 bytes = " + StringUtils.byteDesc(1000000));
/* produces:
1024 bytes = 1 KB
67108864 bytes = 64 MB
1000000 bytes = 976.56 KB
*/
StringUtils.byteToHexString() : Convert Bytes to Hex
strings and vice-versa
We deal with byte arrays a lot in Hadoop / map reduce. This is a handy way to print them when debugging issues:
import org.apache.hadoop.util.StringUtils;
// ----------------- String Utils : bytes <--> hex ---------
String s = "aj89y1_xxy";
byte[] b = s.getBytes();
String hex = StringUtils.byteToHexString(b);
byte[] b2 = StringUtils.hexStringToByte(hex);
String s2 = new String(b2);
System.out.println(s + " --> " + hex + " <-- " + s2);
/* output:
aj89y1_xxy --> 616a383979315f787879 <-- aj89y1_xxy
*/
StringUtils.formatTime() : human readable elapsed time
How long is 1000000 ms? See below:
System.out.println ("1000000 ms = " + StringUtils.formatTime(1000000));
long t1 = System.currentTimeMillis();
// execute a command
long t2 = System.currentTimeMillis();
t2 += 10000000; // adjust for this demo
System.out.println ("operation took : " + StringUtils.formatTimeDiff(t2, t1));
/* output:
1000000 ms = 16mins, 40sec
operation took : 2hrs, 46mins, 40sec
*/
21.3. Hadoop Cluster Status
class : org.apache.hadoop.mapred.ClusterStatus
jar : hadoop 1 : HADOOP_HOME/lib/hadoop-core.jar
jar : hadoop 2 : HADOOP_HOME/share/hadoop/client/lib/hadoop-mapreduce-client-core.jar
Find out how many nodes are in the cluster, how many mappers, reducers, etc.
package hi;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/*
compile into a jar and execute:
hadoop jar a.jar hi.TestCluster
output:
number of task trackers: 1
maximum map tasks : 3
currently running map tasks : 0
*/
public class TestCluster extends Configured implements Tool
{
public static void main(String[] args) throws Exception
{
int result = ToolRunner.run(new Configuration(), new TestCluster(), args);
System.exit(result);
}
@Override
public int run(String[] arg0) throws Exception
{
JobConf dummyJob = new JobConf(new Configuration(), TestCluster.class);
JobClient jobClient = new JobClient(dummyJob);
ClusterStatus clusterStatus = jobClient.getClusterStatus();
System.out.println("number of task trackers: " + clusterStatus.getTaskTrackers());
System.out.println("maximum map tasks : " + clusterStatus.getMaxMapTasks());
System.out.println("currently running map tasks : " + clusterStatus.getMapTasks());
return 0;
}
}
Chapter 22. Handy Classes for HBase
Note
The original article was published by Sujee Maniyam at his website: here [http://sujee.net/tech/
articles/hadoop/hadoop-useful-classes/]
These are some handy 'utility' classes bundled with HBase.
22.1. Bytes
class : org.apache.hadoop.hbase.util.Bytes
jar : HBASE_HOME/lib/hbase-*.jar
Handy utility for dealing with bytes and byte arrays:
Converting objects to bytes
import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

// -------- Bytes.toBytes() : converting objects to bytes ------
int i = 10000;
byte [] intBytes = Bytes.toBytes(i);
System.out.println ("int " + i + " in bytes : " + Arrays.toString(intBytes));
float f = (float) 999.993;
byte [] floatBytes = Bytes.toBytes(f);
System.out.println ("float " + f + " in bytes : " + Arrays.toString(floatBytes));
String s = "foobar";
byte [] stringBytes = Bytes.toBytes(s);
System.out.println ("string " + s + " in bytes : " + Arrays.toString(stringBytes));
/* output:
int 10000 in bytes : [0, 0, 39, 16]
float 999.993 in bytes : [68, 121, -1, -115]
string foobar in bytes : [102, 111, 111, 98, 97, 114]
*/
Bytes.add() : create composite keys
import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

// ---------- Bytes.add() : creating composite keys ---------
// rowkey = int + long
int i = 0;
long timestamp = System.currentTimeMillis();
byte [] rowkey2 = Bytes.add(Bytes.toBytes(i), Bytes.toBytes(timestamp));
System.out.println ("rowkey2 (" + rowkey2.length + ") : " + Arrays.toString(rowkey2));
// add also supports adding 3 byte arrays
// rowkey = int + str + long
byte[] rowkey3 = Bytes.add(Bytes.toBytes(0) , Bytes.toBytes("hello"), Bytes.toBytes(timestamp));
System.out.println ("rowkey3 (" + rowkey3.length + ") : " + Arrays.toString(rowkey3));
/* output:
rowkey2 (12) : [0, 0, 0, 0, 0, 0, 1, 50, 8, -101, 99, -41]
rowkey3 (17) : [0, 0, 0, 0, 104, 101, 108, 108, 111, 0, 0, 1, 50, 8, -101, 99, -41]
*/