
Bayesian Counters 0.1.0
Development Environment Tutorial For CentOS 6.3
x86_64 Workstation
By Alex Kozlov and Chris Poulin
Testing by Daniel Rule and Ken Krugler
January 31 2013
Contact: Chris Poulin: Chris@patternsandpredictions.net


Table of Contents
1 Introduction
2 The Audience
3 The Goal
4 Provisioning
4.1 Virus Risk Warning
4.2 Software Archive Warning
4.3 Development Workstation
4.4 HBase Cluster
4.5 Network
5 Conventions
6 Bayesian Counters 0.1.0 Development Environment
6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.
6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin account using Firefox or other browser in the Gnome Desktop.
6.3 Within Cloudera Manager, Hosts -> Add Host
6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/hbase1-clientconfig.zip
6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/mapreduce1-clientconfig.zip
6.6 Open a Terminal (All shell commands going forward will be executed in Terminal)
6.7 Point hbase shell to h12.demo.dev
6.8 Test HBase connectivity and hbase shell
6.9 Install the Development Tools group
6.10 Install Apache Maven 3.0.X
6.11 Install Apache Maven 2.2.X
6.12 Install asciidoc doxygen help2man, source-highlight and Python 2.7.3
6.13 Install Eclipse Juno (in future, newer is probably OK, but optimal repeatability with Juno)
6.14 Append to the end of /home/poulin/.bashrc
6.15 Open a 2nd Terminal and Start Eclipse Juno
6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.
6.17 Install M2E extension for Eclipse Juno
6.18 Close Eclipse
6.19 Obtain a copy of bcounts-0.1.0-SNAPSHOT-project.tar.bz2 from Cloudera and save to /home/poulin/
6.20 Prepare bcounts-0.1.0-SNAPSHOT for execution from Eclipse and CLI
6.21 Check M2_REPO classpath variable set in Eclipse
6.22 In Eclipse: File->Import...->(expand) General -> (highlight) Existing Projects into Workspace -> (click) Next -> and specify root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if pasted)
6.23 In Eclipse: Window->Preferences->Java->Code Style->Formatter
6.24 Click Import and navigate to /home/poulin/bcounts-0.1.0-SNAPSHOT/eclipse_formatter_apache.xml Then Click Apply and then Click OK and then close Eclipse.
6.25 Create schema for Bayesian Counters examples in HBase
6.26 Load Iris data into HBase via CLI
6.27 Load Iris data into HBase via Eclipse
6.28 View Iris data and schema in HBase
6.29 Perform NB inference on the Iris dataset
6.30 Perform clique scoring with random projections
6.31 Create small delta of the ad.data file
6.32 Load Ad data into HBase via Eclipse
6.33 Perform NB inference on the Ad dataset Run->Run Configurations...
6.34 Create bag of words file from configuration file
6.35 Edit /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/sp_schema.py Change from: if len(sys.argv)<3 or sys.argv[1] is None:
6.36 Create an XML Configuration file derived from a bag-of-words file
6.37 Convert testing files into header-less files for storing in HDFS
6.38 Generate a 'scored_' file in current directory
6.39 Create small delta of sp-training-file
6.40 Load small sample of SP into HBase via Eclipse
6.41 Perform NB inference on the SP dataset Run->Run Configurations...
7 Bayesian Counters Javadoc
8 Additional Resources Technologies Used In this Tutorial

2013 Patterns and Predictions (Poulin Holdings, LLC) All Rights Reserved. Confidential. Reproduction or redistribution

1 Introduction
Bayesian counters (B-counts) is a framework for online, near-real-time model building and prediction. It can be used
to identify correlations in the data, and as a library for responding to unusual or rare events. The underlying
technology for B-counts is HBase, a highly scalable and fault-tolerant key-value map storage engine. The solution can
scale to thousands of nodes and billions of features. The initial prediction algorithm is Naive Bayes (NB). The
framework is currently being extended to incorporate Nearest Neighbors (NN) and general Bayesian Network (BN)
learning algorithms.

2 The Audience
The steps in this tutorial are highly detailed and aim for optimal repeatability at the time of this writing. However,
the audience must have Linux literacy, whether by experience, formal training or education, and a strong understanding
of computer and network security. This tutorial does not cover the statistical analysis aspects of the solution.

3 The Goal
Preparing a development environment is usually a complex task but leads to powerful results and strong capabilities.
This tutorial will attempt to make this task as painless and repeatable as possible.

4 Provisioning
4.1 Virus Risk Warning
It is the responsibility of the customer to check every download mentioned in this document: verify signatures,
run MD5 checks and virus scans, and take any other steps needed to ensure that no download poses a risk to the
customer's trusted network. It is also the customer's responsibility to ensure that network security, firewalls and
network-level port blocking are correctly configured for the trusted network specified in this document. Both the
servers and the network referenced in this tutorial must be provisioned entirely for the purpose of learning from,
experimenting with and completing this tutorial, and must never be adjacent to, or in any way share resources with,
production or otherwise mission-critical environments.
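
The MD5 check mentioned above can be scripted. The following is a minimal sketch and not part of the original tutorial: the function names file_digest and matches_published_digest are hypothetical, and the expected digest is assumed to come from the publisher of the download. MD5 is shown because the text names it; passing "sha256" selects a stronger hash through the same standard-library call.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex, algorithm="md5"):
    """Compare a file's digest with the value published for the download."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

A download should only be unpacked or installed when matches_published_digest returns True for the digest published on the project's own site. A checksum detects corruption, but it does not replace the signature verification the text also requires.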

4.2 Software Archive Warning
It is the responsibility of the customer to maintain archives of all software specified in this document, as the URLs,
URIs, IP addresses or other external references specified in this document may become invalid at any time without
notice.


4.3 Development Workstation
The Workstation should be provisioned with the following:
1. CentOS Linux 6.3 x86_64 Workstation
2. jdk-6u31-linux-x64-rpm or a newer version of jdk-6. Note: Do not use jdk-6u18
3. 16GB RAM & 2 Cores minimum hardware allocation
4. The hostname of the workstation for this tutorial is expected to be permanently assigned as h13.demo.dev
5. h13.demo.dev should be configured to use a certified copy of both the CentOS and EPEL repos
6. The workstation will need an example account called poulin with the following permissions:
a. Account poulin must be able to sudo with root-level credentials on h13.demo.dev
b. Account poulin must be able to log into the Gnome desktop of h13.demo.dev, either directly (in the case
of a local bare-metal, VMware or VirtualBox installation) or via a VNC client over an SSH tunnel (if the
workstation is hosted on AWS or another trusted remote cloud/VPS, dedicated hosting service or
datacenter).

4.4 HBase Cluster
1. Navigate to https://ccp.cloudera.com/display/DOC/Documentation
2. Download and archive a copy of all documents under:
a. Cloudera Manager 4.1 Enterprise Edition Documentation
b. Cloudera Manager 4.1 Free Edition Documentation
3. Follow the steps outlined in CM-4.1-free-installation-guide.pdf to provision a pseudo-distributed cluster in the
same trusted network as the Development Workstation h13.demo.dev. The cluster must consist of:
a. A single host with the assigned hostname h12.demo.dev
b. The single host should be installed with CM4.1 as well as HBase and all dependent roles via the CM4.1
UI.


4.5 Network
Both the HBase Cluster and the Workstation should be on the same trusted network, with:
- Two and only two hosts existing on the trusted network:
  o h12.demo.dev (pseudo-distributed HBase cluster & Cloudera Manager 4.1)
  o h13.demo.dev (CentOS Linux 6.3 x86_64 Workstation)
- All external inbound ports blocked for connections into the trusted network, except for SSH from other trusted
  locations only.
- The workstation must be able to connect out to the internet on HTTP, HTTPS, FTP and SFTP.
- No ports should be blocked between the HBase Cluster and the Workstation within the trusted network.
- Both h13.demo.dev and h12.demo.dev must have a permanent static IP address and hostname.
- The IP address of both h13.demo.dev and h12.demo.dev must support reverse lookup to hostname.
- The date and time on both h13.demo.dev and h12.demo.dev must be permanently in sync.
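
The reverse-lookup requirement above can be checked before continuing. The following is a minimal sketch, not part of the original tutorial: the function name reverse_lookup_ok is hypothetical, and the resolver parameters exist only so the check can be exercised without a live network. By default it uses the standard-library calls socket.gethostbyname and socket.gethostbyaddr.

```python
import socket


def reverse_lookup_ok(hostname, resolve=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """Return True if hostname -> IP address -> hostname round-trips."""
    address = resolve(hostname)
    # gethostbyaddr returns (primary name, alias list, address list)
    primary, aliases, _addresses = reverse(address)
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    return hostname.lower() in names
```

Calling reverse_lookup_ok("h12.demo.dev") and reverse_lookup_ok("h13.demo.dev") on each of the two hosts should return True. With the default resolvers, an unresolvable name or address raises an OSError subclass instead of returning False.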

5 Conventions
Single line boxes delimit commands to execute in the Terminal
# example


Dashed line boxes delimit some or all of the contents of a text file
# example


Double line boxes delimit some or all standard output
# example


Wave line boxes delimit hbase shell
# example


Candy cane boxes delimit overview of logic
# example


6 Bayesian Counters 0.1.0 Development Environment


6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.
Note: Gnome Desktop comes with CentOS Linux 6.3 x86_64 Desktop install

6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin
account using Firefox or other browser in the Gnome Desktop.


6.3 Within Cloudera Manager, Hosts -> Add Host
This will start the Add Hosts Wizard. Follow the instructions of the wizard to add h13.demo.dev
to the cluster, but do not add any roles to h13.demo.dev, as we do not want the development
host to participate as part of the cluster. When complete, h13.demo.dev will show up in the
Hosts list.

6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client
Configuration and save to /home/poulin/Desktop/hbase1-clientconfig.zip

6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download
Client Configuration and save to /home/poulin/Desktop/mapreduce1-clientconfig.zip


6.6 Open a Terminal (All shell commands going forward will be executed in Terminal)

6.7 Point hbase shell to h12.demo.dev


su -l
cd /home/poulin/Desktop/
unzip ./hbase1-clientconfig.zip
cp /home/poulin/Desktop/hbase-conf/* /etc/hbase/conf/
rm -fr /home/poulin/Desktop/hbase-conf

6.8 Test HBase connectivity and hbase shell


su poulin
echo "create 'mytest', 'cf1'" | hbase shell
echo "put 'mytest', 'row1', 'cf1', 'val1'" | hbase shell
echo "put 'mytest', 'row1', 'cf1', 'val2'" | hbase shell
echo "scan 'mytest'" | hbase shell
date --date="Fri Nov 11 11:11:11 PST 2011" +%s
# start hbase shell
hbase shell
org.apache.hadoop.hbase.util.Bytes.toString("abcde".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toInt("abcde".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toInt("\xa\xb\xc\xd".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toInt("\x61\x62\x63\x64".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toInt("\141\142\143\144".to_java_bytes)
import java.text.SimpleDateFormat
import java.text.ParsePosition
SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("11/11/11 11:11:11", ParsePosition.new(0)).getTime()
exit

=> "abcde"
=> 1633837924
=> 168496141
=> 1633837924
=> 1633837924
=> 1321038671000
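
The integer results above are consistent with Bytes.toInt reading the first four bytes of the array as a big-endian 32-bit integer, which is why "abcde" and "abcd" produce the same value. The following sketch, which is not part of the original tutorial, reproduces the numbers outside of hbase shell; the helper name to_int is hypothetical. The last value assumes the JVM runs in a UTC-8 (PST) time zone, which is what the output shown above implies.

```python
import struct
from datetime import datetime, timedelta, timezone


def to_int(data):
    """Interpret the first four bytes as a big-endian signed 32-bit integer."""
    return struct.unpack(">i", data[:4])[0]


print(to_int(b"abcde"))             # 1633837924 (only "abcd" is read)
print(to_int(b"\x0a\x0b\x0c\x0d"))  # 168496141
print(to_int(b"\x61\x62\x63\x64"))  # 1633837924 (hex escapes for "abcd")
print(to_int(b"\141\142\143\144"))  # 1633837924 (octal escapes for "abcd")

# 11:11:11 on 11 November 2011 at a fixed UTC-8 offset, in milliseconds
pst = timezone(timedelta(hours=-8))
moment = datetime(2011, 11, 11, 11, 11, 11, tzinfo=pst)
print(int(moment.timestamp()) * 1000)  # 1321038671000
```

Python requires two hex digits per escape, so the single-digit Ruby escapes "\xa\xb\xc\xd" are written here as "\x0a\x0b\x0c\x0d".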



6.9 Install the Development Tools group

su -l
yum groupinstall "Development tools"

6.10 Install Apache Maven 3.0.X


su -l
cat > /etc/yum.repos.d/epel-apache-maven.repo << EOF
[epel-apache-maven]
name=maven from apache foundation.
baseurl=http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-6Server/x86_64/
enabled=1
skip_if_unavailable=1
gpgcheck=0
EOF
cat > /etc/yum.repos.d/epel-apache-maven-source.repo << EOF
[epel-apache-maven-source]
name=maven from apache foundation. Source
baseurl=http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-6Server/SRPMS
enabled=0
skip_if_unavailable=1
gpgcheck=0
EOF
yum update yum
yum install apache-maven
ln -s /usr/share/apache-maven/bin/mvn /usr/bin/mvn3
# close and reopen terminal

# confirm Apache Maven 3.0.X
mvn3 --version

6.11 Install Apache Maven 2.2.X


su -l
cd /usr/share/
# Reminder: check signature of download and check maven site for alternate mirror if link broken
wget http://apache.petsads.us/maven/maven-2/2.2.1/binaries/apache-maven-2.2.1-bin.tar.gz
tar -xzf apache-maven-2.2.1-bin.tar.gz
touch /usr/bin/mvn2
chmod ugo+rx /usr/bin/mvn2
vi /usr/bin/mvn2

Edit /usr/bin/mvn2
#!/bin/bash
MAVEN_HOME=/usr/share/apache-maven-2.2.1
M2_HOME=$MAVEN_HOME
PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_HOME
export M2_HOME
export PATH
/usr/share/apache-maven-2.2.1/bin/mvn "$@"

6.12 Install asciidoc doxygen help2man, source-highlight and Python 2.7.3


su -l
# Reminder: Downloads from the standard repos automatically do a
# signature check if the /etc/yum.repos.d directory is configured correctly.
yum install asciidoc doxygen help2man
yum install boost-devel
cd /tmp/
curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz.sig
curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz
# Reminder: Confirm signature is OK before running or installing anything.
# Reminder: If mirror downloads fail, locate an alternate mirror
# by searching the main site of the project.
tar xzvf source-highlight-3.1.7.tar.gz
cd source-highlight-3.1.7
./configure
make
make install
cd ~/
rm -fr /tmp/source-highlight-3.1.7*
cd /usr/lib/
wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz
wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz.asc
tar xzvf Python-2.7.3.tgz && cd /usr/lib/Python-2.7.3
./configure
make
# Note: the following line is not a typo and it really must be altinstall
make altinstall
chmod ugo+rx /usr/lib/Python-2.7.3
ln -s /usr/lib/Python-2.7.3/python /usr/bin/python273
python273 --version
ln -s /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar /usr/lib/hadoop/hadoop-core.jar

6.13 Install Eclipse Juno (in future, newer is probably OK, but optimal repeatability with Juno)

Navigate to http://www.eclipse.org/downloads/?osType=linux

Download Eclipse IDE for Java EE Developers "Linux 64 Bit" to /home/poulin/Desktop/


su -l
mv /home/poulin/Desktop/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz /usr/lib/
chown root:root /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz
cd /usr/lib/
tar -xvf ./eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz
ln -s /usr/lib/eclipse/eclipse /usr/bin/eclipse
rm -fr /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz

6.14 Append to the end of /home/poulin/.bashrc


Close and re-open the Terminal after adding these lines.
export M2_OPTS="-server -Xms256m -Xmx512m"
export PATH=${PATH}:/home/poulin/bcounts-0.1.0-SNAPSHOT/bin
export CLASSPATH=`hbase classpath`
export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HBASE_HOME=/usr/lib/hbase
source /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/bcount-config.sh 2>> /dev/null
# Note: the following should be a single line
export BAYESIANCOUNTERS_CLASSPATH=conf:target/classes:~/.m2/repository/net/sf/trove4j/trove4j/3.0.1/*:~/.m2/repository/net/sf/opencsv/opencsv/2.3/*:~/.m2/repository/com/google/guava/guava/10.0.1/*

6.15 Open a 2nd Terminal and Start Eclipse Juno


su poulin
eclipse

6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.

6.17 Install M2E extension for Eclipse Juno
Note: this will update /home/poulin/.eclipse directory
- help -> eclipse marketplace -> Search tab -> find (Maven Integration for Eclipse) (enter key)
- Under "Maven Integration for Eclipse" click Install.
- Check all boxes under Confirm Selected Features if they are not already checked and Click Next.
- If you accept the Eclipse Foundation Software User Agreement, Check the acceptance and Click Finish
- When prompted to restart Eclipse Click Yes.

6.18 Close Eclipse


6.19 Obtain a copy of bcounts-0.1.0-SNAPSHOT-project.tar.bz2 from Cloudera and save to /home/poulin/

6.20 Prepare bcounts-0.1.0-SNAPSHOT for execution from Eclipse and CLI
su poulin
# Reminder: any previously cached Maven downloads for the poulin account
# will be deleted later in this step (rm -fr /home/poulin/.m2)

cd /home/poulin/
tar jxf bcounts-0.1.0-SNAPSHOT-project.tar.bz2
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/

cp /home/poulin/Desktop/*-clientconfig.zip /home/poulin/bcounts-0.1.0-SNAPSHOT/
unzip hbase1-clientconfig.zip
unzip mapreduce1-clientconfig.zip
cp ./hbase-conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
rm -fr ./mapreduce1-clientconfig.zip
rm -fr ./hbase1-clientconfig.zip
rm -fr ./hbase-conf
rm -fr ./hadoop-conf
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/* /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old/
cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.1.jar /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/
# First build with Maven2
wget https://builds.apache.org/job/mrunit-trunk/ws/target/mrunit-1.0.0-SNAPSHOT-hadoop1.jar
rm -fr /home/poulin/.m2
# Install mrunit-1.0.0 into ~/.m2
mvn2 -DskipTests install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-20130107.225915-832 -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar
mvn2 install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-SNAPSHOT -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar
# if the following hangs for more than 5 minutes without output, ctrl-c and then re-run it
# can ignore: [INFO] Unable to find resource *
mvn2 -DskipTests install
# redo the build with mvn3
rm -fr /home/poulin/.m2
mvn3 -DskipTests install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-20130107.225915-832 -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar
mvn3 install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-SNAPSHOT -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

# if the following hangs for more than 5 minutes without output, ctrl-c and then re-run it
mvn3 -DskipTests install
# run without -DskipTests switch
mvn3 install
# make maven project loadable into eclipse
mvn3 -Declipse.workspace=/home/poulin/workspace eclipse:configure-workspace
mvn3 -DdownloadSources=true -DdownloadJavadocs=true eclipse:clean eclipse:eclipse
mvn3 dependency:build-classpath

# Optional: Building a job jar only
mvn3 package
# Optional: Packaging the Source to /home/poulin/bcounts-0.1.0-SNAPSHOT/target/
mvn3 assembly:single
# Optional: Generating JavaDoc only
mvn3 javadoc:javadoc

6.21 Check M2_REPO classpath variable set in Eclipse
- Open a new Terminal and start up Eclipse

su poulin
eclipse

- In Eclipse: Window->preferences->java->build path->classpath variables
- Once M2_REPO is observed, Click Cancel back to the parent Eclipse window

6.22 In Eclipse: File->Import...->(expand) General -> (highlight) Existing Projects into Workspace
-> (click) Next -> and specify root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if
pasted)
Ensure that bcounts is checked in the Projects box and Click Finish


6.23 In Eclipse: Window->Preferences->Java->Code Style->Formatter

6.24 Click Import and navigate to /home/poulin/bcounts-0.1.0-SNAPSHOT/eclipse_formatter_apache.xml
Then Click Apply and then Click OK and then close Eclipse.

6.25 Create schema for Bayesian Counters examples in HBase


su poulin
# Note: each of the following echo commands should be a single line
echo "create 'sp', {NAME => '5min', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 1209600, BLOCKCACHE => false}" | hbase shell
echo "create 'iris', {NAME => '5min', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 432000, BLOCKCACHE => false}" | hbase shell
echo "create 'ad', {NAME => '5min', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 432000, BLOCKCACHE => false}" | hbase shell
echo "create 'car', {NAME => '5min', VERSIONS => 2, TTL => 300, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 2, TTL => 1800, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 2, TTL => 259200, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 2, TTL => 432000, BLOCKCACHE => false}" | hbase shell
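
The TTL values in the create statements above are retention periods in seconds: 86400 is one day, 259200 three days, 432000 five days, 604800 seven days and 1209600 fourteen days. The following sketch, which is not part of the original tutorial, shows one way to generate such a statement from readable durations. The helper names ttl_seconds and create_statement are hypothetical, and the output is only the text that would be piped to hbase shell.

```python
def ttl_seconds(days=0, hours=0, minutes=0):
    """Convert a retention period to the number of seconds used for TTL."""
    return ((days * 24 + hours) * 60 + minutes) * 60


def create_statement(table, families, versions=1):
    """Build an hbase shell create command from (family name, ttl) pairs."""
    specs = [
        "{NAME => '%s', VERSIONS => %d, TTL => %d, BLOCKCACHE => false}"
        % (name, versions, ttl)
        for name, ttl in families
    ]
    return "create '%s', %s" % (table, ", ".join(specs))


# Same column families and retention periods as the iris table above
iris_families = [
    ("5min", ttl_seconds(days=1)),
    ("30min", ttl_seconds(days=1)),
    ("1day", ttl_seconds(days=1)),
    ("All", ttl_seconds(days=5)),
]
print(create_statement("iris", iris_families))
```

Generating the statement keeps each command on a single line and makes the retention periods easier to review than raw second counts.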



6.26 Load Iris data into HBase via CLI


su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT
bcount com.cloudera.bayesiancounters.util.Driver loader examples/data/iris.data iris
echo "scan 'iris'" | hbase shell


Load Iris data into HBase
The iris data loaded into hbase is rectangular and newline delimited, in the format:
<count-delta>,<count-delta>,<count-delta>,<classifier><newline>
During the load, the counts in hbase are incremented.
The human-readable meaning and schema of iris.data can be found in the Iris section of
bayesiancounters-site.xml, which was added to a CLASSPATH in prior steps.
For a production pipeline, repeat this iris load at a regular interval with new deltas,
or bind the UI directly to the hbase calls used by the loader code.
The loader logic can be studied in Eclipse by modifying the following section of
this tutorial:
Change from: Run->Run Configurations
Change to: Run->Debug Configurations
Then place a check mark next to "Stop in main" and step through the code.
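
The increment-on-load behaviour described above can be illustrated without HBase. The following is a minimal sketch under stated assumptions, not the loader's actual code: an in-memory Counter stands in for the hbase table, the (classifier, column index) key layout is invented for illustration, the deltas are assumed to be integers, and the function name apply_deltas is hypothetical.

```python
from collections import Counter


def apply_deltas(lines, counts=None):
    """Add each row's count deltas to running per-class, per-column totals."""
    counts = Counter() if counts is None else counts
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # last field is the classifier, everything before it is a delta
        *deltas, label = line.split(",")
        for column, delta in enumerate(deltas):
            counts[(label, column)] += int(delta)
    return counts


# Two loads of invented rows: the second load adds to the first
totals = apply_deltas(["1,0,2,a", "0,1,1,b"])
totals = apply_deltas(["2,2,0,a"], totals)
print(totals[("a", 0)], totals[("a", 2)], totals[("b", 1)])
```

Because every load only adds deltas, repeating the load at a regular interval, as the text suggests for a production pipeline, keeps accumulating counts instead of replacing them.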

6.27 Load Iris data into HBase via Eclipse
- Open a new Terminal and start up Eclipse

su poulin
eclipse

- Run->Run Configurations...
- Java Application-> (right click) -> New
- Click on the New_configuration to edit its settings on the right
- Configure the runtime options as:
  o Name: IrisLoad
  o Main->Project: bcounts
  o Main->Main class: com.cloudera.bayesiancounters.util.Driver
  o Arguments->Program arguments: loader /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/iris.data iris
  o Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties
- Click on Apply, then Click on Run and then view the Console tab of the parent window
- Once the data is loaded, Close Eclipse


6.28 View Iris data and schema in HBase


# view some of the data
echo "scan 'iris'" | hbase shell | tail
# view the schema via the Stargate interface
firefox http://h12.demo.dev:8080/iris/schema

6.29 Perform NB inference on the Iris dataset


Note: NB inference must be executed within 300 seconds of loading the iris data into HBase.
While testing, you can instead change the 300 in the following steps to a larger number of seconds.

Open a new Terminal and start up Eclipse

su poulin
eclipse

Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: IrisInference

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: nb iris 300 "sepal_length=5;petal_length=1.4" class=2


Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

Click on Apply, then Click on Run and then view the Console tab of the parent window

Wait for results in the Console tab and execution to complete.



NB inference on the Iris dataset
Opens connection to the iris table
Loads iris classifications from bayesiancounters-site.xml into local memory
Moves columns in hbase between tiers, e.g. T5MIN, T30MIN, etc. while computing scores
and tracking parent and child counts
The logic is derived from naive Bayes classifier theory
The resulting scores, counts and probabilities are displayed to standard output
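
As an illustration of the naive Bayes theory that the inference step draws on, the sketch below computes class probabilities from raw counts. It is not the bcounts implementation: the count layout, the add-one smoothing, and every name in it are assumptions made for the example, and the counts are invented.

```python
import math

def nb_posteriors(class_counts, feature_counts, observed, n_values):
    """Return P(class | observed) from counts, using add-one smoothing.

    class_counts:   {class: number of rows with that class}
    feature_counts: {(class, feature, value): number of matching rows}
    observed:       {feature: value} for the query
    n_values:       {feature: number of distinct values the feature can take}
    """
    total = sum(class_counts.values())
    log_scores = {}
    for cls, n_cls in class_counts.items():
        if n_cls == 0:
            continue  # a class with no rows has zero prior probability
        score = math.log(n_cls / total)  # log prior
        for feature, value in observed.items():
            hits = feature_counts.get((cls, feature, value), 0)
            score += math.log((hits + 1) / (n_cls + n_values[feature]))
        log_scores[cls] = score
    # Normalise in a numerically stable way.
    peak = max(log_scores.values())
    weights = {cls: math.exp(s - peak) for cls, s in log_scores.items()}
    norm = sum(weights.values())
    return {cls: w / norm for cls, w in weights.items()}

# Invented counts for two classes and one binary feature "f".
class_counts = {"a": 8, "b": 2}
feature_counts = {("a", "f", 1): 6, ("a", "f", 0): 2, ("b", "f", 0): 2}
posteriors = nb_posteriors(class_counts, feature_counts, {"f": 1}, {"f": 2})
print(posteriors)
```

Because only counts are needed, the same calculation can be repeated after every load without revisiting the original records.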


The probabilities produced by scoring can be used directly in more complex decision-making algorithms based
on benefit/loss analysis.
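
One simple form of benefit/loss analysis is to choose the action with the highest expected payoff under the scored probabilities. The Python sketch below shows the arithmetic; the action names, payoff values, and probabilities are invented, and nothing here is taken from bcounts itself.

```python
def best_action(probabilities, payoff):
    """Pick the action with the highest expected payoff.

    probabilities: {class: P(class)}
    payoff:        {action: {class: benefit (positive) or loss (negative)}}
    """
    expected = {
        action: sum(probabilities[cls] * value for cls, value in by_class.items())
        for action, by_class in payoff.items()
    }
    return max(expected, key=expected.get), expected

# Invented example: acting pays 10 when the class is "positive", costs 2 otherwise.
probabilities = {"positive": 0.3, "negative": 0.7}
payoff = {
    "act": {"positive": 10.0, "negative": -2.0},
    "skip": {"positive": 0.0, "negative": 0.0},
}
action, expected = best_action(probabilities, payoff)
print(action, expected)
```

Here acting is preferred even though "positive" is the less likely class, because the benefit outweighs the more probable small loss.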

6.30 Perform clique scoring with random projections


Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: CliqueRandom

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: cr iris 300 2 3

Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

Click on Apply, then Click on Run and then view the Console tab of the parent window

Wait for results in the Console tab and execution to complete.


Clique scoring can be used to perform variable importance analysis or for emerging trend identification.

6.31 Create small delta of the ad.data file


su poulin
head -n 1 /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/ad.data > /tmp/ad.small
tail -n 1 /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/ad.data >> /tmp/ad.small



6.32 Load Ad data into HBase via Eclipse


Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: AdLoad

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: loader /tmp/ad.small ad

Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties -Xmx1024M

Click on Apply, then Click on Run and then view the Console tab of the parent window

6.33 Perform NB inference on the Ad dataset


Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: AdInference

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: nb ad 604800 "sepal_length=5;petal_length=1.4" class=2

Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

Click on Apply, then Click on Run and then view the Console tab of the parent window

Wait for results in the Console tab and execution to complete.

Close Eclipse


Note: These results are from bcounts on 2 lines of the input data only. We recommend using a small or
medium-sized cluster to process the entire ad.data file. See the Cloudera Manager documentation for
cluster size specifications.

6.34 Create bag of words file from configuration file


su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
python273 ./bin/sp_bag_of_words.py ./conf/bayesiancounters-site.xml /tmp/bag-of-words
tail /tmp/bag-of-words


worker
working
wreckage
xvi
yates
young
SP_increase

6.35 Edit /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/sp_schema.py


Change from: if len(sys.argv)<3 or sys.argv[1] is None:
Change to: if len(sys.argv)<2 or sys.argv[1] is None:
if len(sys.argv)<2 or sys.argv[1] is None:
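
The change matters because the next section invokes sp_schema.py with a single argument, the bag-of-words path. In that case sys.argv holds two entries (the script name plus one argument), so a check against 3 rejects the call. The Python sketch below reproduces only the condition; the helper name is hypothetical and the rest of the script is not shown.

```python
def rejects(argv, minimum):
    """Return True when the usage check would refuse to run.

    Mirrors `len(sys.argv) < minimum or sys.argv[1] is None`.
    """
    return len(argv) < minimum or argv[1] is None

one_argument = ["./bin/sp_schema.py", "/tmp/bag-of-words"]

print(rejects(one_argument, 3))  # original check
print(rejects(one_argument, 2))  # corrected check
print(rejects(["./bin/sp_schema.py"], 2))  # no argument at all
```

With the original threshold the one-argument call is refused; with the corrected threshold it is accepted, while a call with no arguments is still refused.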

6.36 Create an XML Configuration file derived from a bag-of-words file


su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
python273 ./bin/sp_schema.py /tmp/bag-of-words > /tmp/bayesiancounters-example.xml
tail /tmp/bayesiancounters-example.xml



<value>SP_increase</value>
</property>
<property>
<name>bayesiancounters.dataset.sp.col.valueset.647</name>
<value>-100, -40, 10, 40, 100</value>
</property>
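
To inspect the generated file programmatically, the properties can be read with Python's standard XML parser. This is a sketch under assumptions: the enclosing <configuration> root element is assumed (the tail output above shows only property elements), and the valueset is assumed to hold integers. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }

def parse_valueset(value):
    """Turn a comma-separated valueset string into a list of integers."""
    return [int(item) for item in value.split(",")]

# The property is copied from the tail output; the root element is assumed.
sample = """<configuration>
<property>
<name>bayesiancounters.dataset.sp.col.valueset.647</name>
<value>-100, -40, 10, 40, 100</value>
</property>
</configuration>"""

props = read_properties(sample)
print(parse_valueset(props["bayesiancounters.dataset.sp.col.valueset.647"]))
```

To run it against the real file, read /tmp/bayesiancounters-example.xml into a string and pass that to read_properties instead of the sample.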

6.37 Convert training files into header-less files for storing in HDFS

su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
python273 ./bin/sp_training.py /tmp/bag-of-words \
./examples/data/training_19_2004-18_2005.dat > /tmp/sp-training-file
tail -c 32 /tmp/sp-training-file


0,0,0,0,0,0,0,0,0,0,0,0,0,0,7.2

6.38 Generate a 'scored_' file in current directory


su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
python273 ./bin/sp_testing.py /tmp/bag-of-words ./examples/data/testing_19_2005-19_2005.dat
tail -c 32 ./scored_testing_19_2005-19_2005.dat


0,1,0,0,0,0,0,0,2,2,2,1,1,3,6,0

6.39 Create small delta of sp-training-file


su poulin
head -n 1 /tmp/sp-training-file > /tmp/sp-training.small
tail -n 1 /tmp/sp-training-file >> /tmp/sp-training.small

6.40 Load small sample of SP into HBase via Eclipse


Open a new Terminal and start up Eclipse

su poulin
eclipse

Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: SpLoad

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: loader /tmp/sp-training.small sp

Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

Click on Apply, then Click on Run and then view the Console tab of the parent window

6.41 Perform NB inference on the SP dataset


Run->Run Configurations...

Java Application-> (right click) -> New

Click on the New_configuration to edit its settings on the right

Configure the runtime options as:

Name: SpInference

Main->Project: bcounts

Main->Main class: com.cloudera.bayesiancounters.util.Driver

Arguments->Program arguments: nb sp 604800 "sepal_length=5;petal_length=1.4" class=2

Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

Click on Apply, then Click on Run and then view the Console tab of the parent window

Wait for results in the Console tab and execution to complete.

Close Eclipse

Note: These results are from bcounts on 2 lines of the input data only. We recommend using a small or
medium-sized cluster to process the entire SP training file. See the Cloudera Manager documentation for
cluster size specifications.



In this example, B-counts can be used to predict the effect of news on stock market movements
(US Patent No. 7,516,050).

7 Bayesian Counters Javadoc


su poulin
firefox file:///home/poulin/bcounts-0.1.0-SNAPSHOT/target/site/apidocs/index.html

8 Additional Resources: Technologies Used in this Tutorial


1. Apache Hadoop Training and Certification - http://university.cloudera.com/
2. Apache Maven Project - http://maven.apache.org/
3. Bash Reference Manual - http://www.gnu.org/software/bash/manual/bashref.html
4. Bayesian Counters on HBase - http://www.slideshare.net/Hadoop_Summit/bayesian-counters
5. CentOS Wiki - http://wiki.centos.org/
6. CentOS 6.X x86_64 (Same image for VM, cloud/VPS or bare metal) - http://isoredirect.centos.org/centos/6/isos/x86_64/
7. Cloudera Documentation - https://ccp.cloudera.com/display/DOC/Documentation
8. Eclipse Documentation Current Release Eclipse Juno - http://help.eclipse.org/juno/index.jsp
9. EPEL Documentation - http://fedoraproject.org/wiki/EPEL
10. Gnome [Desktop] Library, Users, Administrators and Developers - http://library.gnome.org/

11. GNU Source-highlight 3.1.7 - http://www.gnu.org/software/src-highlite/source-highlight.html
12. Python v2.7.3 documentation - http://docs.python.org/release/2.7.3/
13. Ruby in Twenty Minutes - http://www.ruby-lang.org/en/documentation/quickstart/
