CampusDays2014 Apps3008

Building a Recommendation Engine with HDInsight and Hadoop in Microsoft Azure
App3008
Sebastian Brandes
Technology Evangelist, Microsoft
November 26, 2014
#CampusDays
Agenda
1. Introduction to Personalization
2. How to Build a Recommendation Engine
3. Overview of Technologies (Hadoop 2.4.0 and Mahout 0.9)
4. Moving Data into Microsoft Azure
5. Analyzing Data in Azure with HDInsight 3.1
6. Introduction to Azure Machine Learning
7. Future of Apache Mahout and Azure ML: the rise of Apache Spark?
8. Demo of Spark 1.0.2 on Customized Hadoop Cluster in Azure
9. Getting Started on Your Own
10. Q&A
Introduction to Personalization
27% of customers have seen personalization online.
86% of those say personalization influenced what they purchased to some extent.
31% want a more personalized experience.
59% of customers who have experienced personalization believe it has a noticeable influence on purchasing.
58% prefer product recommendations from previous purchases over other forms of personalization.
How to Build a Recommendation Engine
Overview of Technologies
Hadoop Timeline
October 2003: Google publishes GFS paper
http://research.google.com/archive/gfs.html
December 2004: Google publishes MapReduce paper
http://research.google.com/archive/mapreduce.html
2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop
[Insert many years here.]
November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available
Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed datacenters.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
A mahout is a person who rides an elephant.
Errh...
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering and classification.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Traditional Hadoop Cluster
[Diagram: client(s) connect to the masters and the slaves.
Masters: Primary Name Node, Secondary Name Node (distributed data storage, HDFS) and Job Tracker (distributed data processing, MR).
Slaves: each slave runs a Data Node (HDFS) and a Task Tracker (MR).]
HDInsight Cluster and Storage
Source: http://www.rabidgremlin.com/data20
MapReduce in C#
using System.Collections.Generic;
using System.Linq;

// A naive, in-memory MapReduce: no distribution, only the programming model.
public class NaiveMapReduceProgram<K1, V1, K2, V2, V3>
{
    public delegate IEnumerable<KeyValuePair<K2, V2>> MapFunction(K1 key, V1 value);
    public delegate IEnumerable<V3> ReduceFunction(K2 key, IEnumerable<V2> values);

    private readonly MapFunction _map;
    private readonly ReduceFunction _reduce;

    public NaiveMapReduceProgram(MapFunction mapFunction, ReduceFunction reduceFunction)
    {
        _map = mapFunction;
        _reduce = reduceFunction;
    }

    // Map phase: apply the map function to every input pair and flatten the results.
    public IEnumerable<KeyValuePair<K2, V2>> Map(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        var q = from pair in input
                from mapped in _map(pair.Key, pair.Value)
                select mapped;
        return q;
    }

    // Reduce phase: group the intermediate values by key, then reduce each group.
    public IEnumerable<KeyValuePair<K2, V3>> Reduce(IEnumerable<KeyValuePair<K2, V2>> intermediateValues)
    {
        var groups = from pair in intermediateValues
                     group pair.Value by pair.Key into g
                     select g;
        var reduced = from g in groups
                      let k2 = g.Key
                      from reducedValue in _reduce(k2, g)
                      select new KeyValuePair<K2, V3>(k2, reducedValue);
        return reduced;
    }

    // Run both phases back to back.
    public IEnumerable<KeyValuePair<K2, V3>> Execute(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        return Reduce(Map(input));
    }
}
MapReduce in C#: Word Count
// Map: emit (word, 1) for every word in the input text.
public IList<KeyValuePair<string, int>> MapFromMem(string key, string value)
{
    var result = new List<KeyValuePair<string, int>>();
    foreach (var word in value.Split(' '))
    {
        result.Add(new KeyValuePair<string, int>(word, 1));
    }
    return result;
}

// Reduce: sum all counts emitted for one word.
public IEnumerable<int> Reduce(string key, IEnumerable<int> values)
{
    int sum = 0;
    foreach (int value in values)
    {
        sum += value;
    }
    return new int[1] { sum };
}

// inputData holds the input as string key/value pairs (for example name and text).
var master = new NaiveMapReduceProgram<string, string, string, int, int>(MapFromMem, Reduce);
var result = master.Execute(inputData).ToDictionary(pair => pair.Key, pair => pair.Value);
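For comparison, the same three phases (map, group by key, reduce) can be sketched in a few lines of Python. This is a minimal single-process illustration, not how Hadoop executes a job; the function names and the sample lines are made up for the example, and words are split on any whitespace rather than on a single space.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Iterable[int]],
) -> Dict[str, int]:
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle phase: collect all values that share a key.
            groups[out_key].append(out_value)
    # Reduce phase: collapse each group of values into a result for its key.
    result: Dict[str, int] = {}
    for out_key, values in groups.items():
        for reduced in reduce_fn(out_key, values):
            result[out_key] = reduced
    return result


def word_count_map(key: str, line: str) -> Iterable[Tuple[str, int]]:
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def word_count_reduce(word: str, counts: List[int]) -> Iterable[int]:
    """Sum the counts emitted for one word."""
    return [sum(counts)]


# Hypothetical input: (line id, line text) pairs.
lines = [("line1", "to be or not to be"), ("line2", "to do")]
counts = map_reduce(lines, word_count_map, word_count_reduce)
print(counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

On a real cluster the map and reduce calls run on different nodes and the grouping step moves data across the network; only the shape of the two user-supplied functions stays the same.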
How to Build a Recommendation Engine
Broadly speaking, we can build a recommender either by
finding questions that a user may be interested in answering, based on the questions answered by other users like them, or
finding other questions that are similar to the questions they have already answered.
The first technique is known as user-based recommendation, and the second technique is known as item-based recommendation.
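The user-based variant can be sketched as follows. This is an illustrative toy, assuming Jaccard overlap as the measure of how alike two users are (the slides do not name a measure); the users, items and function names are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of items: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target: str, likes: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Score items the target has not seen by the similarity of the users who like them."""
    scores: Dict[str, float] = {}
    for other, items in likes.items():
        if other == target:
            continue
        similarity = jaccard(likes[target], items)
        for item in items - likes[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical data: which user likes which item.
likes = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"D"},
}
print(recommend_user_based("u1", likes))  # ['C']
```

Item "C" is suggested because it is liked by u2, who shares two items with u1; item "D" comes from a user with no overlap and is dropped.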
Users:
Anders Lybecker (User 1)
Henrik Westergaard Hansen (User 2)
Tina Stenderup-Larsen (User 3)
Christina Hjortkjær (User 4)
Songs:
Beyoncé: Single Ladies (S1)
Justin Timberlake: Suit & Tie (S2)
Medina: Jalousi (S3)
Pharrell Williams: Happy (S4)
Similarity Matrix:
[Table: songs S1 to S4 as rows and as columns]
Excel formula:
{=MMULT(A2:E5,F2:F5)}
Similarity Matrix:
[Table: songs S1 to S4 as rows and as columns]
Tina Stenderup-Larsen (User 3):
Preference: [one value per song, S1 to S4]
R3: [one value per song, S1 to S4]
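The matrix step can be sketched in Python: build an item-to-item co-occurrence matrix, then multiply it by one user's preference vector, which is the job MMULT does in the spreadsheet. The likes below are hypothetical (the arrows between users and songs on the slide are not reproduced here), and co-occurrence counting is assumed as the similarity measure.

```python
from itertools import permutations
from typing import Dict, List, Set


def cooccurrence_matrix(likes: Dict[str, Set[str]], items: List[str]) -> List[List[int]]:
    """Count, for every pair of different items, how many users like both."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for liked in likes.values():
        for a, b in permutations(liked, 2):
            matrix[index[a]][index[b]] += 1
    return matrix


def multiply(matrix: List[List[int]], vector: List[int]) -> List[int]:
    """Matrix-vector product: one score per item."""
    return [sum(cell * v for cell, v in zip(row, vector)) for row in matrix]


items = ["S1", "S2", "S3", "S4"]
# Hypothetical likes, chosen only to make the arithmetic easy to follow.
likes = {
    "User 1": {"S1", "S2"},
    "User 2": {"S1", "S2", "S4"},
    "User 3": {"S1", "S3"},
    "User 4": {"S2", "S3"},
}

matrix = cooccurrence_matrix(likes, items)
preference = [1 if item in likes["User 3"] else 0 for item in items]
scores = multiply(matrix, preference)

# Recommend only songs the user has not liked yet, best score first.
ranked = sorted(
    (item for item in items if item not in likes["User 3"]),
    key=lambda item: -scores[items.index(item)],
)
print(matrix)  # [[0, 2, 1, 1], [2, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
print(scores)  # [1, 3, 1, 1]
print(ranked)  # ['S2', 'S4']
```

With this made-up data, S2 scores highest for User 3 because it co-occurs with both songs that user already likes.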
In Mahout 0.9 we will make use of:
Item-Based Collaborative Filtering
More specifically, this class:
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
Documentation: https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html
Overview of Technology
Hadoop 2.4.0.2.1.8.0-2176
Mahout 0.9.0.2.1.8.0-2176
Microsoft Azure HDInsight 3.1
Can't Use mahout-distribution-0.9
You need the Mahout distribution built specifically for HDInsight 3.1.
It's included in any new HDInsight cluster by default.
This error is a sign of a package made for Hadoop 1.x, which is incompatible with Hadoop 2.x.
Moving Data into Azure
A very common question among customers: How to do it?
Different approaches to the task:
Manual movement
Automated movement (on some type of schedule)
Initial data load vs. incremental data load
Challenges with private networks (corporate networks)
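The difference between an initial and an incremental load can be sketched as a simple selection rule: the first run sends everything, later runs send only what changed since the last successful sync. The function name, the timestamp bookkeeping and the sample files are assumptions for illustration; the actual upload would be done with one of the tools listed in this section.

```python
from typing import Dict, List, Optional


def files_to_upload(local_files: Dict[str, float], last_sync: Optional[float]) -> List[str]:
    """Pick the files to send.

    local_files maps a file path to its last-modified time (seconds since epoch).
    last_sync is the time of the previous successful upload, or None if there was none.
    """
    if last_sync is None:
        # Initial load: nothing has been uploaded yet, so send everything.
        return sorted(local_files)
    # Incremental load: send only files modified after the previous sync.
    return sorted(path for path, modified in local_files.items() if modified > last_sync)


# Hypothetical local files and modification times.
local_files = {"a.csv": 100.0, "b.csv": 250.0, "c.csv": 300.0}
print(files_to_upload(local_files, None))   # ['a.csv', 'b.csv', 'c.csv']
print(files_to_upload(local_files, 200.0))  # ['b.csv', 'c.csv']
```

A scheduled job would store the sync time after each successful run and pass it in on the next one.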

UI-based Explorers
Azure Storage Explorer
CloudXplorer
Drag and drop
Easy to use for manual movement
Movement of complete folders
AzCopy
Blob Loading Tool
Upload
Download
Command Line Utility
Logging
Recovery
Naming Conflicts
http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
PowerShell
Set-AzureStorageBlobContent: uploads a blob synchronously; can use foreach to iterate through a directory
New-AzureStorageContext
Start-AzureStorageBlobCopy: asynchronously moves blobs between storage accounts
C#
using System.Configuration;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Variables for the cloud storage objects.
var storageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName="
    + ConfigurationManager.AppSettings["AzureAccountName"] + ";AccountKey="
    + ConfigurationManager.AppSettings["AzureStorageKey"]);
// Create the blob client.
var blobClient = storageAccount.CreateCloudBlobClient();
// Get the container reference.
var blobContainer = blobClient.GetContainerReference("data");
// Create the container if it does not exist.
blobContainer.CreateIfNotExists();
// Get a reference to the target blob; blobname holds the name of the blob.
var blob = blobContainer.GetBlockBlobReference(blobname);
// Upload blob asynchronously; ProcessComplete is called when the upload finishes.
blob.BeginUploadFromFile(Filename, FileMode.Open, ProcessComplete, blobname);
Demo of HDInsight
The Echo Nest Taste Profile Subset
The dataset contains real user play counts from undisclosed partners, with all songs already matched to the Million Song Dataset (MSD).
Link: http://labrosa.ee.columbia.edu/millionsong/tasteprofile
1,019,318 unique users
384,546 unique MSD songs
48,373,586 user - song - play count triplets
Sample (user, song, play count):
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995  1
...                                       SOAPDEY12A81C210A9  1
                                          SOBBMDR12A8C13253B  2
                                          SOBFNSP12AF72A0E22  1
                                          SOBFOVM12A58A7D494  1
The Echo Nest Taste Profile Subset
userId,songId,timesPlayed
1,1,1
1,2,1
1,3,2
1,4,1
1,5,1
1,6,1
1,7,2
1,8,1
1,9,1
1,10,1
...
[Overlay note on the slide, only partly legible: "In this example, I only use 5,0..."]
Transforming the Data

Using C# on my Surface Pro 3 with:
Intel Core i5-4300U @ 1.90 GHz
8 GB RAM
256 GB SSD
Console app written for .NET Framework 4.5 (x64), 100 lines of optimized code.
Input file = 3 GB. More than 48 million rows...
Took more than 30 minutes to complete with 10 million rows...
A Hadoop cluster could have done that in a matter of seconds.
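The transformation itself, from hashed user IDs and song IDs to the numeric userId,songId,timesPlayed rows shown earlier, can be sketched in Python. This is not the console app from the talk; it assumes the raw file is tab-separated with one triplet per line and that IDs are simply numbered in order of first appearance. The function name is made up for the example.

```python
from typing import Dict, Iterable, Iterator


def to_numeric_triplets(lines: Iterable[str]) -> Iterator[str]:
    """Turn 'user<TAB>song<TAB>plays' lines into 'userId,songId,timesPlayed' rows."""
    user_ids: Dict[str, int] = {}
    song_ids: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, song, plays = line.split("\t")
        # setdefault hands out the next free number the first time an ID is seen.
        user_id = user_ids.setdefault(user, len(user_ids) + 1)
        song_id = song_ids.setdefault(song, len(song_ids) + 1)
        yield f"{user_id},{song_id},{int(plays)}"


# Sample rows built from the IDs on the slide (assumed to belong to the same user).
sample = [
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAPDEY12A81C210A9\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2",
]
rows = list(to_numeric_triplets(sample))
print(rows)  # ['1,1,1', '1,2,1', '1,3,2']
```

Because the function is a generator, the input can be streamed line by line from a file without loading all rows into memory; only the two ID dictionaries grow with the data.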
Analyzing Data in Azure with HDInsight 3.1
hadoop fs -rm -r input
hadoop fs -rm -r temp
hadoop fs -mkdir input
hadoop fs -copyFromLocal mInput.txt /user/writer/input/
hadoop fs -copyFromLocal users.txt /user/writer/input/
hadoop jar C:\apps\dist\mahout-0.9.0.2.1.8.0-2176\core\target\mahout-core-0.9.0.2.1.8.0-2176-job.jar ^
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob ^
  -s SIMILARITY_COOCCURRENCE ^
  --input=/user/writer/input/mInput.txt ^
  --output=/user/writer/output ^
  --usersFile=/user/writer/input/users.txt
hadoop fs -copyToLocal output/part-r-00000 c:\output.txt
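Once the output file has been copied locally, it can be read back with a few lines of Python. The sketch below assumes each output line has the form userID, a tab, then a bracketed list of itemID:score pairs; that layout is an assumption to check against your own output file, and the sample lines are invented.

```python
from typing import Dict, Iterable, List, Tuple


def parse_recommendations(lines: Iterable[str]) -> Dict[int, List[Tuple[int, float]]]:
    """Parse lines like '1<TAB>[12:3.0,7:2.5]' into {user: [(item, score), ...]}."""
    result: Dict[int, List[Tuple[int, float]]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, items = line.split("\t")
        pairs: List[Tuple[int, float]] = []
        for entry in items.strip("[]").split(","):
            if not entry:
                continue  # a user with an empty recommendation list
            item, score = entry.split(":")
            pairs.append((int(item), float(score)))
        result[int(user)] = pairs
    return result


# Invented sample lines in the assumed format.
sample_output = ["1\t[12:3.0,7:2.5]", "2\t[]"]
print(parse_recommendations(sample_output))  # {1: [(12, 3.0), (7, 2.5)], 2: []}
```

The numeric IDs in the result still need to be mapped back to the original user and song IDs using the lookup built during the transformation step.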
Demo of Azure Machine Learning
Azure Machine Learning
https://social.msdn.microsoft.com/Forums/azure/en-US/24a06c13-bde0-4955-86f3-85a2f2b37153/test-dataset-contains-invalid-data
Future of Apache Mahout and Azure Machine Learning
Apache Mahout
Latest stable release: v0.9
Under development: v1.0
Replacing MapReduce with Apache Spark
Compatibility with Hadoop 2
https://issues.apache.org/jira/browse/MAHOUT-1512?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC
Rewriting algorithms for Apache Spark, H2O and Flink
Azure Machine Learning
Made available in public preview in July 2014
Half price until general availability (GA)
Apache Spark is an open-source, in-memory data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley.
In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
Apache Mahout vs. Apache Spark?
Apache Mahout:
MapReduce
Introduced in Nov. 2011
Very mature
Community support
Proven in the field
Tested stability
Apache Spark:
MLlib much faster!
Not as mature
Hot in the community
Became an Apache top-level project in February 2014
Apache Spark 1.0 is a preview feature in Microsoft Azure.
Guide to set it up here:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
HDInsight 3.1 is required. It is compatible with HDFS and WASB so
existing data can be used for Apache Spark jobs.
A sample script for installation is available via the link above.
To run Spark queries, you can use the Scala shell or write standalone
Setting Up Hadoop with Spark 1.0.2 in Azure using PowerShell
Import-Module Azure

$subscriptionName = "Platform Internt forbrug"  # Name of the Azure subscription
$clusterName = "cd2014spark"                    # The HDInsight cluster name
$storageAccountName = "cd2014spark"             # Azure storage account that hosts the default container
$storageAccountKey = "<YOURKEY>"
$containerName = $clusterName
$location = "West Europe"                       # The location of the HDInsight cluster
$clusterNodes = 4                               # The number of nodes in the HDInsight cluster
$version = "3.1"                                # For example "3.1"

Select-AzureSubscription $subscriptionName
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes
$config.DefaultStorageAccount.StorageAccountName = "$storageAccountName.blob.core.windows.net"
$config.DefaultStorageAccount.StorageAccountKey = $storageAccountKey
$config.DefaultStorageAccount.StorageContainerName = $containerName
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" `
    -ClusterRoleCollection HeadNode `
    -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1
New-AzureHDInsightCluster -Config $config -Name $clusterName -Location $location -Version $version
Valid as of November 26, 2014.
Word Count Example in Spark 1.0.2
Using the Scala shell:
val f = sc.textFile("/example/data/gutenberg/davinci.txt")
val counts = f.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)
Slides and demos are available at: sebastianbrandes.com
EVENT SPONSORS
TRACK SPONSORS
EXPO SPONSORS
Q&A
#Ask me about everything!
Join me at the Microsoft Booth for the next 30 minutes @Meet The Experts
Don't forget to: Evaluate this session!
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
