
Building a Recommendation Engine with HDInsight and Hadoop in Microsoft Azure
App3008
Sebastian Brandes
Technology Evangelist, Microsoft
November 26, 2014

#CampusDays

Agenda
1. Introduction to Personalization
2. How to Build a Recommendation Engine
3. Overview of Technologies (Hadoop 2.4.0 and Mahout 0.9)
4. Moving Data into Microsoft Azure
5. Analyzing Data in Azure with HDInsight 3.1
6. Introduction to Azure Machine Learning
7. Future of Apache Mahout and Azure ML: the rise of Apache Spark?
8. Demo of Spark 1.0.2 on Customized Hadoop Cluster in Azure
9. Getting Started on Your Own
10. Q&A

Introduction to Personalization
27% of customers have seen personalization online.
86% of those say personalization influenced what they purchased to some extent.
31% want a more personalized experience.
59% of customers who have experienced personalization believe it has a noticeable influence on purchasing.
58% prefer product recommendations from previous purchases over other forms of personalization.


How to Build a Recommendation Engine



Overview of Technologies


Hadoop Timeline
October 2003: Google publishes GFS paper
http://research.google.com/archive/gfs.html
December 2004: Google publishes MapReduce paper
http://research.google.com/archive/mapreduce.html
2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop

[Insert many years here.]


November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available


Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed datacenters.

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

A mahout is a person who rides an elephant.

Errh...
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
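
To make the programming model concrete, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) in Python. It only illustrates the idea and is not how Hadoop distributes the work; the function names and the two sample documents are made up for this example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce in memory on one machine (illustration only)."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: collapse each group of values into one result per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(doc_name, text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in text.split()]


def word_count_reduce(word, counts):
    # Sum the ones emitted for this word.
    return sum(counts)


# Hypothetical input: (document name, document text) pairs.
documents = [("doc1", "to be or not to be"), ("doc2", "to do")]
word_counts = map_reduce(documents, word_count_map, word_count_reduce)
print(word_counts)
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the contract between the phases is the same.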

Traditional Hadoop Cluster
[Diagram: Client(s) connect to the Masters (Primary Name Node, Secondary Name Node, Job Tracker). Each Slave runs a Data Node and a Task Tracker. Distributed Data Processing (MR) is handled by the Job Tracker and the Task Trackers; Distributed Data Storage (HDFS) by the Name Nodes and the Data Nodes.]

HDInsight Cluster and Storage


Source: http://www.rabidgremlin.com/data20


MapReduce in C#
using System.Collections.Generic;
using System.Linq;

// A naive, in-memory MapReduce. K1/V1 are the input key/value types,
// K2/V2 the intermediate types emitted by the map step, V3 the reduced output type.
public class NaiveMapReduceProgram<K1, V1, K2, V2, V3>
{
    public delegate IEnumerable<KeyValuePair<K2, V2>> MapFunction(K1 key, V1 value);
    public delegate IEnumerable<V3> ReduceFunction(K2 key, IEnumerable<V2> values);

    private readonly MapFunction _map;
    private readonly ReduceFunction _reduce;

    public NaiveMapReduceProgram(MapFunction mapFunction, ReduceFunction reduceFunction)
    {
        _map = mapFunction;
        _reduce = reduceFunction;
    }

    // Map: apply the map function to every input pair and flatten the results.
    // Takes any sequence of pairs (a Dictionary also qualifies) so that Execute can call it.
    public IEnumerable<KeyValuePair<K2, V2>> Map(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        var q = from pair in input
                from mapped in _map(pair.Key, pair.Value)
                select mapped;
        return q;
    }

    // Reduce: group the intermediate values by key, then reduce each group.
    public IEnumerable<KeyValuePair<K2, V3>> Reduce(IEnumerable<KeyValuePair<K2, V2>> intermediateValues)
    {
        var groups = from pair in intermediateValues
                     group pair.Value by pair.Key into g
                     select g;
        var reduced = from g in groups
                      let k2 = g.Key
                      from reducedValue in _reduce(k2, g)
                      select new KeyValuePair<K2, V3>(k2, reducedValue);
        return reduced;
    }

    // Execute: run the map step followed by the reduce step.
    public IEnumerable<KeyValuePair<K2, V3>> Execute(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        return Reduce(Map(input));
    }
}

MapReduce in C#: Word Count
// Map: emit (word, 1) for every word in the input text.
public IList<KeyValuePair<string, int>> MapFromMem(string key, string value)
{
    var result = new List<KeyValuePair<string, int>>();
    foreach (var word in value.Split(' '))
    {
        result.Add(new KeyValuePair<string, int>(word, 1));
    }
    return result;
}

// Reduce: sum all counts emitted for one word.
public IEnumerable<int> Reduce(string key, IEnumerable<int> values)
{
    int sum = 0;
    foreach (int value in values)
    {
        sum += value;
    }
    return new int[1] { sum };
}

// Wire the two functions together and run the job.
// inputData holds (document name, text) pairs, e.g. a Dictionary<string, string>.
var master = new NaiveMapReduceProgram<string, string, string, int, int>(MapFromMem, Reduce);
var result = master.Execute(inputData).ToDictionary(key => key.Key, v => v.Value);

How to Build a Recommendation Engine


Broadly speaking, we can build a recommender either by
finding questions that a user may be interested in answering, based on the questions answered by other users like them, or
finding other questions that are similar to the questions they have already answered.
The first technique is known as user-based recommendation, and the second technique is known as item-based recommendation.
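
The difference between the two techniques can be sketched on a toy data set. The Python below is only an illustration: the listening data is hypothetical, and Jaccard overlap is a stand-in similarity measure chosen for brevity, not the measure used later in the talk.

```python
# Hypothetical listening data: which user has played which song.
plays = {
    "user1": {"S1", "S2"},
    "user2": {"S1", "S2", "S4"},
    "user3": {"S1", "S3"},
    "user4": {"S3", "S4"},
}


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_based(target, plays):
    """Suggest the songs played by the user most similar to the target user."""
    others = [user for user in plays if user != target]
    nearest = max(others, key=lambda user: jaccard(plays[target], plays[user]))
    return sorted(plays[nearest] - plays[target])


def item_based(target, plays):
    """Suggest the unplayed songs most similar to the songs the target user played."""
    # Invert the data: for every song, the set of users who played it.
    listeners = {}
    for user, songs in plays.items():
        for song in songs:
            listeners.setdefault(song, set()).add(user)

    # Score each unplayed song by its total similarity to the played songs.
    scores = {}
    for candidate in listeners:
        if candidate not in plays[target]:
            scores[candidate] = sum(
                jaccard(listeners[candidate], listeners[played])
                for played in plays[target]
            )
    best = max(scores.values())
    return sorted(song for song, score in scores.items() if score == best)


print(user_based("user1", plays))
print(item_based("user1", plays))
```

User-based recommendation compares rows (users) of the data, item-based recommendation compares columns (songs). On this toy data both approaches suggest S4 for user1, but on real data they generally differ.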

Users:
Anders Lybecker (User 1)
Henrik Westergaard Hansen (User 2)
Tina Stenderup-Larsen (User 3)
Christina Hjortkjær (User 4)

Songs:
Beyoncé, Single Ladies (S1)
Justin Timberlake, Suit & Tie (S2)
Medina, Jalousi (S3)
Pharrell Williams, Happy (S4)

Similarity Matrix:
[4 x 4 matrix with rows and columns S1 to S4; cell values not preserved]

Excel formula:
{=MMULT(A2:E5,F2:F5)}

[Slide: the Similarity Matrix (rows and columns S1 to S4) multiplied by the Preference vector of Tina Stenderup-Larsen (User 3) gives the result vector R3 over S1 to S4; cell values not preserved]
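
The matrix product behind the Excel MMULT formula can be sketched in a few lines of Python. The numbers from the slides did not survive, so the play histories and the preference vector below are hypothetical; the similarity used is a plain co-occurrence count, in the spirit of the SIMILARITY_COOCCURRENCE option used with Mahout later in the talk.

```python
songs = ["S1", "S2", "S3", "S4"]

# Hypothetical play histories, one list per user (the slide values were not preserved).
histories = [
    ["S1", "S2"],
    ["S1", "S2", "S4"],
    ["S1", "S3"],
    ["S3", "S4"],
]


def cooccurrence_matrix(histories, songs):
    """Count, for every pair of distinct songs, how many users played both."""
    index = {song: i for i, song in enumerate(songs)}
    size = len(songs)
    matrix = [[0] * size for _ in range(size)]
    for history in histories:
        for a in history:
            for b in history:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def mat_vec(matrix, vector):
    """Matrix-vector product: the same operation MMULT performs in Excel."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


similarity = cooccurrence_matrix(histories, songs)

# Hypothetical preference vector: 1 = the user has played the song, 0 = not yet.
preference = [0, 0, 1, 1]
scores = mat_vec(similarity, preference)

# Recommend only songs the user has not played, highest score first.
recommendations = sorted(
    (song for song, played in zip(songs, preference) if played == 0),
    key=lambda song: scores[songs.index(song)],
    reverse=True,
)
print(similarity)
print(scores)
print(recommendations)
```

Each score is the sum of the co-occurrence counts between a candidate song and the songs the user already played, so songs that often appear together with the user's history rank highest.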

In Mahout 0.9 we will make use of:
Item-Based Collaborative Filtering
More specifically, this class:
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
Documentation: https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html

Overview of Technology
Apache Hadoop 2.4.0.2.1.8.0-2176
Apache Mahout 0.9.0.2.1.8.0-2176
Microsoft Azure HDInsight 3.1

Can't use mahout-distribution-0.9
You need the Mahout distribution built especially for HDInsight 3.1.
It's included in any new HDInsight cluster by default.
This error is a sign of a package made for Hadoop 1.x, which is incompatible with Hadoop 2.x.

Moving Data into Azure
A very common question among customers: how do I do it?
Different approaches to the task:
Manual movement
Automated movement (on some type of schedule)
Initial data load vs. incremental data load
Challenges with private networks (corporate networks)

UI-based Explorers
Azure Storage Explorer
CloudXplorer
Drag and drop
Easy to use for manual movement
Movement of complete folders

AzCopy
Blob loading tool: upload, download
Command line utility: logging, recovery, naming conflicts
http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/

PowerShell
Set-AzureStorageBlobContent: uploads a blob synchronously; can be combined with foreach to iterate through a directory
New-AzureStorageContext
Start-AzureStorageBlobCopy: asynchronously moves blobs between storage accounts

C#
using System.Configuration;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Variables for the cloud storage objects.
var storageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName="
    + ConfigurationManager.AppSettings["AzureAccountName"]
    + ";AccountKey="
    + ConfigurationManager.AppSettings["AzureStorageKey"]);
// Create the blob client.
var blobClient = storageAccount.CreateCloudBlobClient();
// Get the container reference.
var blobContainer = blobClient.GetContainerReference("data");
// Create the container if it does not exist.
blobContainer.CreateIfNotExists();
// Get a reference to the target blob (this step was missing on the slide).
var blob = blobContainer.GetBlockBlobReference(blobname);
// Upload blob. Filename, blobname and the ProcessComplete callback are defined elsewhere.
blob.BeginUploadFromFile(Filename, FileMode.Open, ProcessComplete, blobname);

Demo of HDInsight

The Echo Nest Taste Profile Subset


The dataset contains real user play counts from undisclosed partners, with all songs already matched to the Million Song Dataset.
Link: http://labrosa.ee.columbia.edu/millionsong/tasteprofile
1,019,318 unique users
384,546 unique MSD songs
48,373,586 user - song - play count triplets

Sample rows (user, song, play count):
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAPDEY12A81C210A9  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B  2
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFNSP12AF72A0E22  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFOVM12A58A7D494  1
...

The Echo Nest Taste Profile Subset
userId,songId,timesPlayed
1,1,1
1,2,1
1,3,2
1,4,1
1,5,1
1,6,1
1,7,2
1,8,1
1,9,1
1,10,1
...

Transforming the Data
Using C# on my Surface Pro 3 with:
Intel Core i5-4300U @ 1.90 GHz
8 GB RAM
256 GB SSD
Console app written for .NET Framework 4.5 (x64), 100 lines of optimized code.
Input file = 3 GB. More than 48 million rows...
Took more than 30 minutes to complete with 10 million rows...
A Hadoop cluster could have done that in a matter of seconds.
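
The transformation itself is simple: replace every user hash and song ID with a small integer, because Mahout's recommender expects numeric user and item IDs. The Python below is a minimal sketch of that step, not the console app used in the talk; the function names are made up for this example, and it assumes whitespace-separated triplets as in the sample rows shown earlier.

```python
import csv
import io


def to_integer_ids(triplet_lines):
    """Map (user hash, song ID, play count) triplets to integer-ID rows."""
    user_ids = {}  # user hash -> integer ID, assigned in order of first appearance
    song_ids = {}  # song ID   -> integer ID, assigned in order of first appearance
    rows = []
    for line in triplet_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, song, count = parts
        user_id = user_ids.setdefault(user, len(user_ids) + 1)
        song_id = song_ids.setdefault(song, len(song_ids) + 1)
        rows.append((user_id, song_id, int(count)))
    return rows


def to_csv(rows):
    """Render the rows as userId,songId,timesPlayed lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


sample = [
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAPDEY12A81C210A9\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2",
]
rows = to_integer_ids(sample)
print(to_csv(rows))
```

Using dictionaries for the two lookups keeps each row a constant-time operation, so a single pass over the file is enough; for the full 48 million rows the dictionaries must still fit in memory.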


Analyzing Data in Azure with HDInsight 3.1


hadoop fs -rm -r input
hadoop fs -rm -r temp
hadoop fs -mkdir input
hadoop fs -copyFromLocal mInput.txt /user/writer/input/
hadoop fs -copyFromLocal users.txt /user/writer/input/
hadoop jar C:\apps\dist\mahout-0.9.0.2.1.8.0-2176\core\target\mahout-core-0.9.0.2.1.8.0-2176-job.jar ^
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob ^
    -s SIMILARITY_COOCCURRENCE ^
    --input=/user/writer/input/mInput.txt ^
    --output=/user/writer/output ^
    --usersFile=/user/writer/input/users.txt
hadoop fs -copyToLocal output/part-r-00000 c:\output.txt
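
Once the output file has been copied to local disk, it can be read back with a few lines of Python. This sketch assumes the usual RecommenderJob text output, one line per user in the form userID, a tab, then a bracketed list of itemID:score pairs; the sample line is made up, and the function name is hypothetical.

```python
def parse_recommendations(line):
    """Parse one output line, e.g. '1<TAB>[42:3.5,17:2.0]', into (user, [(item, score), ...])."""
    user, _, items = line.strip().partition("\t")
    items = items.strip("[]")
    recommendations = []
    if items:
        for entry in items.split(","):
            item, _, score = entry.partition(":")
            recommendations.append((int(item), float(score)))
    return int(user), recommendations


# Hypothetical output line, not taken from the demo.
user_id, recs = parse_recommendations("1\t[42:3.5,17:2.0]")
print(user_id, recs)
```

The integer item IDs in the result are the ones assigned during the transformation step, so they have to be mapped back to the original song IDs before being shown to a user.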


Demo of Azure Machine Learning

Azure Machine Learning
https://social.msdn.microsoft.com/Forums/azure/en-US/24a06c13-bde0-4955-86f3-85a2f2b37153/test-dataset-contains-invalid-data

Future of Apache Mahout and Azure Machine Learning

Apache Mahout
Latest stable release: v0.9
Under development: v1.0
Replacing MapReduce with Apache Spark
Compatibility with Hadoop 2
Rewriting algorithms for Apache Spark, H2O and Flink
https://issues.apache.org/jira/browse/MAHOUT-1512?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC

Azure Machine Learning


Made available in public preview in July 2014
Half price until general availability (GA)

Apache Spark is an open-source in-memory data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

Apache Mahout vs. Apache Spark?

Apache Mahout:
MapReduce
Introduced in Nov. 2011
Very mature
Community support
Proven in the field
Tested stability

Apache Spark:
MLlib much faster!
Not as mature
Hot in the community
Became an Apache top-level project in February 2014

Apache Spark 1.0 is a preview feature in Microsoft Azure.

Guide to set it up here:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
HDInsight 3.1 is required. It is compatible with HDFS and WASB so
existing data can be used for Apache Spark jobs.
A sample script for installation is available via the link above.
To run Spark queries, you can use the Scala shell or write standalone

Setting Up Hadoop with Spark 1.0.2 in Azure using PowerShell

Import-Module Azure

$subscriptionName = "Platform Internt forbrug"   # Name of the Azure subscription
$clusterName = "cd2014spark"                     # The HDInsight cluster name
$storageAccountName = "cd2014spark"              # Azure storage account that hosts the default container
$storageAccountKey = "<YOURKEY>"
$containerName = $clusterName
$location = "West Europe"                        # The location of the HDInsight cluster
$clusterNodes = 4                                # The number of nodes in the HDInsight cluster
$version = "3.1"                                 # For example "3.1"

Select-AzureSubscription $subscriptionName
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes
$config.DefaultStorageAccount.StorageAccountName = "$storageAccountName.blob.core.windows.net"
$config.DefaultStorageAccount.StorageAccountKey = $storageAccountKey
$config.DefaultStorageAccount.StorageContainerName = $containerName
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1
New-AzureHDInsightCluster -Config $config -Name $clusterName -Location $location -Version $version

Valid as of November 26, 2014.

Word Count Example in Spark 1.0.2
Using the Scala shell:
val f = sc.textFile("/example/data/gutenberg/davinci.txt")
val counts = f.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)

Slides and demos are available at: sebastianbrandes.com


Q&A
Ask me about anything!

Join me at the Microsoft booth for the next 30 minutes @ Meet The Experts.
Don't forget to evaluate this session!

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
