#CampusDays
Agenda
1. Introduction to Personalization
2. How to Build a Recommendation Engine
3. Overview of Technologies (Hadoop 2.4.0 and Mahout 0.9)
4. Moving Data into Microsoft Azure
5. Analyzing Data in Azure with HDInsight 3.1
Spark?
8. Demo of Spark 1.0.2 on Customized Hadoop Cluster in Azure
9. Getting Started on Your Own
10. Q&A
Introduction to Personalization
27% of customers have seen Personalization online
86% of those say Personalization
experience
59% of customers who have experienced
How to Build a Recommendation Engine
Overview of Technologies
Hadoop Timeline
October 2003: Google publishes GFS paper
http://research.google.com/archive/gfs.html
http://research.google.com/archive/mapreduce.html
Hadoop
and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available
Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed datacenters.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
Errh...
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
[Diagram: Hadoop cluster. Client(s) connect to the masters, which run the Primary Name Node, the Secondary Name Node and the Job Tracker. Each slave runs a Data Node and a Task Tracker. The Name Nodes and Data Nodes form the distributed data storage layer (HDFS); the Job Tracker and Task Trackers form the distributed data processing layer (MR).]
Source: http://www.rabidgremlin.com/data20
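MapReduce in C#
using System.Collections.Generic;
using System.Linq;

public class NaiveMapReduceProgram<K1, V1, K2, V2, V3>
{
    public delegate IEnumerable<KeyValuePair<K2, V2>> MapFunction(K1 key, V1 value);
    public delegate IEnumerable<V3> ReduceFunction(K2 key, IEnumerable<V2> values);

    // The slides elide these members with [...]; they are assumed here so the class compiles.
    private readonly MapFunction _map;
    private readonly ReduceFunction _reduce;

    public NaiveMapReduceProgram(MapFunction map, ReduceFunction reduce)
    {
        _map = map;
        _reduce = reduce;
    }

    // Map phase: apply the map function to every input pair and flatten the results.
    // Takes IEnumerable (a Dictionary still works) so that Execute can pass its input through.
    public IEnumerable<KeyValuePair<K2, V2>> Map(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        var q = from pair in input
                from mapped in _map(pair.Key, pair.Value)
                select mapped;
        return q;
    }

    // Reduce phase: group the intermediate values by key, then reduce each group.
    public IEnumerable<KeyValuePair<K2, V3>> Reduce(IEnumerable<KeyValuePair<K2, V2>> intermediateValues)
    {
        var groups = from pair in intermediateValues
                     group pair.Value by pair.Key into g
                     select g;
        var reduced = from g in groups
                      let k2 = g.Key
                      from reducedValue in _reduce(k2, g)
                      select new KeyValuePair<K2, V3>(k2, reducedValue);
        return reduced;
    }

    // Run the whole job: map first, then reduce.
    public IEnumerable<KeyValuePair<K2, V3>> Execute(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        return Reduce(Map(input));
    }
}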
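The same map, group and reduce flow can be sketched as a word count. This is a minimal single-process illustration in Python, not how Hadoop runs a job: the function names (map_reduce, word_count_map, word_count_reduce) and the two sample documents are made up for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group-by-key and reduce in one process (no cluster involved)."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: collect all values that share a key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: each key's values are reduced independently of the others.
    output = []
    for k2, values in groups.items():
        for v3 in reduce_fn(k2, values):
            output.append((k2, v3))
    return output


def word_count_map(_filename, text):
    # Emit (word, 1) for every word in the document.
    return [(word.lower(), 1) for word in text.split()]


def word_count_reduce(_word, counts):
    # Sum the ones emitted for this word.
    return [sum(counts)]


# Hypothetical input: (file name, contents) pairs.
documents = [
    ("a.txt", "big data on big clusters"),
    ("b.txt", "data on commodity hardware"),
]
counts = dict(map_reduce(documents, word_count_map, word_count_reduce))
print(counts["big"], counts["data"], counts["hardware"])  # prints "2 2 1"
```

Because each key is reduced on its own, the reduce phase is the part a cluster can spread across many machines.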
already.
The first technique is known as user-based recommendation, and the second technique is known as item-based recommendation.
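As a rough illustration of the user-based technique, the sketch below scores songs a user has not played by how similar the other listeners are to that user. The listening history is hypothetical, and Jaccard similarity is just one possible similarity measure chosen for the example; the deck does not say which measure was used.

```python
# Hypothetical listening history: user -> set of songs played.
history = {
    "User 1": {"S1", "S2"},
    "User 2": {"S1", "S2", "S4"},
    "User 3": {"S3", "S4"},
    "User 4": {"S2", "S3", "S4"},
}


def jaccard(a, b):
    """Similarity of two users: shared songs divided by all songs either played."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(user, history):
    """Rank unseen songs by the summed similarity of the users who played them."""
    mine = history[user]
    scores = {}
    for other, songs in history.items():
        if other == user:
            continue
        weight = jaccard(mine, songs)
        for song in songs - mine:
            scores[song] = scores.get(song, 0.0) + weight
    # Highest score first; ties broken by song name for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))


print(recommend_user_based("User 1", history))  # prints ['S4', 'S3']
```

User 2 shares two of three songs with User 1, so the song only User 2 and User 4 add (S4) outranks S3.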
Users:
Anders Lybecker (User 1)
Henrik Westergaard Hansen (User 2)
Tina Stenderup-Larsen (User 3)
Christina Hjortkjær (User 4)
Songs:
Beyoncé, Single Ladies (S1)
Justin Timberlake, Suit & Tie (S2)
Medina, Jalousi (S3)
Pharrell Williams, Happy (S4)
Similarity Matrix:
[Figure: 4 x 4 song-to-song matrix with rows and columns S1, S2, S3, S4]
Excel formula:
{=MMULT(A2:E5,F2:F5)}
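[Figure: the Similarity Matrix (rows and columns S1 to S4) multiplied by the Preference vector of Tina Stenderup-Larsen (User 3) over S1 to S4 gives R3, her recommendation scores for S1 to S4]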
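The matrix-times-vector step can be sketched in a few lines of Python. The listening history below is hypothetical (the values from the slides are not available), and a plain co-occurrence count is assumed as the similarity measure; mat_vec plays the role of Excel's MMULT for a single column vector.

```python
songs = ["S1", "S2", "S3", "S4"]

# Hypothetical listening history: user -> set of songs played.
history = {
    "User 1": {"S1", "S2"},
    "User 2": {"S1", "S2", "S4"},
    "User 3": {"S3", "S4"},
    "User 4": {"S2", "S3", "S4"},
}


def cooccurrence(history, songs):
    """Count, for every pair of songs, how many users played both."""
    index = {s: i for i, s in enumerate(songs)}
    matrix = [[0] * len(songs) for _ in songs]
    for played in history.values():
        for a in played:
            for b in played:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def mat_vec(matrix, vector):
    """Multiply a matrix by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


similarity = cooccurrence(history, songs)
# Preference vector for User 3: 1 for songs already played, 0 otherwise.
preference = [1 if s in history["User 3"] else 0 for s in songs]
scores = mat_vec(similarity, preference)

# Recommend only songs the user has not played yet, best score first.
unplayed = [s for s in songs if s not in history["User 3"]]
ranked = sorted(unplayed, key=lambda s: (-scores[songs.index(s)], s))
print(scores, ranked)  # prints "[1, 3, 2, 2] ['S2', 'S1']"
```

With this made-up data, S2 scores highest for User 3 because it was played together with both S3 and S4 by other users.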
Documentation: https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html
Overview of Technology
Hadoop: 2.4.0.2.1.8.0-2176
Mahout: 0.9.0.2.1.8.0-2176
Manual Movement
Upload
Download
Recovery
Naming Conflicts
http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
New-AzureStorageContext
Start-AzureStorageBlobCopy
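C#
using System.Configuration;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Variables for the cloud storage objects.
var storageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName="
    + ConfigurationManager.AppSettings["AzureAccountName"]
    + ";AccountKey="
    + ConfigurationManager.AppSettings["AzureStorageKey"]);
// Create the blob client.
var blobClient = storageAccount.CreateCloudBlobClient();
// Get the container reference.
var blobContainer = blobClient.GetContainerReference("data");
// Create the container if it does not exist.
blobContainer.CreateIfNotExists();
// Get a reference to the target block blob (this step is not shown on the slide).
var blob = blobContainer.GetBlockBlobReference(blobname);
// Upload blob asynchronously. Filename, blobname and the ProcessComplete
// callback are assumed to be defined elsewhere in the program.
blob.BeginUploadFromFile(Filename, FileMode.Open, ProcessComplete, blobname);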
Demo of HDInsight
SOAKIMP12A8C130995    1
SOAPDEY12A81C210A9    1
SOBBMDR12A8C13253B    2
SOBFNSP12AF72A0E22    1
SOBFOVM12A58A7D494    1
Console app written for .NET Framework 4.5 (x64), 100 lines of optimized code.
Input file = 3 GB. More than 48 million rows...
Took more than 30 minutes to complete with 10 million rows...
A Hadoop cluster could have done that in a matter of seconds.
https://social.msdn.microsoft.com/Forums/azure/en-US/24a06c13-bde0-4955-86f3-85a2f2b37153/test-dataset-contains-invalid-data
https://issues.apache.org/jira/browse/MAHOUT-1512?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC
Apache Spark is an open-source, in-memory data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley.
In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
MapReduce
Introduced in Nov. 2011
Very mature
Community support
Proven in the field
Tested stability
Q&A
#Ask me about everything!
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.