
{

"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### About the Author"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of Sebastian Raschka's greatest passions are \"Data Science\" and machine
learning. Sebastian enjoys everything that involves working with data: The
discovery of interesting patterns and coming up with insightful conclusions using
techniques from the fields of data mining and machine learning for predictive
modeling.\n",
"\n",
"Currently, Sebastian is sharpening his analytical skills as a PhD candidate at
Michigan State University where he is working on a highly efficient virtual
screening software for computer-aided drug-discovery and a novel approach to
protein ligand docking (among other projects). Basically, it is about the screening
of a database of millions of 3-dimensional structures of chemical compounds in
order to identifiy the ones that could potentially bind to specific protein
receptors in order to trigger a biological response.\n",
"\n",
"You can follow Sebastian on Twitter ([@rasbt](https://twitter.com/rasbt)) or
read more about his favorite projects on [his blog]
(http://sebastianraschka.com/articles.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Principal Component Analysis in 3 Simple Steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Principal Component Analysis (PCA) is a simple yet popular and useful linear
transformation technique that is used in numerous applications, such as stock
market predictions, the analysis of gene expression data, and many more. In this
tutorial, we will see that PCA is not just a \"black box\", and we are going to
unravel its internals in 3 basic steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Introduction](#Introduction)\n",
" - [PCA Vs. LDA](#PCA-Vs.-LDA)\n",
" - [PCA and Dimensionality Reduction](#PCA-and-Dimensionality-
Reduction)\n",
" - [A Summary of the PCA Approach](#A-Summary-of-the-PCA-Approach)\n",
"- [Preparing the Iris Dataset](#Preparing-the-Iris-Dataset)\n",
" - [About Iris](#About-Iris)\n",
" - [Loading the Dataset](#Loading-the-Dataset)\n",
" - [Exploratory Visualization](#Exploratory-Visualization)\n",
" - [Standardizing](#Standardizing)\n",
"- [1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues](#1---
Eigendecomposition---Computing-Eigenvectors-and-Eigenvalues)\n",
" - [Covariance Matrix](#Covariance-Matrix)\n",
" - [Correlation Matrix](#Correlation-Matrix)\n",
" - [Singular Vector Decomposition](#Singular-Vector-Decomposition)\n",
"- [2 - Selecting Principal Components](#2---Selecting-Principal-
Components)\n",
" - [Sorting Eigenpairs](#Sorting-Eigenpairs)\n",
" - [Explained Variance](#Explained-Variance)\n",
" - [Projection Matrix](#Projection-Matrix)\n",
"- [3 - Projection Onto the New Feature Space](#3---Selecting-Principal-
Components)\n",
"- [Shortcut - PCA in scikit-learn](#Shortcut---PCA-in-scikit-learn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sheer size of data in the modern age is not only a challenge for computer
hardware but also a main bottleneck for the performance of many machine learning
algorithms. The main goal of a PCA analysis is to identify patterns in data; PCA
aims to detect the correlation between variables. If a strong correlation between
variables exists, the attempt to reduce the dimensionality only makes sense. In a
nutshell, this is what PCA is all about: Finding the directions of maximum variance
in high-dimensional data and project it onto a smaller dimensional subspace while
retaining most of the information.\n",
"\n",
"<img
src='http://sebastianraschka.com/Images/2015_pca_in_3_steps/pca_sketch.png'
alt=''>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PCA Vs. LDA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both Linear Discriminant Analysis (LDA) and PCA are linear transformation
methods. PCA yields the directions (principal components) that maximize the
variance of the data, whereas LDA also aims to find the directions that maximize
the separation (or discrimination) between different classes, which can be useful
in pattern classification problem (PCA \"ignores\" class labels). \n",
"***In other words, PCA projects the entire dataset onto a different feature
(sub)space, and LDA tries to determine a suitable feature (sub)space in order to
distinguish between patterns that belong to different classes.*** "
]
},
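{
"cell_type": "markdown",
"metadata": {},
"source": [
"The contrast is easy to see in code. The following cell is a minimal sketch, assuming a recent scikit-learn version (it uses the `LinearDiscriminantAnalysis` class and loads Iris via `sklearn.datasets` rather than the UCI file we will download later): PCA is fit without the class labels, while LDA requires them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Minimal sketch contrasting PCA (unsupervised) and LDA (supervised);\n",
"# assumes a recent scikit-learn version\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
"\n",
"iris = load_iris()\n",
"\n",
"# PCA never sees the class labels\n",
"X_pca = PCA(n_components=2).fit_transform(iris.data)\n",
"\n",
"# LDA uses the class labels to find discriminative directions\n",
"X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(iris.data, iris.target)\n",
"\n",
"print(X_pca.shape)  # (150, 2)\n",
"print(X_lda.shape)  # (150, 2)"
]
},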
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PCA and Dimensionality Reduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often, the desired goal is to reduce the dimensions of a $d$-dimensional
dataset by projecting it onto a $(k)$-dimensional subspace (where $k\\;<\\;d$) in
order to increase the computational efficiency while retaining most of the
information. An important question is \"what is the size of $k$ that represents the
data 'well'?\"\n",
"\n",
"Later, we will compute eigenvectors (the principal components) of a dataset
and collect them in a projection matrix. Each of those eigenvectors is associated
with an eigenvalue which can be interpreted as the \"length\" or \"magnitude\" of
the corresponding eigenvector. If some eigenvalues have a significantly larger
magnitude than others that the reduction of the dataset via PCA onto a smaller
dimensional subspace by dropping the \"less informative\" eigenpairs is
reasonable.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Summary of the PCA Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Standardize the data.\n",
"- Obtain the Eigenvectors and Eigenvalues from the covariance matrix or
correlation matrix, or perform Singular Vector Decomposition.\n",
"- Sort eigenvalues in descending order and choose the $k$ eigenvectors that
correspond to the $k$ largest eigenvalues where $k$ is the number of dimensions of
the new feature subspace ($k \\le d$)/.\n",
"- Construct the projection matrix $\\mathbf{W}$ from the selected $k$
eigenvectors.\n",
"- Transform the original dataset $\\mathbf{X}$ via $\\mathbf{W}$ to obtain a
$k$-dimensional feature subspace $\\mathbf{Y}$."
]
},
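{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is that sketch, using random data as a stand-in for a real dataset; the rest of this notebook walks through the same steps on the Iris dataset in detail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Compact sketch of the 5-step PCA recipe on random stand-in data\n",
"import numpy as np\n",
"\n",
"np.random.seed(1)\n",
"X_demo = np.random.randn(100, 5)                      # 100 samples, d=5 features\n",
"k = 2                                                 # target dimensionality\n",
"\n",
"X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # 1. standardize\n",
"e_vals, e_vecs = np.linalg.eig(np.cov(X_demo_std.T))  # 2. eigendecomposition\n",
"order = np.argsort(e_vals)[::-1]                      # 3. sort eigenvalues, descending\n",
"W_demo = e_vecs[:, order[:k]]                         # 4. projection matrix (d x k)\n",
"Y_demo = X_demo_std.dot(W_demo)                       # 5. transform onto the subspace\n",
"\n",
"print(Y_demo.shape)                                   # (100, 2)"
]
},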
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the Iris Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About Iris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the following tutorial, we will be working with the famous \"Iris\"
dataset that has been deposited on the UCI machine learning repository \n",
"([https://archive.ics.uci.edu/ml/datasets/Iris]
(https://archive.ics.uci.edu/ml/datasets/Iris)).\n",
"\n",
"The iris dataset contains measurements for 150 iris flowers from three
different species.\n",
"\n",
"The three classes in the Iris dataset are:\n",
"\n",
"1. Iris-setosa (n=50)\n",
"2. Iris-versicolor (n=50)\n",
"3. Iris-virginica (n=50)\n",
"\n",
"And the four features of in Iris dataset are:\n",
"\n",
"1. sepal length in cm\n",
"2. sepal width in cm\n",
"3. petal length in cm\n",
"4. petal width in cm\n",
"\n",
"<img
src=\"http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png\"
alt=\"Iris\" style=\"width: 200px;\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to load the Iris data directly from the UCI repository, we are going
to use the superb [pandas](http://pandas.pydata.org) library. If you haven't used
pandas yet, I want encourage you to check out the [pandas tutorials]
(http://pandas.pydata.org/pandas-docs/stable/tutorials.html). If I had to name one
Python library that makes working with data a wonderfully simple task, this would
definitely be pandas!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal_len</th>\n",
" <th>sepal_wid</th>\n",
" <th>petal_len</th>\n",
" <th>petal_wid</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>145</th>\n",
" <td> 6.7</td>\n",
" <td> 3.0</td>\n",
" <td> 5.2</td>\n",
" <td> 2.3</td>\n",
" <td> Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td> 6.3</td>\n",
" <td> 2.5</td>\n",
" <td> 5.0</td>\n",
" <td> 1.9</td>\n",
" <td> Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td> 6.5</td>\n",
" <td> 3.0</td>\n",
" <td> 5.2</td>\n",
" <td> 2.0</td>\n",
" <td> Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td> 6.2</td>\n",
" <td> 3.4</td>\n",
" <td> 5.4</td>\n",
" <td> 2.3</td>\n",
" <td> Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td> 5.9</td>\n",
" <td> 3.0</td>\n",
" <td> 5.1</td>\n",
" <td> 1.8</td>\n",
" <td> Iris-virginica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal_len sepal_wid petal_len petal_wid class\n",
"145 6.7 3.0 5.2 2.3 Iris-virginica\n",
"146 6.3 2.5 5.0 1.9 Iris-virginica\n",
"147 6.5 3.0 5.2 2.0 Iris-virginica\n",
"148 6.2 3.4 5.4 2.3 Iris-virginica\n",
"149 5.9 3.0 5.1 1.8 Iris-virginica"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\n",
" filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-
databases/iris/iris.data', \n",
" header=None, \n",
" sep=',')\n",
"\n",
"df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']\n",
"df.dropna(how=\"all\", inplace=True) # drops the empty line at file-end\n",
"\n",
"df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# split data table into data X and class labels y\n",
"\n",
"X = df.ix[:,0:4].values\n",
"y = df.ix[:,4].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our iris dataset is now stored in form of a $150 \\times 4$ matrix where the
columns are the different features, and every row represents a separate flower
sample.\n",
"Each sample row $\\mathbf{x}$ can be pictured as a 4-dimensional vector \n",
"\n",
"\n",
"$\\mathbf{x^T} = \\begin{pmatrix} x_1 \\\\ x_2 \\\\ x_3 \\\\ x_4
\\end{pmatrix} \n",
"= \\begin{pmatrix} \\text{sepal length} \\\\ \\text{sepal width}
\\\\\\text{petal length} \\\\ \\text{petal width} \\end{pmatrix}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploratory Visualization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feeling for how the 3 different flower classes are distributes along
the 4 different features, let us visualize them via histograms."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import plotly.plotly as py\n",
"from plotly.graph_objs import *\n",
"import plotly.tools as tls"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<iframe id=\"igraph\" scrolling=\"no\"
style=\"border:none;\"seamless=\"seamless\"
src=\"https://plot.ly/~rasbt/319.embed\" height=\"525\" width=\"100%\"></iframe>"
],
"text/plain": [
"<plotly.tools.PlotlyDisplay object>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# plotting histograms\n",
"\n",
"traces = []\n",
"\n",
"legend = {0:False, 1:False, 2:False, 3:True}\n",
"\n",
"colors = {'Iris-setosa': 'rgb(31, 119, 180)', \n",
" 'Iris-versicolor': 'rgb(255, 127, 14)', \n",
" 'Iris-virginica': 'rgb(44, 160, 44)'}\n",
"\n",
"for col in range(4):\n",
" for key in colors:\n",
" traces.append(Histogram(x=X[y==key, col], \n",
" opacity=0.75,\n",
" xaxis='x%s' %(col+1),\n",
" marker=Marker(color=colors[key]),\n",
" name=key,\n",
" showlegend=legend[col]))\n",
"\n",
"data = Data(traces)\n",
"\n",
"layout = Layout(barmode='overlay',\n",
" xaxis=XAxis(domain=[0, 0.25], title='sepal length (cm)'),\n",
" xaxis2=XAxis(domain=[0.3, 0.5], title='sepal width (cm)'),\n",
" xaxis3=XAxis(domain=[0.55, 0.75], title='petal length
(cm)'),\n",
" xaxis4=XAxis(domain=[0.8, 1], title='petal width (cm)'),\n",
" yaxis=YAxis(title='count'),\n",
" title='Distribution of the different Iris flower
features')\n",
"\n",
"fig = Figure(data=data, layout=layout)\n",
"py.iplot(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standardizing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether to standardize the data prior to a PCA on the covariance matrix
depends on the measurement scales of the original features. Since PCA yields a
feature subspace that maximizes the variance along the axes, it makes sense to
standardize the data, especially, if it was measured on different scales. Although,
all features in the Iris dataset were measured in centimeters, let us continue with
the transformation of the data onto unit scale (mean=0 and variance=1), which is a
requirement for the optimal performance of many machine learning algorithms."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"X_std = StandardScaler().fit_transform(X)"
]
},
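{
"cell_type": "markdown",
"metadata": {},
"source": [
"What `StandardScaler` does here can be spelled out in plain NumPy (a quick equivalence check, assuming the cell above was run): each column is centered to mean 0 and scaled to unit variance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# StandardScaler is equivalent to column-wise centering and scaling\n",
"print(np.allclose(X_std, (X - X.mean(axis=0)) / X.std(axis=0)))"
]
},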
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The eigenvectors and eigenvalues of a covariance (or correlation) matrix
represent the \"core\" of a PCA: The eigenvectors (principal components) determine
the directions of the new feature space, and the eigenvalues determine their
magnitude. In other words, the eigenvalues explain the variance of the data along
the new feature axes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Covariance Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classic approach to PCA is to perform the eigendecomposition on the
covariance matrix $\\Sigma$, which is a $d \\times d$ matrix where each element
represents the covariance between two features. The covariance between two features
is calculated as follows:\n",
"\n",
"$\\sigma_{jk} = \\frac{1}{n-1}\\sum_{i=1}^{N}\\left( x_{ij}-\\bar{x}_j
\\right) \\left( x_{ik}-\\bar{x}_k \\right).$\n",
"\n",
"We can summarize the calculation of the covariance matrix via the following
matrix equation: \n",
"$\\Sigma = \\frac{1}{n-1} \\left( (\\mathbf{X} - \\mathbf{\\bar{x}})^T\\;
(\\mathbf{X} - \\mathbf{\\bar{x}}) \\right)$ \n",
"where $\\mathbf{\\bar{x}}$ is the mean vector \n",
"$\\mathbf{\\bar{x}} = \\sum\\limits_{k=1}^n x_{i}.$ \n",
"The mean vector is a $d$-dimensional vector where each value in this vector
represents the sample mean of a feature column in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Covariance matrix \n",
"[[ 1.00671141 -0.11010327 0.87760486 0.82344326]\n",
" [-0.11010327 1.00671141 -0.42333835 -0.358937 ]\n",
" [ 0.87760486 -0.42333835 1.00671141 0.96921855]\n",
" [ 0.82344326 -0.358937 0.96921855 1.00671141]]\n"
]
}
],
"source": [
"import numpy as np\n",
"mean_vec = np.mean(X_std, axis=0)\n",
"cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-
1)\n",
"print('Covariance matrix \\n%s' %cov_mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The more verbose way above was simply used for demonstration purposes,
equivalently, we could have used the numpy `cov` function:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NumPy covariance matrix: \n",
"[[ 1.00671141 -0.11010327 0.87760486 0.82344326]\n",
" [-0.11010327 1.00671141 -0.42333835 -0.358937 ]\n",
" [ 0.87760486 -0.42333835 1.00671141 0.96921855]\n",
" [ 0.82344326 -0.358937 0.96921855 1.00671141]]\n"
]
}
],
"source": [
"print('NumPy covariance matrix: \\n%s' %np.cov(X_std.T))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we perform an eigendecomposition on the covariance matrix:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Eigenvectors \n",
"[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]\n",
" [-0.26335492 -0.92555649 0.24203288 -0.12413481]\n",
" [ 0.58125401 -0.02109478 0.14089226 -0.80115427]\n",
" [ 0.56561105 -0.06541577 0.6338014 0.52354627]]\n",
"\n",
"Eigenvalues \n",
"[ 2.93035378 0.92740362 0.14834223 0.02074601]\n"
]
}
],
"source": [
"cov_mat = np.cov(X_std.T)\n",
"\n",
"eig_vals, eig_vecs = np.linalg.eig(cov_mat)\n",
"\n",
"print('Eigenvectors \\n%s' %eig_vecs)\n",
"print('\\nEigenvalues \\n%s' %eig_vals)"
]
},
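{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a small sketch that assumes the cells above were executed), we can verify the defining eigen-equation $\\Sigma v = \\lambda v$ for every eigenpair; note that the eigenvectors are the *columns* of `eig_vecs`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Verify cov_mat * v = lambda * v for each (eigenvalue, eigenvector) pair\n",
"for i in range(len(eig_vals)):\n",
"    v = eig_vecs[:, i]  # i-th eigenvector is the i-th column\n",
"    np.testing.assert_array_almost_equal(cov_mat.dot(v), eig_vals[i] * v)\n",
"print('Eigen-equation holds for all eigenpairs.')"
]
},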
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Correlation Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Especially, in the field of \"Finance,\" the correlation matrix typically used
instead of the covariance matrix. However, the eigendecomposition of the covariance
matrix (if the input data was standardized) yields the same results as a
eigendecomposition on the correlation matrix, since the correlation matrix can be
understood as the normalized covariance matrix."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eigendecomposition of the standardized data based on the correlation matrix:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Eigenvectors \n",
"[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]\n",
" [-0.26335492 -0.92555649 0.24203288 -0.12413481]\n",
" [ 0.58125401 -0.02109478 0.14089226 -0.80115427]\n",
" [ 0.56561105 -0.06541577 0.6338014 0.52354627]]\n",
"\n",
"Eigenvalues \n",
"[ 2.91081808 0.92122093 0.14735328 0.02060771]\n"
]
}
],
"source": [
"cor_mat1 = np.corrcoef(X_std.T)\n",
"\n",
"eig_vals, eig_vecs = np.linalg.eig(cor_mat1)\n",
"\n",
"print('Eigenvectors \\n%s' %eig_vecs)\n",
"print('\\nEigenvalues \\n%s' %eig_vals)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eigendecomposition of the raw data based on the correlation matrix:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Eigenvectors \n",
"[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]\n",
" [-0.26335492 -0.92555649 0.24203288 -0.12413481]\n",
" [ 0.58125401 -0.02109478 0.14089226 -0.80115427]\n",
" [ 0.56561105 -0.06541577 0.6338014 0.52354627]]\n",
"\n",
"Eigenvalues \n",
"[ 2.91081808 0.92122093 0.14735328 0.02060771]\n"
]
}
],
"source": [
"cor_mat2 = np.corrcoef(X.T)\n",
"\n",
"eig_vals, eig_vecs = np.linalg.eig(cor_mat2)\n",
"\n",
"print('Eigenvectors \\n%s' %eig_vecs)\n",
"print('\\nEigenvalues \\n%s' %eig_vals)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can clearly see that all three approaches yield the same eigenvectors and
eigenvalue pairs:\n",
" \n",
"- Eigendecomposition of the covariance matrix after standardizing the
data.\n",
"- Eigendecomposition of the correlation matrix.\n",
"- Eigendecomposition of the correlation matrix after standardizing the data."
]
},
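{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check that scaling relationship directly (a short sketch, assuming the cells above were run): the covariance matrix of the standardized data equals the correlation matrix times $n/(n-1)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_samples = X_std.shape[0]\n",
"# covariance of standardized data = correlation matrix scaled by n/(n-1)\n",
"print(np.allclose(np.cov(X_std.T), np.corrcoef(X.T) * n_samples / (n_samples - 1.0)))"
]
},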
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Singular Vector Decomposition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the eigendecomposition of the covariance or correlation matrix may be
more intuitiuve, most PCA implementations perform a Singular Vector Decomposition
(SVD) to improve the computational efficiency. So, let us perform an SVD to
confirm that the result are indeed the same:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.52237162, -0.37231836, 0.72101681, 0.26199559],\n",
" [ 0.26335492, -0.92555649, -0.24203288, -0.12413481],\n",
" [-0.58125401, -0.02109478, -0.14089226, -0.80115427],\n",
" [-0.56561105, -0.06541577, -0.6338014 , 0.52354627]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u,s,v = np.linalg.svd(X_std.T)\n",
"u"
]
},
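{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns of $\\mathbf{U}$ match the eigenvectors computed above up to sign (an eigenvector is only defined up to its sign, so a flipped column describes the same axis). Furthermore, the singular values $s_i$ relate to the eigenvalues of the covariance matrix via $\\lambda_i = s_i^2 / (n-1)$, which we can check with one line (assuming the cells above were run):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# squared singular values divided by (n-1) recover the covariance eigenvalues\n",
"print(s**2 / (X_std.shape[0] - 1.0))"
]
},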
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 - Selecting Principal Components"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting Eigenpairs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The typical goal of a PCA is to reduce the dimensionality of the original
feature space by projecting it onto a smaller subspace, where the eigenvectors will
form the axes. However, the eigenvectors only define the directions of the new
axis, since they have all the same unit length 1, which can confirmed by the
following two lines of code:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Everything ok!\n"
]
}
],
"source": [
"for ev in eig_vecs:\n",
" np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))\n",
"print('Everything ok!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to decide which eigenvector(s) can dropped without losing too much
information\n",
"for the construction of lower-dimensional subspace, we need to inspect the
corresponding eigenvalues: The eigenvectors with the lowest eigenvalues bear the
least information about the distribution of the data; those are the ones can be
dropped. \n",
"In order to do so, the common approach is to rank the eigenvalues from highest
to lowest in order choose the top $k$ eigenvectors."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Eigenvalues in descending order:\n",
"2.91081808375\n",
"0.921220930707\n",
"0.147353278305\n",
"0.0206077072356\n"
]
}
],
"source": [
"# Make a list of (eigenvalue, eigenvector) tuples\n",
"eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in
range(len(eig_vals))]\n",
"\n",
"# Sort the (eigenvalue, eigenvector) tuples from high to low\n",
"eig_pairs.sort()\n",
"eig_pairs.reverse()\n",
"\n",
"# Visually confirm that the list is correctly sorted by decreasing
eigenvalues\n",
"print('Eigenvalues in descending order:')\n",
"for i in eig_pairs:\n",
" print(i[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explained Variance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After sorting the eigenpairs, the next question is \"how many principal
components are we going to choose for our new feature subspace?\" A useful measure
is the so-called \"explained variance,\" which can be calculated from the
eigenvalues. The explained variance tells us how much information (variance) can be
attributed to each of the principal components."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<iframe id=\"igraph\" scrolling=\"no\"
style=\"border:none;\"seamless=\"seamless\"
src=\"https://plot.ly/~rasbt/320.embed\" height=\"525\" width=\"100%\"></iframe>"
],
"text/plain": [
"<plotly.tools.PlotlyDisplay object>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tot = sum(eig_vals)\n",
"var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]\n",
"cum_var_exp = np.cumsum(var_exp)\n",
"\n",
"trace1 = Bar(\n",
" x=['PC %s' %i for i in range(1,5)],\n",
" y=var_exp,\n",
" showlegend=False)\n",
"\n",
"trace2 = Scatter(\n",
" x=['PC %s' %i for i in range(1,5)], \n",
" y=cum_var_exp,\n",
" name='cumulative explained variance')\n",
"\n",
"data = Data([trace1, trace2])\n",
"\n",
"layout=Layout(\n",
" yaxis=YAxis(title='Explained variance in percent'),\n",
" title='Explained variance by different principal components')\n",
"\n",
"fig = Figure(data=data, layout=layout)\n",
"py.iplot(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above clearly shows that most of the variance (72.77% of the variance
to be precise) can be explained by the first principal component alone. The second
principal component still bears some information (23.03%) while the third and
fourth principal components can safely be dropped without losing to much
information. Together, the first two principal components contain 95.8% of the
information."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Projection Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's about time to get to the really interesting part: The construction of the
projection matrix that will be used to transform the Iris data onto the new feature
subspace. Although, the name \"projection matrix\" has a nice ring to it, it is
basically just a matrix of our concatenated top *k* eigenvectors.\n",
"\n",
"Here, we are reducing the 4-dimensional feature space to a 2-dimensional
feature subspace, by choosing the \"top 2\" eigenvectors with the highest
eigenvalues to construct our $d \\times k$-dimensional eigenvector matrix
$\\mathbf{W}$."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('Matrix W:\\n', array([[ 0.52237162, -0.37231836],\n",
" [-0.26335492, -0.92555649],\n",
" [ 0.58125401, -0.02109478],\n",
" [ 0.56561105, -0.06541577]]))\n"
]
}
],
"source": [
"matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1), \n",
" eig_pairs[1][1].reshape(4,1)))\n",
"\n",
"print('Matrix W:\\n', matrix_w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 - Projection Onto the New Feature Space"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this last step we will use the $4 \\times 2$-dimensional projection matrix
$\\mathbf{W}$ to transform our samples onto the new subspace via the equation \n",
"$\\mathbf{Y} = \\mathbf{X} \\times \\mathbf{W}$, where $\\mathbf{Y}$ is a
$150\\times 2$ matrix of our transformed samples."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Y = X_std.dot(matrix_w)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<iframe id=\"igraph\" scrolling=\"no\"
style=\"border:none;\"seamless=\"seamless\"
src=\"https://plot.ly/~rasbt/321.embed\" height=\"525\" width=\"100%\"></iframe>"
],
"text/plain": [
"<plotly.tools.PlotlyDisplay object>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"traces = []\n",
"\n",
"for name in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):\n",
"\n",
" trace = Scatter(\n",
" x=Y[y==name,0],\n",
" y=Y[y==name,1],\n",
" mode='markers',\n",
" name=name,\n",
" marker=Marker(\n",
" size=12,\n",
" line=Line(\n",
" color='rgba(217, 217, 217, 0.14)',\n",
" width=0.5),\n",
" opacity=0.8))\n",
" traces.append(trace)\n",
"\n",
"\n",
"data = Data(traces)\n",
"layout = Layout(showlegend=True,\n",
" scene=Scene(xaxis=XAxis(title='PC1'),\n",
" yaxis=YAxis(title='PC2'),))\n",
"\n",
"fig = Figure(data=data, layout=layout)\n",
"py.iplot(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shortcut - PCA in scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For educational purposes, we went a long way to apply the PCA to the Iris
dataset. But luckily, there is already implementation in scikit-learn. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA as sklearnPCA\n",
"sklearn_pca = sklearnPCA(n_components=2)\n",
"Y_sklearn = sklearn_pca.fit_transform(X_std)"
]
},
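{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a short sketch, assuming the earlier cells were run), the scikit-learn projection should match our manual result up to the sign of each component, and `explained_variance_ratio_` should reproduce the explained-variance percentages we computed by hand:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# columns of Y_sklearn may be sign-flipped relative to the manual projection Y\n",
"print(np.allclose(np.abs(Y_sklearn), np.abs(Y)))\n",
"\n",
"# should match var_exp[:2], i.e. roughly [72.77, 23.03]\n",
"print(sklearn_pca.explained_variance_ratio_ * 100)"
]
},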
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<iframe id=\"igraph\" scrolling=\"no\"
style=\"border:none;\"seamless=\"seamless\"
src=\"https://plot.ly/~rasbt/322.embed\" height=\"525\" width=\"100%\"></iframe>"
],
"text/plain": [
"<plotly.tools.PlotlyDisplay object>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"traces = []\n",
"\n",
"for name in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):\n",
"\n",
" trace = Scatter(\n",
" x=Y_sklearn[y==name,0],\n",
" y=Y_sklearn[y==name,1],\n",
" mode='markers',\n",
" name=name,\n",
" marker=Marker(\n",
" size=12,\n",
" line=Line(\n",
" color='rgba(217, 217, 217, 0.14)',\n",
" width=0.5),\n",
" opacity=0.8))\n",
" traces.append(trace)\n",
"\n",
"\n",
"data = Data(traces)\n",
"layout = Layout(xaxis=XAxis(title='PC1', showline=False),\n",
" yaxis=YAxis(title='PC2', showline=False))\n",
"fig = Figure(data=data, layout=layout)\n",
"py.iplot(fig)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
