
Analytics

Analytics can be defined as a process that involves the use of statistical techniques

(measures of central tendency, graphs, and so on), information system software (data

mining, sorting routines), and operations research methodologies (linear programming) to

explore, visualize, discover and communicate patterns or trends in data.

Skills of a Business Analyst

Among the skills a business analyst needs are being a good listener and stakeholder management.

Types of Analytics

Descriptive analytics: summarizes and describes the information contained in a data set or database.

Example: An age bar chart is used to depict retail shoppers for a department store that wants to target advertising to customers by age.

The purpose is to get a rough picture of what the data generally looks like and what criteria might have potential for identifying trends or future business behavior. Typical tools include measures of central tendency, measures of dispersion (standard deviation), charts, graphs, sorting methods, frequency distributions, probability distributions, and sampling methods.

Predictive analytics: applies statistical and research methods to identify predictive variables and build predictive models that reveal trends and relationships not readily observed in a descriptive analysis.

Example: Multiple regression is used to show the relationship (or lack of relationship) between age, weight, and exercise on diet food sales. Knowing that relationships exist helps explain why one set of independent variables influences dependent variables such as business performance.

The purpose is to build predictive models designed to identify and predict future trends, using statistical methods like multiple regression and ANOVA, information system methods like data mining and sorting, and operations research methods like forecasting models.

Prescriptive analytics: applies operations research methodologies (applied mathematical techniques) to make the best use of allocable resources.

Example: A department store has a limited advertising budget to target customers. Linear programming models can be used to optimally allocate the budget to various advertising media, taking advantage of predicted trends or future opportunities.

Business analytics

Business analytics begins with a data set (a simple collection of data or a data file) or

commonly with a database (a collection of data files that contain information on people,

locations, and so on).

Types of Digital Data

1. Structured data – Data whose elements are addressable for effective analysis. It can be stored in an SQL database in tables with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, this is the most commonly processed form of data and the simplest way to manage information. Example: relational data.

2. Semi-structured data – Information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data); semi-structured formats exist to save space. Example: XML data.

3. Unstructured data – Data that is not organized in a pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.

Summary of the three types:

Structured data
- Information stored in a DB
- Strict format
- The strict format is also its limitation

Semi-structured data
- Data may have a certain structure, but not all information collected has identical structure
- Some attributes may exist in some of the entities of a particular type but not in others

Unstructured data
- No pre-defined structure or data model

Big data

Big data describes the collection of data sets that are so large and complex that software

systems are hardly able to process them (Isson and Harriott, 2013, pp. 57–61).

Little Data

Isson and Harriott (2013, p. 61) define little data as anything that is not big data.

Little data describes the smaller data segments or files that help individual businesses

keep track of customers.

As a means of sorting through data to find useful information, the application of analytics

has found new purpose.

Command Schedule

R itself is a powerful language that performs a wide variety of functions, such as data

manipulation, statistical modeling, and graphics.

One really big advantage of R, however, is its extensibility. Developers can easily write

their own software and distribute it in the form of add-on packages.

Unlike compiled languages such as C++ and Java, R is an interpreted language: you don't need a compiler to first create a program from your code before you can use it.

Features

Console: In the bottom-left corner, you find the console. The console in R Studio is

identical to the console in RGui. This is where you do all the interactive work with R.

Workspace and history: The top-right corner is a handy overview of your workspace,

where you can inspect the variables you created in your session, as well as their values.

This is also the area where you can see a history of the commands you’ve issued in R.

Files, plots, package, and help: In the bottom-right corner, you have access to several

tools:

Files: This is where you can browse the folders and files on your computer.

Packages: This is where you can view a list of all the installed packages.

Help: This is where you can browse the built-in Help system of R.

The prompt indicates where you type your commands to R; you see a blinking cursor to

the right of the prompt.

To quit your R session, type the following code in the console, after the command

prompt (>): > q()

R is Case sensitive

> 24+7+11

[1] 42

> 5*2

[1] 10

> 25/5

[1] 5

Printing Message

Vector

A vector is the simplest type of data structure in R. The R manual defines a vector as “A

single entity consisting of a collection of things.”

You also can construct a vector by using operators. An operator is a symbol you stick

between two values to make a calculation.

The symbols +, -, *, and / are all operators, and they have the same meaning they do in

mathematics.

A vector can be seen as a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for example, could be a vector of length 5.

Unlike most other programming languages, R allows you to apply functions to the whole

vector in a single operation without the need for an explicit loop.

> x <- c(1,2,3,4,5)

> x

[1] 1 2 3 4 5

Next, we’ll add the value 2 to each element in the vector x and print the result:

>x+2

[1] 3 4 5 6 7

> sum(1:5)

[1] 15

One very handy operator is called sequence, and it looks like a colon (:). Type the

following in your console:

> 1:5

[1] 1 2 3 4 5

You can also add one vector to another. To add the values 6:10 element-wise to x, you do

the following:

> x + 6:10

[1] 7 9 11 13 15

This feature of R is extremely powerful because it lets you perform many operations in a

single step.

>x

[1] 1 2 3 4 5

In R, the assignment operator is <-, which you type in the console by using two

keystrokes: the less-than symbol (<) followed by a hyphen (-). The combination of these

two symbols represents assignment.

> y <- 10

> x + y

[1] 11 12 13 14 15

You can also assign with the assign() function (and remember that R is case sensitive, so j and J are different names):

> assign("j", 4)

> j

[1] 4

Calculations

Now create a new variable z, assign it the value of x+y, and print its value:

> z <- x + y

>z

[1] 11 12 13 14 15

You must present text or character values to R inside quotation marks —either single or

double. R accepts both.

Packages

A package is a collection of functions for related tasks. The full list is at
https://cran.r-project.org/web/packages/available_packages_by_name.html

> install.packages("package name")  # installation

> library("package name")          # loading

Installation command: install.packages("coefplot")

Loading command: library(coefplot)

Unloading a package: detach("package:ggplot2", unload=TRUE)


Uninstall Packages

remove.packages(pkgs, lib)

Arguments

lib : a character vector giving the library directories to remove the packages from. If

missing, defaults to the first element in .libPaths().

remove.packages("coefplot")

The c() function is used to combine numeric and text values into vectors.

You can use the paste() function to concatenate multiple text elements. By default,

paste() puts a space between the different elements, like this:

readline() prompts the user for input and stores the input as a character vector:

> readline("User Name: ")

User Name: Arjun Reddy

In R Studio, click anywhere in the source editor, and press Ctrl+Shift+Enter or click the

Source button in the console.

Multiple Commands

ls()

rm(cy)

>m<- c(1,2,3,4,5)

>m

>sum(m)

> paste(firstnames,lastname)

Naming Convention in R

Names must start with a letter or a dot. If you start a name with a dot, the second character can't be a digit. Names may contain letters, numbers, underscores (_), and dots (.). Although you can force R to accept other characters in names, you shouldn't. Because R doesn't use the dot (.) as an operator, the dot can be used in names for objects as well. This style is called dotted style. Ex: print.default()

Order of Operations

1. Exponentiation

2. Multiplication and division, in the order in which the operators are presented

3. Addition and subtraction, in the order in which the operators are presented

The mod operator (%%) and the integer division operator (%/%) are special operators; in R's precedence rules they actually bind more tightly than the normal multiplication and division operators (* and /).

You can change the order of the operations by using parentheses, i.e. ().
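A minimal sketch of how precedence and parentheses play out in the console:

```r
1 + 2 * 3      # multiplication first: 7
(1 + 2) * 3    # parentheses force the addition first: 9
2 * 5 %% 3     # %% binds tighter than *, so this is 2 * (5 %% 3) = 4
7 %/% 3        # integer division: 2
```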

Calculating Logarithms and Exponentials

In R, you can take the logarithm of the numbers from 1 to 3 like this:

> log(1:3)

Whenever you use one of these functions, R calculates the natural logarithm if you don’t

specify any base.

You calculate the logarithm of these numbers with base 6 like this:

> log(1:3,base=6)

>x<-log(1:20)

> exp(x)

Manipulating Operators

Actually, operators are also functions. It helps to know, though, that operators can, in

many cases, be treated just like any other function if you put the operator between

backticks and add the arguments between parentheses,

> '+'(4,6)

> 10

This may be useful later on when you want to apply a function over rows, columns, or

subsets of your data

Vector Types

Integer vectors, containing integer values. (An integer vector is a special kind

of numeric vector.)

Datetime vectors, containing dates and times in different formats

> o<-c(1,2,3,4,5)

>o

[1] 1 2 3 4 5

> is.numeric(o)

[1] TRUE

> is.integer(o)

[1] FALSE

Alternatively, you can specify the length of the sequence by using the argument length.out:

> seq(-2.7, 1.3, length.out=9)

[1] -2.7 -2.2 -1.7 -1.2 -0.7 -0.2 0.3 0.8 1.3

Understanding indexing in R

> numbers <- 30:1

> numbers

[1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

> numbers[10]

[1] 21

All trigonometric functions are available in R: the sine, cosine, and tangent functions and

their inverse functions. You can find them on the Help page you reach by typing ?Trig.

So, you may want to try to calculate the cosine of an angle of 120 degrees like this:

> cos(120)

[1] 0.814181

This code doesn’t give you the correct result, however, because R always works with

angles in radians, not in degrees.

Instead, use a special variable called pi. This variable contains the value of —you

guessed it — π (3.141592653589 . . .).

The correct way to calculate the cosine of an angle of 120 degrees, then, is this:

> cos(120*pi/180)

[1] -0.5

> 2/0

[1] Inf

To check whether a value is finite, use the functions is.finite() and is.infinite().

Ex: >is.infinite(x)

> 0/0

[1] NaN

Structure of a vector

The str() function gives you the type and structure of the object.

> str(fn)

Next to the vector type, R gives you the dimensions of the vector. This example has only

one dimension, and that dimension has indices ranging from 1 to 3.

Finally, R gives you the first few values of the vector. If you want to know only how long

a vector is, you can simply use the length() function.

Combining Vectors

> i<-c("KK",1,2,3)

>i

> j<-c("LL",6,7,8)

> k<-c(i,j)

> k

[1] "KK" "1"  "2"  "3"  "LL" "6"  "7"  "8"

Note that the numbers are coerced to character strings when combined with text. The c() function stands for combine. It doesn't create a vector of vectors — it just combines them into one.

Repeating Vectors

> rep(k, times=3)

 [1] "KK" "1"  "2"  "3"  "LL" "6"  "7"  "8"  "KK" "1"  "2"  "3"  "LL" "6"  "7"  "8"  "KK" "1"  "2"  "3"
[21] "LL" "6"  "7"  "8"

You also can repeat every value by specifying the argument each, like this:

[1] 2 2 2 4 4 4 2 2 2

p<-c(0,0,99)

rep(p,times=3)

[1] 0 0 99 0 0 99 0 0 99

R has a little trick up its sleeve. You can tell R, for each value, how often it has to be repeated. To take advantage of that magic, tell R how often to repeat each value in a vector by using the times argument:

> rep(c(0,7), times=c(4,2))

[1] 0 0 0 0 7 7

use the argument length.out to tell R how long you want it to be.

> rep(1:3,length.out=7)

[1] 1 2 3 1 2 3 1

Vector Manipulations

> r<-c(1,3,5,7,9)

> s<-c(2,4,6,8,10)

>r

[1] 1 3 5 7 9

>s

[1] 2 4 6 8 10

r[3]<-111

s[4]<-222

>r

[1] 1 3 111 7 9

>s

[1] 2 4 6 222 10

> r.copy<-r

> s.copy<-s

>r<-r.copy

>s<-s.copy

prod(x)    Calculates the product of all values in x

cummin(x)  Gives the minimum for all values in x from the start of the vector until the position of that value

cummax(x)  Gives the maximum for all values in x from the start of the vector until the position of that value

diff(x)    Gives for every value the difference between that value and the next value in the vector

Comparing Values in R

> x<-c(3,5,7,9)

>y<-c(1,2,4,5)

>x

> sum(x)

[1] 24

> x>100

[1] FALSE FALSE FALSE FALSE

> x>4

[1] FALSE  TRUE  TRUE  TRUE

Which Function

The which() function takes a logical vector as argument. Hence, you can save the

outcome of a logical vector in an object and pass that to the which() function, as in the

next example.

You also can use all these operators to compare vectors value by value.

> which(x>5)

[1] 3 4

> z<-x>y

> z

[1] TRUE TRUE TRUE TRUE

> which(z)

[1] 1 2 3 4

u<- x-y

u

[1] 2 3 3 4

Here y is a character vector of three two-character elements (for example, y <- c("aa","bb","cc")):

> is.character(y)

[1] TRUE

> length(y)

[1] 3

> nchar(y)

[1] 2 2 2

The nchar() function tells you that each of the 3 elements in y has 2 characters, while length() tells you that y itself has 3 elements.

Subsetting

The process of referring to a subset of a vector through indexing its elements is also

called subsetting.
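A short sketch of the main subsetting styles (the vector x here is hypothetical, just for illustration):

```r
x <- c(10, 20, 30, 40, 50)
x[2]          # by position: 20
x[c(1, 3)]    # several positions at once: 10 30
x[-1]         # a negative index drops that element: 20 30 40 50
x[x > 25]     # a logical vector keeps only matching values: 30 40 50
```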

> letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

> LETTERS

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

> letters[15]

[1] "o"

> LETTERS[20:26]

[1] "T" "U" "V" "W" "X" "Y" "Z"

Head or Tail

You can use the head() function to get the first elements of a variable and tail() to get the last ones.

By default, both head() and tail() return six elements, but you can tell them to return any specific number of elements in the second argument.

> tail(LETTERS, 6)

[1] "U" "V" "W" "X" "Y" "Z"

> islands

This built-in dataset islands, a named vector that contains the surface area of the world’s

48 largest land masses (continents and large islands).

> str(islands)

Named num [1:48] 11506 5500 16988 2968 16 ... - attr(*, "names")= chr [1:48] "Africa"

"Antarctica" "Asia" "Australia" ...


You use the names() function to retrieve the names in a named vector:

> names(islands)[1:5]

Sort

> names(sort(islands)[6:1])

[1] "Taiwan"          "Kyushu"          "Timor"
[4] "Prince of Wales" "Hainan"          "Vancouver"

> month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

> names(month.days) <- month.name

> month.days

  January  February     March     April       May      June
       31        28        31        30        31        30
     July    August September   October  November  December
       31        31        30        31        30        31

Now you can use this vector to find the names of the months with 31 days:

> names(month.days[month.days==31])

[1] "January"  "March"    "May"      "July"     "August"   "October"
[7] "December"

A collection of combined letters and words is called a string. Whenever you work with

text, you need to be able to concatenate words (string them together) and split them apart.

In R, you use the paste() function to concatenate and the strsplit() function to split.
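A minimal sketch of concatenation with paste() before the splitting examples below (the variable names are made up):

```r
first <- "lazy"
second <- "dog"
paste(first, second)             # space separator by default: "lazy dog"
paste(first, second, sep = "-")  # custom separator: "lazy-dog"
paste0(first, second)            # paste0() uses no separator: "lazydog"
```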

Splitting text

> wordplay <- "The quick brown fox jumps over the lazy dog"

> wordplay

[1] "The quick brown fox jumps over the lazy dog"

> strsplit(wordplay, "")

[[1]]
 [1] "T" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r" "o" "w" "n" " " "f" "o" "x" " "
[21] "j" "u" "m" "p" "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l" "a" "z" "y" " "
[41] "d" "o" "g"

> strsplit(wordplay, " ")

[[1]]
[1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"

To find the unique elements of a vector, including a vector of text, you use the unique() function. Here words holds the individual words of wordplay (for example, words <- strsplit(wordplay, " ")[[1]]):

> unique(tolower(words))

[1] "the"   "quick" "brown" "fox"   "jumps" "over"  "lazy"  "dog"

In the variable words, “the” appears twice: once in lowercase and once with the first letter

capitalized. To get a list of the unique words, first convert words to lowercase and then

use unique

> unique(toupper(words))

[1] "THE"   "QUICK" "BROWN" "FOX"   "JUMPS" "OVER"  "LAZY"  "DOG"

> toupper(words)

[1] "THE"   "QUICK" "BROWN" "FOX"   "JUMPS" "OVER"  "THE"   "LAZY"  "DOG"

Factoring in Factors

R has a special data structure for categorical data, called factors. Factors are closely

related to characters because any character vector can be represented by a factor.

In real-world problems, you often encounter data that can be described using words rather

than numerical values. For example, cars can be red, green, or blue (or any other color);

people can be left-handed or right-handed, male or female; energy can be derived from

coal, nuclear, wind, or wave power.

You can use the term categorical data to describe these examples

Factors are special types of objects in R. They’re neither character vectors nor numeric

vectors, although they have some attributes of both.

Factors behave a little bit like character vectors in the sense that the unique categories

often are text.

Factors also behave a little bit like integer vectors because R encodes the levels as

integers.

Creating a factor

levels: An optional vector of the values that x might have taken. The default is

lexicographically sorted, unique values of x.

labels: Another optional vector that, by default, takes the same values as levels. You can

use this argument to rename your levels, as we explain in the next paragraph.

Just remember that levels refer to the input values of x, while labels refer to the output

values of the new factor.
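A small sketch of factor() with the levels and labels arguments described above (the sizes data is invented for illustration):

```r
sizes <- c("small", "large", "small", "medium")
f <- factor(sizes,
            levels = c("small", "medium", "large"),  # input values of x, in the order you want
            labels = c("S", "M", "L"))               # output names for those levels
f              # S L S M, with levels S M L
as.integer(f)  # the underlying integer codes: 1 3 1 2
```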

Searching by pattern

To find substrings, you can use the grep() function, which takes two essential arguments: a search pattern and a character vector to search in.

Searching by position

If you know the exact position of a subtext inside a text element, you use the substr()

function to return the value. To extract the subtext that starts at the third position and

stops at the sixth position of state.name, use the following:
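Both search styles can be sketched with the built-in state.name vector:

```r
grep("New", state.name)        # positions of states whose name contains "New": 29 30 31 32
substr(state.name[1], 3, 6)    # characters 3 through 6 of "Alabama": "abam"
```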

Data Types

o Numeric

o Integer

o Character/String

o Date/POSIXct

o Logical (TRUE/FALSE)

Class function

> class(x)

[1] "numeric"

> i<-5L

> i

[1] 5

> is.integer(i)

[1] TRUE

> is.numeric(i)

[1] TRUE

Data Frames: On the surface, a data frame is like an Excel spreadsheet with rows and columns.

Command: data.frame()

> x<-c(3,5,7,9)

> y<-c("tt","rr","ii","ee")

> z<- -4:-1

> thedf<-data.frame(x,y,z)

> thedf

  x  y  z
1 3 tt -4
2 5 rr -3
3 7 ii -2
4 9 ee -1

Checking attributes

> nrow(thedf)

[1] 4

> ncol(thedf)

[1] 3

> dim(thedf)

[1] 4 3

Since each column of a data frame is an individual vector, it can be accessed individually, and each column has its own class.

>thedf$x

[1] 3 5 7 9

> thedf[2,3]

[1] -3

List

> list(1,2,3)

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> list(c(1,2,3)) # Create a single-element list whose one element is a vector

[[1]]
[1] 1 2 3
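The difference between single and double brackets when accessing list elements can be sketched like this:

```r
l <- list(a = 1:3, b = "text")
l[["a"]]   # double brackets return the element itself: 1 2 3
l$b        # $ does the same by name: "text"
l["a"]     # single brackets return a one-element sublist
```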

More on Lists

Matrices

A matrix is a mathematical structure similar to a data frame, with rows and columns, but all data are of the same type (usually numeric).

> m<-matrix(1:20,nrow=5)

> n<-matrix(21:40,nrow=5)

> ncol(m) # Number of columns

[1] 4

Matrix Operations

>m+n #Addition

>m*n #Multiplication

>m ==n # Test equality


Array

An array is a multi-dimensional generalization of a matrix whose elements are accessed using square brackets. The first element is the row index, the second is the column index, and the remaining elements are the outer dimensions.

> thearray
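Since thearray itself is not defined in these notes, here is a self-contained sketch of creating and indexing an array:

```r
a <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 layers
a[1, 2, 3]                          # row 1, column 2, layer 3: 15
dim(a)                              # 2 3 4
```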

Basic Statistics

> x<-sample(x=1:100,size=100,replace=TRUE)

> x

> mean(x)

[1] 50.5

> median(x)

[1] 50.5

> summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   25.75   50.50   50.50   75.25  100.00

(Note: with replace=TRUE the sample is random, so your statistics will differ from run to run; the output shown matches a plain permutation of 1:100, which is what sample() returns when replace=FALSE.)

stat.desc(), from the pastecs package, computes a table giving various descriptive statistics about the series in a data frame or in a single/multiple time series:

> stat.desc(x)

Arguments

basic: do we have to return basic statistics (by default, TRUE)? These are: the number of values (nbr.val), the number of null values (nbr.null), the number of missing values (nbr.na), the minimal value (min), the maximal value (max), the range (range, that is, max-min) and the sum of all non-missing values (sum).

desc: do we have to return various descriptive statistics (by default, TRUE)? These are: the median (median), the mean (mean), the standard error on the mean (SE.mean), the confidence interval of the mean (CI.mean) at the p level, the variance (var), the standard deviation (std.dev) and the variation coefficient (coef.var), defined as the standard deviation divided by the mean.

norm: do we have to return normal distribution statistics (by default, FALSE)? These are: the skewness coefficient g1 (skewness), its significance criterion (skew.2SE, that is, g1/2·SE(g1); if skew.2SE > 1, then skewness is significantly different from zero), the kurtosis coefficient g2 (kurtosis), its significance criterion (kurt.2SE, same remark as for skew.2SE), the statistic of a Shapiro-Wilk test of normality (normtest.W) and its associated probability (normtest.p).

SquareRoot

> sqrt(x)

Variance

> var(x)

Weighted Mean:

> grades<-c(95,72,87,66)
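The grades vector above has no weights in the notes; assuming hypothetical weights of 0.4, 0.2, 0.2, 0.2, a weighted-mean sketch:

```r
grades  <- c(95, 72, 87, 66)
weights <- c(0.4, 0.2, 0.2, 0.2)  # hypothetical weights, not from the notes
weighted.mean(grades, weights)    # sum(grades * weights) / sum(weights) = 83
```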

Correlation:

> cor(Orange$age, Orange$circumference)

[1] 0.9135189

Load RODBC

>head(mtcars)

> mydata

>round(res, 2)

> cor(economics[,c(2,4:6)])

> cor(economics[,c(2:6)])

> install.packages("Hmisc")

> require(Hmisc)

> res2 <- rcorr(as.matrix(mydata))

> res2

The output of the function rcorr() is a list containing the following elements:

- r: the correlation matrix
- n: the matrix of the number of observations used in analyzing each pair of variables
- P: the p-values corresponding to the significance levels of the correlations

>res2$r

# Extract p-values

>res2$P

Regression

Regression analysis is a tool for building statistical models that characterize relationships among

a dependent variable and one or more independent variables, all of which are numerical.

The analysis is carried out through the estimation of a relationship and the results serve

the following two purposes:

1. Answer the question of how much y changes with changes in each of the x's (x1,

x2,...,xk),

First prepare a scatter plot to verify the data has a linear trend.

Given one variable, regression tells us what to expect of the other variable.

Y = a + bX

Y = dependent variable

a = intercept

b = slope

Scatter Plots and Correlation

A scatter plot (or scatter diagram) is used to show the relationship between two variables.

Y = b0 + b1X

where

b0 is the intercept

b1 is the slope

The most widely used criterion for measuring the goodness of fit of a line is the sum of squared errors. The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.

The slope of a regression line represents the rate of change in y as x changes. Because y is

dependent on x, the slope describes the predicted values of y given x.

Linear Relations

• We know from algebra lines come in the form y = mx + b, where m is the slope and b is

the y-intercept.

• In statistics, we use y = a + bx for the equation of a straight line. Now a is the intercept

and b is the slope.

• The slope (b) of the line, is the amount by which y increases when x increase by 1 unit.

• The intercept (a), sometimes called the vertical intercept, is the height of the line when x

= 0.

> x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

> y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

> relation <- lm(y~x)

> print(summary(relation))

Test

Regression using dataset

> relation3

> summary(relation2)

#Outcome= Intercept+Slope

#Objective is to build a simple regression model that we can use to predict Distance

(dist) by establishing a statistically significant linear relationship with Speed (speed).

1. Scatter plot: Visualize the linear relationship between the predictor and response

2. Box plot: To spot any outlier observations in the variable. Having outliers in your

predictor can drastically affect the predictions as they can easily affect the direction/slope

of the line of best fit.

3. Density plot: To see the distribution of the predictor variable. Ideally, a close to

normal distribution (a bell shaped curve), without being skewed to the left or right is

preferred. Let us see how to make each one of them.

> head(cars)

> boxplot(cars$speed, main="Speed", sub=paste("Outlier rows:", boxplot.stats(cars$speed)$out)) # box plot for 'speed'

Density Plot

> plot(density(cars$speed), main="Density Plot: Speed", sub=paste("Skewness:", round(e1071::skewness(cars$speed), 2))) # density plot for 'speed'

> polygon(density(cars$speed), col="red")

> plot(density(cars$dist), main="Density Plot: Distance", sub=paste("Skewness:", round(e1071::skewness(cars$dist), 2))) # density plot for 'dist'

> polygon(density(cars$dist), col="red")

> cor(cars$speed, cars$dist)

[1] 0.8068949

> head(cars)

> linearMod <- lm(dist ~ speed, data=cars)

> print(linearMod)

dist = Intercept + (β ∗ speed)

=> dist = −17.579 + 3.932∗speed
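As a sketch of the cross-check, the fitted equation above can be reproduced from the built-in cars dataset:

```r
linearMod <- lm(dist ~ speed, data = cars)            # simple linear regression of dist on speed
round(coef(linearMod), 3)                             # intercept -17.579, slope 3.932
predict(linearMod, newdata = data.frame(speed = 20))  # plug speed = 20 into the fitted equation
```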

Cross Check

Multiple Regression

Multiple regression is an extension of linear regression to relationships among more than two variables. In multiple regression we have more than one predictor variable and one response variable.

In the call lm(formula, data), formula is a symbol presenting the relation between the response variable and the predictor variables.

Multiple Regression

Objective: The goal of the model is to establish the relationship between "mpg" as a

response variable with "disp","hp" and "wt" as predictor variables.

Solution: We create a subset of these variables from the mtcars data set for this purpose.

> input <- mtcars[,c("mpg","disp","hp","wt")]

> head(input)

> model <- lm(mpg~disp+hp+wt, data = input) # Directly also possible without creating

subset

>a <- coef(model) # Get the Intercept and coefficients as vector elements.

>print(a)

Final Equation

Y = a+Xdisp.x1+Xhp.x2+Xwt.x3

or

Y = 37.15+(-0.000937)*x1+(-0.0311)*x2+(-3.8008)*x3

For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is −

Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91 = 22.7104

Logistic Regression

In logistic regression, we are only concerned about the probability of outcome dependent

variable ( success or failure).

criteria:

The Logistic Regression is a regression model in which the response variable (dependent

variable) has categorical values such as True/False or 0/1.

y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

family is R object to specify the details of the model. It's value is binomial for logistic

regression.

> input1 <- mtcars[,c("am","cyl","hp","wt")]

> am.data <- glm(am ~ cyl + hp + wt, data = input1, family = binomial) # Directly also possible without creating a subset

> print(summary(am.data))

In the summary as the p-value in the last column is more than 0.05 for the variables "cyl"

and "hp", we consider them to be insignificant in contributing to the value of the variable

"am".

Only weight (wt) impacts the "am" value in this regression model.

Cross Check

-11.9483662=19.70288+(-9.14947 *3.460)

Statistical Functions

colKurtosis() calculates the kurtosis of each column.

Unsupervised Learning:

A machine learning technique that seeks to determine how the data are organized. It is distinguished from supervised learning (and reinforcement learning) in that the learner is given only unlabeled examples.

Supervised Learning:

A machine learning technique whereby a system uses a set of training examples to learn

how to correctly perform a task

Clustering

Clustering is the most common unsupervised learning technique. It draws inferences from data sets consisting of input data without labeled responses, searching for hidden patterns, generative features, and groupings inherent in a set of examples. The task is to group a set of objects in such a way that observations in the same cluster are similar in some sense to each other, and dissimilar to the observations outside of it. Clustering is a main task of exploratory data analysis used in many fields.

Clustering is very much important as it determines the intrinsic grouping among the

unlabeled data present. There are no criteria for a good clustering. It depends on the

user’s need

Applications of Clustering in different fields

1. Marketing : It can be used to characterize and discover customer segments for marketing purposes.

2. Biology : It can be used for classification among different species of plants and

animals.

3. Libraries : It is used in clustering different books on the basis of topics and

information.

4. Insurance : It is used to acknowledge the customers, their policies and identifying the

frauds.

5. City Planning : It is used to make groups of houses and to study their values based on

their geographical locations and other factors present.

6. Earthquake studies : By learning the earthquake affected areas we can determine the

dangerous zones.

Clustering Types

Clustering plays a big role in machine learning by partitioning the data into groups.

2 Major types:

Hierarchical Clustering

K-Means Clustering

K-means clustering

K-means clustering partitions the objects into K groups based on their attributes/features, where K is a positive integer. The grouping is done by minimizing the sum of squares of distances between the data and the corresponding cluster centroid. Thus the purpose of K-means clustering is to classify the data.

             weight index   pH
Medicine A        1          1
Medicine B        2          1
Medicine C        4          3
Medicine D        5          4

We also know before hand that these objects belong to two groups of medicine (cluster 1

and cluster 2). The problem now is to determine which medicines belong to cluster 1 and

which medicines belong to the other cluster.

             weight   index   pH
Medicine A        1       1    1
Medicine B        2       1    1
Medicine C        4       3    2
Medicine D        5       4    2
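A sketch of clustering the medicine toy data with base R's kmeans(); the two-attribute version (weight index and pH) is used, and the cluster numbering is arbitrary:

```r
medicines <- data.frame(weight = c(1, 2, 4, 5),
                        ph     = c(1, 1, 3, 4),
                        row.names = c("A", "B", "C", "D"))
set.seed(1)                        # k-means starts from random centers
fit <- kmeans(medicines, centers = 2)
fit$cluster                        # A and B share one cluster; C and D share the other
```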

Hierarchical Clustering

Hierarchical clustering involves creating clusters that have a predetermined ordering from

top to bottom.

Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent data,

where each group (or “node”) links to two or more successor groups.

The groups are nested and organized as a tree, which ideally ends up as a meaningful

classification scheme.

Each node in the cluster tree contains a group of similar data; Nodes group on the graph

next to other, similar nodes.

Clusters at one level join with clusters in the next level up, using a degree of similarity;

The process carries on until all nodes are in the tree, which gives a visual snapshot of the

data contained in the whole set.

The total number of clusters is not predetermined before you start the tree creation.

Dendrogram

A dendrogram is a tree diagram that shows the hierarchical relationship between similar sets of data. They are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.

The clade is the branch. Usually labeled with Greek letters from left to right (e.g. α β,

δ…).

Each clade has one or more leaves. The leaves in the above image are:

Single (simplicifolius): F

Double (bifolius): D E

Triple (trifolious): A B C

Dendrogram Interpretation

The clades are arranged according to how similar (or dissimilar) they are. Clades that are

close to the same height are similar to each other; clades with different heights are

dissimilar — the greater the difference in height, the more dissimilarity (you can

measure similarity in many different ways; One of the most popular measures is

Pearson’s Correlation Coefficient).

Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.

Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.

Note that on the above graph, the same clade, β, joins leaves A, B, C, D, and E. That means that the two groups (A,B,C & D,E) are more similar to each other than they are to F.

Hierarchical Clustering

> car1<-hclust(d=dist(mtcars))                  # default linkage: "complete"
> plot(car1)
> car2<-hclust(dist(mtcars),method="single")    # single linkage
> plot(car2)
> car3<-hclust(dist(mtcars),method="complete")  # complete linkage
> plot(car3)
> car4<-hclust(dist(mtcars),method="average")   # average linkage
> plot(car4)


The tree produced by clustering can also be cut so that the observations are split into a predefined number of groups (see cutree() and rect.hclust()).
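A minimal illustration of cutting into predefined groups with cutree() on the mtcars tree (my example, not the slides' code):

```r
# Cut the mtcars cluster tree into a predefined number of groups
car1   <- hclust(d = dist(mtcars))
groups <- cutree(car1, k = 4)   # ask for exactly 4 groups
table(groups)                   # how many cars land in each group
```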

Where to cut

> install.packages("rattle.data")

> library(rattle.data)

> head(wine)

> wine1<-hclust(d=dist(wine))

> plot(wine1)

> rect.hclust(wine1,h=600,border="orange")  # draw boxes around the clusters formed by cutting the tree at height 600

K-Means Clustering

K-means clustering partitions the observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

> wine2<-kmeans(x=wine[,-1],centers=3)  # drop the first column (Type), a factor; kmeans() needs numeric data

> print(wine2)

> require(useful)

> plot(wine2,data=wine)
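Since the wine data requires the rattle.data package, here is the same call sketched on a built-in dataset (iris, numeric columns only); set.seed() makes the random cluster assignment reproducible:

```r
# K-means on the four numeric columns of iris, standing in for the wine data
set.seed(42)
km <- kmeans(x = iris[, 1:4], centers = 3, nstart = 25)
km$size      # number of observations in each cluster
km$centers   # the three cluster means, the "prototypes"
```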

Hartigan’s Rule

Hartigan’s rule compares the within-cluster sum of squares for k and k+1 clusters and suggests adding another cluster as long as the resulting Hartigan number exceeds 10.

> wine4<-FitKMeans(wine[,-1], max.clusters=20, nstart=25, seed=454356)
> wine4
> PlotHartigan(wine4)
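FitKMeans() comes from the useful package; the Hartigan number itself can be sketched in base R (my helper for illustration, not the package's internals). Add another cluster while the value exceeds 10:

```r
# Hypothetical helper computing Hartigan's number for k vs. k+1 clusters
hartigan <- function(x, k) {
  set.seed(1)
  wk  <- kmeans(x, centers = k,     nstart = 25)$tot.withinss
  wk1 <- kmeans(x, centers = k + 1, nstart = 25)$tot.withinss
  (wk / wk1 - 1) * (nrow(x) - k - 1)
}
hartigan(iris[, 1:4], 2)   # well above 10: a 3rd cluster is worth adding
```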

Data Import in R

For Stata and Systat, use the foreign package. For SPSS and SAS use the Hmisc

package for ease and functionality.

If you have a .txt or a tab-delimited text file, you can easily import it with the basic R function read.table().

> df<-read.table("data.txt", header = FALSE)   # "data.txt" is a placeholder for your file
> df

Group Manipulation

Apply

apply() operates on the rows or columns of a matrix. A matrix can contain character, numeric, or logical elements, but only one type at a time.

# Create a matrix
> thematrix<-matrix(1:9,nrow=3)
> thematrix
# Sum of columns
> apply(thematrix,2,sum)
[1] 6 15 24

# Sum of rows

> apply(thematrix,1,sum)

[1] 12 15 18

OR

> rowSums(thematrix)

[1] 12 15 18

> colSums(thematrix)

[1] 6 15 24

Lapply

lapply() applies a function to each element of a list and returns the results as a list.

> thelist<-list(A=matrix(1:9,3),B=1:5,C=matrix(1:4,2),D=2)
> thelist
> lapply(thelist,sum)

Sapply

sapply() is the same as lapply() except that it simplifies the result, here to a named vector.

> sapply(thelist,sum)
 A  B  C  D 
45 15 10  2

Since a vector is technically a form of list, lapply and sapply can also take a vector as input.

> thenames<-c("Jared","Deb","Paul")   # example character vector
> lapply(thenames,nchar)

Mapply

mapply() applies a function to corresponding elements of two or more lists.

> firstlist<-list(A=matrix(1:16,4),B=matrix(1:16,2),C=1:5)

> secondlist<-list(A=matrix(1:16,4),B=matrix(1:16,8),C=15:1)

> firstlist

>secondlist

> mapply(identical,firstlist,secondlist)
    A     B     C 
 TRUE FALSE FALSE

> simpleFunc<-function(x,y){
+   NROW(x)+NROW(y)
+ }

> mapply(simpleFunc,firstlist,secondlist)

A B C

8 10 20

Aggregate Function

> require(ggplot2)   # for the diamonds dataset
> library(plyr)

> head(diamonds)

# Calculate average price of each type of cut

> aggregate(price~cut,diamonds,mean)

        cut    price
1      Fair 4358.758
2      Good 3928.864
3 Very Good 3981.760
4   Premium 4584.258
5     Ideal 3457.542

> aggregate(price~cut+color,diamonds,mean)
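The same pattern works on any built-in dataset; for instance, the average mpg per number of cylinders in mtcars (my example, in case ggplot2's diamonds is unavailable):

```r
# One row per cylinder count, with the mean mpg for each group
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
agg
```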

Model Diagnostics

Step 1: Analysis of residuals (the difference between the actual response and the fitted values).

Step 2: Import the data, either from a URL such as
https://data.montgomerycountymd.gov/api/views/2qd6-mr43/rows.csv?accessType=DOWNLOAD
or from a local CSV file:

> library(readr)
> USA_Housing<-read_csv(".../Analytics/USA_Housing.csv")

Data introduction

> View(USA_Housing)

> names(USA_Housing)<-c("avgincome","avghouseage","avgnorooms","avgbedrooms","population","price","address")

> head(USA_Housing)

Model Building

#Simple Regression/OLS

> house1<-lm(price~avghouseage,data=USA_Housing)

> house1

> house2<-lm(price~avghouseage+avgnorooms,data=USA_Housing)

> house3<-lm(price~avghouseage+avgnorooms+avgbedrooms,data=USA_Housing)

> house4<-lm(price~avghouseage+avgnorooms+avgbedrooms+population,data=USA_Housing)
> house5<-lm(price~avghouseage+avgnorooms+avgbedrooms+population+avgincome,data=USA_Housing)

Visualizations of Regression

> library(coefplot)
> library(ggplot2)

> coefplot(house1)

> coefplot(house2)

> coefplot(house3)

> coefplot(house4)

> coefplot(house5)

Visualizations of Regression

> plot(house1,which=2)  # Q-Q plot: if the model is a good fit, the standardized residuals fall along a straight line when plotted against the theoretical quantiles of the normal distribution
> plot(house2,which=2)

> plot(house3,which=2)

> plot(house4,which=2)

> plot(house5,which=2)

Visualizations of Regression

# Histogram of residuals

> ggplot(house1,aes(x=.resid))+geom_histogram()

> ggplot(house2,aes(x=.resid))+geom_histogram()

> ggplot(house3,aes(x=.resid))+geom_histogram()

> ggplot(house4,aes(x=.resid))+geom_histogram()

> ggplot(house5,aes(x=.resid))+geom_histogram()

Comparing Models

# Use anova() to return a table of results including the residual sum of squares (RSS), a measure of error: the lower, the better.

> anova(house1,house2,house3,house4,house5)

The problem with RSS is that it always improves when additional variables are added, so it rewards model complexity.

(1) The Akaike information criterion (AIC) penalizes model complexity; the model with the lowest AIC is preferred. (2) The Bayesian information criterion (BIC), or Schwarz information criterion (also SIC, SBC, SBIC), is a criterion for model selection among a finite set of models; again, the model with the lowest BIC is preferred.

> AIC(house1,house2,house3,house4,house5)

> BIC(house1,house2,house3,house4,house5)
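Since USA_Housing is a local file, the same comparison can be sketched on a built-in dataset (my example): AIC and BIC drop when a genuinely useful predictor is added.

```r
# Two nested models of fuel efficiency; adding horsepower improves the fit
m1 <- lm(mpg ~ wt,      data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)   # lower is better; m2 wins here
BIC(m1, m2)
```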

# Rule of thumb (Andrew Gelman): for every variable added to the model, the deviance should drop by two. This applies even for categorical/factor variables.

Step 1: The data is broken into k non-overlapping sections.

Step 2: A model is fitted on k-1 sections combined, holding one section out.

Step 3: The model is then used to make predictions about the kth (held-out) section of data.

Step 4: This is repeated k times, so that every section is held out for testing once and included in the model fitting k-1 times.

K Fold

> library(boot)

> houseg1<-glm(price~avghouseage,data=USA_Housing, family=gaussian(link="identity"))
> identical(coef(house1),coef(houseg1))
[1] TRUE

> housecv1<-cv.glm(USA_Housing,houseg1,K=5)

> housecv1$delta

The first value is the raw cross-validation error (mean squared error); the second is the adjusted cross-validation error.
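A self-contained sketch of the same K-fold call on built-in data (boot ships with R, so this runs without the housing file):

```r
library(boot)   # for cv.glm()
set.seed(123)   # the K-fold splits are random

# Gaussian glm() is equivalent to lm(), as noted above
g1  <- glm(mpg ~ wt, data = mtcars, family = gaussian(link = "identity"))
cv1 <- cv.glm(mtcars, g1, K = 5)
cv1$delta       # raw CV error (MSE) first, adjusted CV error second
```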

> houseg2<-glm(price~avghouseage+avgnorooms,data=USA_Housing, family=gaussian(link="identity"))
> houseg3<-glm(price~avghouseage+avgnorooms+avgbedrooms,data=USA_Housing, family=gaussian(link="identity"))
> houseg4<-glm(price~avghouseage+avgnorooms+avgbedrooms+population,data=USA_Housing, family=gaussian(link="identity"))
> houseg5<-glm(price~avghouseage+avgnorooms+avgbedrooms+population+avgincome,data=USA_Housing, family=gaussian(link="identity"))

> housecv2<-cv.glm(USA_Housing,houseg2,K=5)
> housecv3<-cv.glm(USA_Housing,houseg3,K=5)
> housecv4<-cv.glm(USA_Housing,houseg4,K=5)
> housecv5<-cv.glm(USA_Housing,houseg5,K=5)

> cvresults<-as.data.frame(rbind(housecv1$delta,housecv2$delta,housecv3$delta,housecv4$delta,housecv5$delta))
> cvresults

