
How to use R

From Installation to Text Analytics

Copyright (c) 2014 Redwood Associates Business Solutions Private Limited


All rights reserved
Table of Contents
From the Team
Copyright
Something helpful

Section 1
PART 1 WHAT IS R
Introduction
What is R?
Why R?
Summary
PART 2 INSTALLATION
Introduction
How can R be installed?
Overview of the R GUI
Installation of R Studio
Overview of R Studio GUI
Summary

Section 2
PART 1 DATA TYPES
Introduction
Why are DATA TYPES important?
Data types
Creating data types in R
Summary
PART 2 DATA STRUCTURES & VECTORS
Introduction
What is a data structure
What is the difference between data structure and data type
Type of data structure: Vectors
How to create a Vector in R
Mixing up data types in a Vector
Replacing the contents of a Vector
Arithmetic functions between Vectors
Identifying elements in a Vector
Replacing contents in a Vector
Using a function to Index
Speeding up the task with Operators
Summary
PART 3 DATAFRAMES
Introduction
What is a Dataframe
Creating a Dataframe in R
Functions that can be carried out with Dataframes
Summary
PART 4 LIST & MATRIX
Introduction
What is a List
What is a Matrix
Creating a List in R
Creating a Matrix in R
Creating a Matrix out of a Dataframe
Summary
PART 5 FACTORS
Introduction
What is a factor
How to create a factor in R
Summary

Section 3
PART 1 PACKAGES
Introduction
What is a Package
Installing and Loading a Package
Importing an Excel file into R
Importing a CSV file into R
Summary
PART 2 Exporting and Reading data in R
Introduction
Exporting to Excel
Exporting to CSV
Reading a file in R
Summary

Section 4
PART 1 Logical operators and If condition
Introduction
What is a logical operator
How to execute a logical operator in R
What is IF Condition in R
Summary
PART 2 Merging data
Introduction
What is merging of data
What are the different ways to merge data
How to carry out a merge in R
Summary

Section 5
PART 1 Understanding text analytics
Introduction
What is text analytics
How is Text Analytics useful
Important terms in Text Analytics
What is needed to create a Word Cloud
Summary
PART 2 Understanding text analytics
Introduction
How to create a Word Cloud
Step 1: Creating a dataframe
Step 2: Installing the tm package
Step 3: Understanding TDM
Step 4: Creating a corpus
Step 5: Cleaning the corpus
Step 6: Creating the term document matrix
Step 7: Calculating frequencies
Step 8: Creating the word cloud
Summary

From the Team
A note from the Online Education team

ANALYTICS TRAINING INSTITUTE

Hello,

Thanks to the many students who have signed up for our courses, we are delighted to
offer all our online lectures as downloadable material. We know that learning should
be continuous, so through this material we hope that you will take your time within
your busy schedule to really understand the concepts and techniques of this
fascinating open source tool R!

We have tried to make your learning easy by highlighting key takeaways and screen
grabs of the tool so that you can continue your learning offline as well.

We welcome you to post any comments or questions on this material in the
Discussion forum, as many of you have been doing, or just reach out to us at
help@analyticstraining.in.

Enjoy the learning experience and thank you for choosing us to be your partner in
your journey of discovery!

The team at ATI

Copyright

(c) 2014 Redwood Associates Business Solutions Private Limited

All rights reserved. Without limiting rights under the copyright reserved above, no
part of this publication may be reproduced, stored, introduced into a retrieval
system, distributed or transmitted in any form or by any means, including without
limitation photocopying, recording, or other electronic or mechanical methods,
without the prior written permission of the publisher, except in the case of brief
quotations embodied in critical reviews and other noncommercial uses permitted by
copyright law. The scanning, uploading, and/or distribution of this document via the
internet or via any other means without the permission of the publisher is illegal and
punishable by law. Please do not participate in or encourage electronic piracy of
copyrightable materials.

For permission requests, email help@analyticstraining.in

SOMETHING HELPFUL

Here are a few things that you would probably find helpful before you begin.

Sections:
There are 5 sections in this material starting with the installation of R and R Studio
right up to generating a Word Cloud in R which showcases the text mining capabilities
of R.

Online videos:
This material is a supplement to the online videos available on
https://www.udemy.com/analyticstraining/?dtcode=VQRaQsx1KWR2
This material corresponds to the section on R.

Material:
The online class format supports downloadable material for each section. So perhaps
it would be a good idea to check each section for additional downloadable material
like case studies or sample data to work on and so forth.

Ready to begin?

Section 1
Overview and Installation

PART 1 WHAT IS R

INTRODUCTION

R is an open source statistical tool that not only manages data but also carries out
many sophisticated analytical processes.
Before looking at how R works, it is important to get a good overview of R.

So, here's what will be covered in this tutorial:

- What is R?
- Why R?

WHAT IS R?

So, to begin, let's start with a very basic question. What is R?

1. R, as we already know, is a statistical tool that is on par with other tools
used for statistical work, such as SAS, SPSS and Python, in terms of what it can do.

2. R can manage and analyse data. It can execute a wide range of statistical
techniques, such as linear regression, logistic regression, forecasting and
decision trees.

WHY R?

So what makes R stand out when compared to other statistical tools? Let us break it
down.
1. Firstly, R can work with many types of data, from small datasets to very large
ones. By default R holds data in memory, so the practical limit depends on the
machine being used.
2. R can work with data received in many file formats, whether text, CSV,
SAS and so on.

3. R offers really great visualization of data. It can connect with Google Maps and
motion charts.

4. Next, and this is what makes R so much more powerful than other statistical
tools, it is open source. Open source does not just mean that it can be used
for free, but that anyone can contribute to it as well.

5. R does not use much code, even if it is handling large volumes of data or
carrying out complicated analytical techniques.

6. As was mentioned earlier, R being open source means anyone can contribute
to it. This is why R has a huge community of contributors who keep adding
functionality to it almost daily. This is the reason why even the
most complicated techniques can be executed in R by just calling a function.
So, when using R, we as users do not need to worry about how to perform a
linear regression or a logistic regression. The code to execute this and many
other advanced analytical functions is already built in and refined by those in
the R community on a regular basis.

7. R is used by a lot of big corporations like Facebook, Google, Mozilla, Lloyds
and Merck, among others. This goes a long way in validating the capability of
R and adds to its credibility.
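
To illustrate the point about running a technique with a single function call, here is a minimal sketch of a linear regression. It assumes the mtcars dataset that ships with R; the column names mpg and wt belong to that dataset and are not part of this material.

```r
# Fit a linear regression of fuel efficiency (mpg) on weight (wt)
# using the built-in mtcars dataset (chosen only for illustration).
fit <- lm(mpg ~ wt, data = mtcars)

# Inspect the estimated intercept and slope.
coef(fit)

# A fuller report: standard errors, R-squared and so on.
summary(fit)
```

The whole model is fitted by the single call to lm(); the remaining lines only display the result.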

SUMMARY

In this material, we covered the reasons which make R a powerful statistical tool.

To summarize,

R is an open source statistical tool that can be used freely by anyone.

It is improved upon every day by a large community of contributors who
keep adding new code and functions to it.

R can work with big or small data, and also with the different formats in which
data is usually presented.

It does not use much code and offers great data visualization making it a
popular statistical tool in many global corporations.

PART 2 INSTALLATION

INTRODUCTION

To start leveraging the power of R, it first needs to be installed.

So, here's what will be covered in this tutorial:

- Installation of R
- Overview of a typical GUI or Graphical User Interface of R
- Installation of R Studio
- Overview of the GUI of R Studio

HOW CAN R BE INSTALLED?

To begin, let's look at how to install R.

To install R, click on the link displayed:

http://cran.r-project.org/bin/windows/base/old/3.0.2

On opening this link, different options to download R are available based on system
configuration and operating system, such as R for a 32-bit system, R for a 64-bit
system or R for Windows.
Download the version of R that is best suited to the operating system being used.

When you update your version of R, the earlier version is NOT automatically
uninstalled. Further, R Studio allows you to run multiple versions of R (though not in
the same session). Therefore, in R Studio, find out which version of R is running by typing
R.Version().
The default version of R that R Studio runs can be changed from Tools > Options > R
General.
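
As a small sketch of the version check mentioned above, the following lines can be run in the Console. R.Version() returns a list, and the built-in R.version.string gives the same information as a single line of text.

```r
# Full version string of the R session that is currently running.
R.version.string

# R.Version() returns a list; major and minor hold the version parts.
v <- R.Version()
paste(v$major, v$minor, sep = ".")
```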

Before proceeding with the rest of this tutorial, we suggest that you download R if
you have not already done so.

OVERVIEW OF THE R GUI

Shown here is a typical R GUI or Graphical User Interface.

At first glance, all that is visible on the R GUI is a single screen, which is known as the
Console.
The Console can be used to input data as well as view output. But we recommend
that the Console be used only to view output.
Commands or inputs in R are referred to as Scripts. To write a script, go to File and
select New Script.

A new blank window called R Editor opens.

Think of a Script in R as code or syntax that is written in order to tell R what it needs
to execute.
For example, let's enter a = 1, which means R is being told to create a variable a and store
a value of 1 against it.

To execute this script or code, press Control + Enter. As shown in the image below,
the command is executed and the output is displayed in the Console.

The output a=1 is displayed in red in the Console. So, essentially input is specified in R
Editor and the output is displayed in the Console.

INSTALLATION OF R STUDIO

A more user friendly option available to users is R Studio. It has a better GUI and
comes with more options.

To take a more detailed look at R Studio, let us first install it.

To download R Studio, click on the link displayed on the screen:

http://www.rstudio.com/ide/download/

We recommend that the installation of R Studio be completed before proceeding with this
tutorial.


OVERVIEW OF R STUDIO GUI

As shown in the image, there are 4 sections in R Studio.

Let's briefly go through each section.

Section 1: Editor

The first section is the Editor section where the script or code that R needs to execute
is written. To add more than one script, use the plus sign on the top left hand corner.
It is possible to add as many scripts as required using this option.

Using the example looked at earlier, the script or code a = 1 is entered. To execute
this code, press Control + Enter.

Section 2: Console

The output appears in the Console window which can be found right below the Editor
window. When values appear in the Console section it means that the script or the
code has been executed.

Section 3: Workspace

To the right of the Editor section, is the Workspace section, where the data being
worked on can be viewed.

This includes even data that has been imported from an external source.

In the example used, a new variable a with a value of 1 was created. Since this is
the data currently being worked on, both a and 1 are displayed in the Workspace
section.

Section 4: Files, Plots, Packages and Help

The other sections in R Studio are Files, Plots, Packages and the Help section.

The Help section helps in locating functions in R Studio. In the Search field, type what
is being searched for and press Enter. For example, if plot is entered in Search, everything
related to it is displayed just below.
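
Help can also be reached from code instead of the Search field. The sketch below uses standard R functions; the search term plot is only an example.

```r
# Open the help page for a function whose name is known (same as ?plot).
# Guarded so that it only runs in an interactive session such as R Studio.
if (interactive()) help("plot")

# Search installed documentation when the exact name is unknown (same as ??plot).
if (interactive()) help.search("plot")

# List objects on the search path whose names contain "plot".
matches <- apropos("plot")
head(matches)
```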

The other tabs available are Packages (which will be covered later), Plots, and Files,
which displays all the files that are currently being worked on.

For the rest of the series of tutorials on R, we will be working with R Studio as it has a
better GUI than R Editor.

SUMMARY

We are now ready to start using R to manage data and carry out other types of
actions on data.

To summarize:

The links to install both R and R Studio have been provided.

An overview of the typical GUI of R has been looked at. Since individual
screens need to be opened, a better option is R Studio.

R Studio is more user friendly as all the relevant sections are available at a
single glance removing the need to have multiple screens open at a time.

There are 4 sections in R Studio.


a. The first section is the Editor section which is used to enter scripts or
codes for R to execute.
b. The second section is the Console where output is displayed.
c. The data that is generated or being worked upon can be found in the
Workspace section.
d. Files, packages and Help make up the fourth section.

Section 2
Data Types and Data Structures

PART 1 DATA TYPES

INTRODUCTION

What is Analytics without data? Likewise how can you leverage the amazing
capabilities of R without understanding data?

This tutorial is an in-depth look at types of data or data types.

So, here's what will be covered in this tutorial:

- Understanding why data types are important
- Different data types
- Creating data types in R

WHY ARE DATA TYPES IMPORTANT?

To begin, it is important to understand why data types are useful and why it is
necessary to be able to distinguish between different types of data.

Suppose you have been asked to evaluate five different brands of cars. Let us call
them Brand A, Brand B, Brand C, Brand D and Brand E. If you were asked to calculate
the mean of these five cars, how would you go about it?
It would be an impossible operation to carry out because all you have is
the name or the brand of these cars and, as you know, you cannot calculate the mean
of names!
Now, the situation would have been different if you had some numeric data about
these cars. This emphasizes the need to understand the type of data you have to
work with, because only certain types of functions can be carried out on certain types of
data. For example, calculating a mean is not possible with character data types like names or
brands.
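
The car example can be tried directly in R. In this sketch the prices vector is hypothetical data added only for illustration.

```r
# Character data: the five brand names from the example above.
brands <- c("Brand A", "Brand B", "Brand C", "Brand D", "Brand E")

# Numeric data: hypothetical prices, invented for this illustration.
prices <- c(21.5, 18.0, 30.2, 25.4, 19.9)

class(brands)   # "character"
class(prices)   # "numeric"

# mean() cannot average names: it returns NA and raises a warning,
# which is suppressed here so the script runs quietly.
brand_mean <- suppressWarnings(mean(brands))
brand_mean

# mean() works as expected on numeric data.
price_mean <- mean(prices)
price_mean
```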

DATA TYPES

Data can be of different types. The different types of data one would commonly
come across are:

Numeric:

Refers to any number or numeric value.

Eg: 1.2, 2.1 etc

Numeric data types include even decimals.

Integer:

Refers to any number without a fractional part.

Eg: 1, 2, 3..

Logical:

Refers to any values which are either True or False.

Eg: if x = 1, y = 2, then x being greater than y is False

Character:

Refers to textual data.

Eg: learning, education.

Factor:

Refers to data in categories.

Eg: City, Gender

Each data type will now be discussed in some detail.

1.1 Numeric data types

Numeric data type is any number or numeric value like 2.1, 1.2 and so on. It could be
an integer or a decimal value.

In R Studio, to create a numeric data type the syntax y <- 3.1 (read as y is assigned 3.1) is
used. This means that a variable y is being created against which a numeric value of
3.1 is being stored.

To assign a value to a variable we can use either the symbol <- or =
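
The two assignment forms can be compared in a short sketch. The variable z is introduced here only for the comparison.

```r
y <- 3.1          # the arrow form, preferred by most R style guides
z = 3.1           # the equals form, which also works at the top level
identical(y, z)   # TRUE: both variables hold the same value

# Inside a function call, = names an argument rather than assigning.
# Here digits = 0 is passed to round().
rounded <- round(y, digits = 0)
rounded
```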

1.2 Integer data types

Integer data type indicates any data which stores integer values.

In R Studio, numeric data types can be converted to integer data types by using the
following syntax:
as.integer(numeric value)
Eg: as.integer(3.1)
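
One detail worth seeing in practice is that as.integer() drops the fractional part instead of rounding. A minimal sketch:

```r
as.integer(3.1)    # 3
as.integer(3.9)    # 3: the fraction is dropped, not rounded
as.integer(-3.9)   # -3: truncation is toward zero
round(3.9)         # 4: use round() when rounding is intended
```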

1.3 Logical data types

Logical data type indicates any data where the value is either True or False, but never
both.

In R Studio, a logical value is created whenever a comparison is evaluated:
x <- 1, y <- 2, then x > y is FALSE
(x is assigned 1, y is assigned 2, so x being greater than y is false)
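
The comparison described above can be written as runnable code. The name is_greater is chosen here only for illustration.

```r
x <- 1
y <- 2

# Store the result of the comparison in a variable.
is_greater <- x > y
is_greater          # FALSE
class(is_greater)   # "logical"

# The opposite comparison is TRUE.
x < y
```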

1.4 Character data types

Character data type stores characters or strings.

In R Studio, they have to be written within quotes; this material uses double quotes.
For example, the text learning would be written as "learning".

1.5 Factor data types

Factor data type refers to categorical types of data like gender or cities.

This data type will be covered in more detail in a separate tutorial.
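
As a brief preview, a factor can be built from a character vector with factor(). The gender values below are made-up sample data.

```r
# Made-up sample data with two categories.
gender <- factor(c("Male", "Female", "Female", "Male", "Female"))

class(gender)    # "factor"
levels(gender)   # "Female" "Male" (levels are sorted alphabetically)

# Count how many observations fall in each category.
table(gender)
```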

CREATING DATA TYPES IN R

How to create a numeric data type in R

In R Studio, let's create a numeric data type with a variable name of num1 and a
numeric value of 3.1 stored against it.

Enter these values using the code

num1<-3.1

and press Control + Enter to execute this statement.

The output is displayed in the Console area in blue indicating that the code has been
executed. Simultaneously, the values num1 and 3.1 are displayed in the Workspace
section.

In order to identify the data type of the variable num1, use a function called class.
Type the word class followed by the name of the variable in brackets as shown below.

class(num1)

and press Control + Enter to execute it.

In the Console area numeric is displayed indicating that the data type of num1 is
numeric.
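
The steps above, collected into one sketch. The last two lines go slightly beyond the text to show that a whole number typed without a suffix is still stored as numeric.

```r
num1 <- 3.1
class(num1)        # "numeric"
is.numeric(num1)   # TRUE

# A whole number without a suffix is numeric, not integer.
class(5)           # "numeric"
class(5L)          # "integer": the L suffix requests an integer
```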

How to create an integer data type in R

A numeric data type can be converted into an integer data type in R.

In the example used above, the number 3.1 when converted to an integer gives a
value of 3. To convert this numeric data type to an integer data type in R Studio, the
function as.integer(numeric variable) is used. Note that R is case sensitive, so the
function name must be typed entirely in lower case.

Let us create a new variable num3 and store the integer value against this variable.

Enter the code

num3 <- as.integer(num1)

and press Control + Enter. The values will be displayed in the Workspace section
when the code is executed.

In order to determine the data type of num3, the following function will be used

class(num3)

Press Control + Enter to execute this statement. The output in this case would be
integer as shown in the Console section.
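
Because R is case sensitive, the conversion function has to be typed in lower case as as.integer. The sketch below runs the conversion and uses exists() to confirm that no function with a capital I is defined in a standard R session.

```r
num1 <- 3.1
num3 <- as.integer(num1)
num3                    # 3
class(num3)             # "integer"

# Function names are case sensitive.
exists("as.integer")    # TRUE
exists("as.Integer")    # FALSE: calling as.Integer() would raise an error
```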

How to print the contents of a variable

To print the value of any variable like num1 or num3, enter the variable name, say

num3

and press Enter.

The value will be displayed in the Console like in the case of num3, where the value 3
is displayed.

Alternatively, use the code

print(num3) (print and the variable name within brackets)

and execute this.
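
Both ways of printing give the same result at the top level of a script. The difference shows up inside a loop, where only print() displays anything. A minimal sketch:

```r
num3 <- 3L

num3           # typing the name prints the value at the top level
print(num3)    # print() does the same thing explicitly

for (i in 1:3) {
  i              # nothing is displayed: names are not auto-printed in a loop
  print(i * 2)   # displayed on each pass
}
```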

How to create a character data type in R

To create a character data type in R, let us create a variable char1 and store a value
of "hello" against it.

The code to execute this is

char1 <- "hello" or
char1 = "hello"

Remember that assignment can also be written using the equal to sign.

When this code is executed, in the Workspace section a variable char1 has been
created and a value "hello" stored against it.

To find out the data type of this variable, use the class function discussed earlier.
Enter the code

class(char1)

and press Enter.

The value character is displayed in the Console area.

An important fact to remember about character data types is that these values are
always written within quotes. Double quotes are the usual choice, although single
quotes work as well. So anytime a value is entered within quotes, R will recognize it
as a character data type.
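
A short sketch of this rule is given here. The variable names char2 and num_as_text are hypothetical and are used only for illustration.

```r
char1 <- "hello"         # double quotes mark the value as character
char2 <- 'hello'         # hypothetical variable: single quotes work as well
num_as_text <- "3.1"     # hypothetical variable: quoted digits are character too

class(char1)              # "character"
class(num_as_text)        # "character"
identical(char1, char2)   # TRUE
```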

Logical and factor data types will be discussed in more depth in a later section.

SUMMARY

We have covered important data types in this tutorial. Understanding these data
types will help in managing and working with data in R.

To summarize:
- It is important to understand data types in order to determine what type of
  actions can be carried out with a specific type of data.

- Different data types are available: numeric, integer, character, logical and
  factor.

- Different data types can be created in R using the proper syntax. Eg, num1<-3.1,
  as.integer(3.1), char1<-"hello"

- The function class is used to determine the data type of a variable.

- The function print is used to print the contents of a variable.

PART 2 DATA STRUCTURES & VECTORS

INTRODUCTION

The data that you are working on needs to work for you. In other words, it has to be
arranged in a way that helps you manage, store and analyze it better.

This tutorial will deal with data structures.

So, here's what will be covered in this tutorial:

- Understanding data structures
- Vectors, a type of data structure
- Creating vectors in R

WHAT IS A DATA STRUCTURE

A data structure in simple terms is a way of storing and organizing data.

Let us understand this better with the help of an example. Shown here is a table with
different types of information stored in it.

When storing information of different types, it will need to be stored across more
than one variable. For example, if the data to be stored relates to employee records, then
the variables across which this data would be stored would be Name, Age, Address,
Nationality, Assessment scores and so on. This collection of information displayed
across different variables is referred to as a data structure.

WHAT IS THE DIFFERENCE BETWEEN DATA STRUCTURE AND DATA TYPE

A data structure is different from data type because of the number of values stored.

Let's look at this with the help of an example. If a variable Name has been created,
and a value Bob stored against it, it will result in the creation of a character data
type. In a data type only one value is stored.

But when other information related to Bob, apart from his name, is also stored, like
his age, address, nationality and assessment score, then it results in the creation of
a data structure. A data structure stores more than one value.

A simple way to look at a data structure is to think of an Excel sheet with rows and
columns where the columns are made up of different data types. In the example
used, the Name column will store character data types, the Age column will store
integer data types, the Score column will store numeric data types and so on.

TYPE OF DATA STRUCTURE: VECTORS

The first type of data structure that will be discussed is referred to as Vectors.

A Vector is like a column in an Excel sheet. Going back to the example used earlier,
Vectors would be Name, Age, Address, Nationality and so on.

In Vectors, all the elements within a Vector should be of the same data type.

Vectors cannot have a combination of data types!

So, if Age is a Vector, then all the elements under Age should be of the same data
type, say integer. This Vector cannot contain a value of any other data type, like
character, nor can it contain a combination of data types.

[Figure: two example columns, one labelled Vector and the other labelled Not a Vector]

A Vector is therefore a data structure which contains elements of the same data
type. Visualize a single column in an Excel sheet which contains values of the same
data type.

HOW TO CREATE A VECTOR IN R

In R Studio, a Vector can be created with the function c(), also referred to as
concatenate.

So, let's create a Vector called vector1, and store 4 values in it. This vector will
contain elements of the numeric data type.

To create this vector enter the code

vector1<-c(9,8,2,7)

and execute this code.

Two events will take place.

First, the Console will display vector1 with its corresponding values.

Second, in the Workspace section the variable vector1 will be displayed along with
the data type of its values - which is numeric - and the number of values, which is 4.

To print the contents of vector1, write the name of the Vector and press Control +
Enter. In the Console the values 9, 8, 2 and 7 will be displayed. The [1] shown at the
start of the output is the position (index) of the first value displayed on that line.
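
The steps above can be sketched in a few lines. This is a minimal example that uses the same values as vector1 in this tutorial.

```r
vector1 <- c(9, 8, 2, 7)   # c() combines the four values into one Vector

vector1           # prints: [1] 9 8 2 7
class(vector1)    # "numeric"
length(vector1)   # 4
```

Here class gives the data type of the values and length gives the number of values, which is the same information that is displayed in the Workspace section.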

MIXING UP DATA TYPES IN A VECTOR

Now let us look at something interesting. As discussed, a Vector can only contain
elements of the same data type. There can be no mixing of data types within a
Vector. So what happens if a second Vector is created and along with numeric data
types, a character data type is inserted into it?

Shown here is the code to create a new Vector called vector2 with some values.
Inserted into these values is a character value "bob".

When this code is executed, the output is displayed in the Console. But the syntax
used includes elements of different data types, whereas we know that vectors can
only store elements of the same data type. So why is no error being displayed?

When the contents of vector2 are printed, all values in the Vector are displayed in
the Console in quotes. This indicates that R has automatically converted all numeric
values in the Vector to the character data type, which is why all the numbers are
shown in quotes. This is why R does not display any error on executing this code!

R applies the rule of a common data type and converts all the values to a single data
type that can hold every one of them.
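
A minimal sketch of this behaviour is given here. The Vector mixed_vector and its values are assumptions made for illustration; they are not the values of vector2 used above.

```r
# Hypothetical Vector: three numbers and one character value
mixed_vector <- c(1, 2, 3, "bob")

mixed_vector          # "1"   "2"   "3"   "bob"
class(mixed_vector)   # "character"

# Converting back gives numbers again, but "bob" cannot be converted and
# becomes NA. suppressWarnings() hides the warning R would otherwise print.
back_to_numbers <- suppressWarnings(as.numeric(mixed_vector))
back_to_numbers       # 1 2 3 NA
```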

REPLACING THE CONTENTS OF A VECTOR

Values in a Vector can also be overwritten. So one data type can always be replaced
with another data type within the same Vector.

In the example we looked at earlier, vector2 contains 11 values all of which are
character data types. Suppose we want to replace these 11 values with 4 values of
numeric data type. These 4 values are 1, 2, 3 and 8.

Let us enter the code

vector2<-c(1,2,3,8)

and press Control + Enter.

In the Workspace the data type of vector2 has now changed to numeric and it has 4
values stored against it.

ARITHMETIC FUNCTIONS BETWEEN VECTORS

It is also possible to carry out arithmetic functions between Vectors like addition,
subtraction, multiplication and division. The only pre-requisite for getting the
expected results is that the Vectors should be of equal length.

As you can see in the Workspace, both vector1 and vector2 are of numeric data type
and have 4 values each, which means they are both of the same length.

It is possible to carry out any type of arithmetic function on these 2 vectors such as
vector1 + vector2 or vector1 - vector2 and so on.

Let us enter the code

vector1 + vector2

and press Control + Enter.

The output is displayed in the Console.

Let's cross check these values. vector2 comprises the values 1, 2, 3 and 8. To check
the values of vector1, enter vector1 and press Control + Enter. The values displayed
in the Console are 9, 8, 2 and 7.

So, when the statement is executed addition is carried out by adding element 1 of
vector 1 to element 1 of vector 2, element 2 of vector 1 to element 2 of vector 2 and
so on.

So, when 9, the first element of vector 1 is added to 1, the first element of vector 2
the result is 10 which is shown in the Console.

You can cross check the rest of the results as well!

In this example the vectors were both of equal length. Let's look at what happens in
the event the vectors are of unequal length.

vector1 has 4 elements. Let us add this to a new vector, c(1,2,3), which has 3
elements: 1, 2 and 3.

Let us enter the code

vector1 + c(1,2,3)

and press Control + Enter.

On executing this code, a warning message is displayed in the Console but the
addition function has still been executed. How?

The first three elements of vector1 have been added to the three elements of the
shorter vector. As the shorter vector has no fourth element, R goes back to its
beginning and reuses its first element, so the fourth element of vector1, which is 7,
is added to 1. The output is therefore 10, 10, 5 and 8. The warning is displayed
because the length of the longer vector is not a multiple of the length of the shorter
one. Therefore, for accurate results it is better to add vectors of the same length.
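
The addition of vectors of unequal length can be sketched as follows. The name short_vector is hypothetical, and suppressWarnings is used here only so that the sketch runs without printing the warning.

```r
vector1 <- c(9, 8, 2, 7)
short_vector <- c(1, 2, 3)   # hypothetical name for the 3-element vector

# R reuses the shorter vector from its start: 9+1, 8+2, 2+3 and then 7+1
result <- suppressWarnings(vector1 + short_vector)
result   # 10 10 5 8

# Comparing the lengths beforehand is a simple way to avoid surprises
length(vector1) == length(short_vector)   # FALSE
```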

IDENTIFYING ELEMENTS IN A VECTOR

Another interesting feature in Vectors is referred to as indexing. This feature allows a
particular element in a Vector to be accessed.

For example, we know that vector1 contains 4 elements: 9, 8, 2 and 7. Let us suppose that
we want to find out the third element in vector1, which is 2.

Let us enter the code

vector1[3]

and press Control + Enter.

Entering 3 indicates that we want to access the third element of vector1.

We can see a value of 2 displayed in the Console which, as we know, is the third
element in vector1.

So to index a Vector, next to the name of the Vector enter within square brackets the
number of the element that needs to be accessed. Eg, vector1[3]

Indexing helps in identifying values in a vector based on their position.

REPLACING CONTENTS IN A VECTOR

Now, let us suppose that we want to create a new Vector called new_vector. In this
new Vector we want to populate the same elements as vector 1 but without the
second element. So in new_vector we only want to store the first, third and fourth
elements of vector 1.

Let us enter the code

new_vector<-vector1[-2]

and press Control + Enter.

Entering minus next to 2 indicates that we want to exclude the second element of
vector 1 in new_vector.

When the code is executed we can see in the Workspace section that the vector
new_vector has been created with three values of numeric data type.

To view the contents of new_vector, enter the name of the vector and press Control
+ Enter.

In the Console, 9, 2 and 7 are displayed. 8 is not displayed as it is the second element
in vector1 and hence has been excluded.

If a Vector has only three elements but a value of 10 is entered in square
brackets, then it means that we are trying to index an element beyond those that are
actually present in the Vector. This situation is referred to as an Index out of
Boundary.

If there are only 3 elements in a Vector, then how can you locate the 10th element?!
Hence the term Index out of Boundary. In R, such an index does not stop the code with
an error; the value NA (not available) is displayed instead.
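
The different ways of indexing by position can be sketched as follows, using the values of vector1 from this tutorial. The selection of several positions with c() is an extra illustration.

```r
vector1 <- c(9, 8, 2, 7)

vector1[3]          # 2, the third element
vector1[-2]         # 9 2 7, everything except the second element
vector1[c(1, 4)]    # 9 7, the first and the fourth elements

# A position beyond the end of the Vector gives NA rather than an error
vector1[10]         # NA
```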

USING A FUNCTION TO INDEX

Indexing in Vectors can also be done with the help of logical functions. Here's how.

Let us create a new Vector called vector2 with 4 elements in it: 1, 2, 3 and 4.
Enter the code

vector2=c(1,2,3,4)

and press Control + Enter.

We already have a Vector called vector1, which has the elements 9, 8, 2 and 7.

Let us now use a logical function to find the third element in vector1.

Enter the logical function

vector1[vector2==3]

By entering vector2==3, we are trying to locate the position of the value 3 in
vector2. The value 3 is the third element in vector2. So, when the code is executed,
the third element in vector1 should be displayed in the Console. Since the third
element in vector1 is 2, we should be able to see this number in the Console.

Press Control + Enter. On executing this code we can see in the Console the value 2.

Using the position of 3 in vector2, the logical function finds the element at the
equivalent position in vector1.
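
To see what happens inside the square brackets, the logical function can be split into two steps. This is a minimal sketch; the name positions is hypothetical.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 4)

# Step 1: the comparison gives one TRUE or FALSE per element of vector2
positions <- vector2 == 3
positions            # FALSE FALSE  TRUE FALSE

# Step 2: only the elements of vector1 where positions is TRUE are kept
vector1[positions]   # 2

# The same idea works with a condition on vector1 itself
vector1[vector1 > 5] # 9 8 7
```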

SPEEDING UP THE TASK WITH OPERATORS

Operators help in executing certain types of tasks quickly and more efficiently. Let us
understand this better with the help of an example.

Let us assume that a Vector called age needs to be created which needs to store the
first 100 natural numbers i.e., numbers from 1 to 100. One way to execute this is to
write the code age<-c(1,2,3..) and so on, mentioning all numbers till 100. This
obviously is not a feasible option. Sometimes numbers could run till 100, at other
times till even 1000!

In these types of situations, a good option would be to use Operators. Here are a few
common Operators that are used in R.

Colon Operator

The Colon Operator can be used to create Vectors like the Age Vector quite easily by
using the code

age<-1:100

To execute this code press Control + Enter.

To view the contents of the Vector, enter age and press Control + Enter.
In the Console values from 1 to 100 are displayed.

In a Colon Operator the value before the colon is the first value in the series and the
value after the colon is the last value in the series. So, when 500:505 is entered, the
series will begin from 500, then move to 501, 502, 503, 504 and end with 505.

Colon operators remove the necessity of writing a long series of numbers!

Sequence Operator

Let us suppose that a Vector is to be created with some numbers which are not
continuous but follow some sort of order. An example would be 1,3,5,7,9 and so
on. To create this Vector, the Sequence Operator, which is the function seq, can be
used.

Let's create a Vector called age and populate it with the values 1,3,5,7,9 and
so on till 101. To do this, enter the code

age<-seq(1,101,2)

and press Control + Enter.

To view the contents of this Vector, enter age and press Control + Enter.
In the Console the entire series is displayed.

In the code entered, 1 represents the start point, 101 represents the end point and 2
represents how the numbers should increment.

Sequence operators can populate vectors with data that follow a logical sequence.

So, now we know that vectors can be created in 3 different ways: firstly through c or
concatenate, secondly with a colon operator and lastly with a sequence operator.
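
The two operators can be compared side by side in a short sketch. The names age_all and age_odd are hypothetical and are used so that one Vector does not overwrite the other.

```r
age_all <- 1:100            # colon operator: 1, 2, 3 and so on till 100
age_odd <- seq(1, 101, 2)   # seq(start, end, increment): 1, 3, 5 and so on till 101

length(age_all)   # 100
length(age_odd)   # 51
head(age_odd)     # 1 3 5 7 9 11
500:505           # 500 501 502 503 504 505
```

The function head displays only the first few values, which is handy when a Vector is too long to read in full.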

SUMMARY

In this tutorial we have looked at the importance of having a proper structure to
store and organize one's data. One such structure is called a Vector.

To summarize:
- Data structures are needed to store and organize data.

- A data structure comprises different data types like characters, numbers,
  integers and so on which are displayed in the data structure as variables like
  age or name.

- A Vector is a data structure which can store values of a single data type like
  only characters or only numbers or only integers.

- Vectors can be created in three ways - using concatenate, Colon Operator or
  Sequence Operator.

- Arithmetic functions like addition, subtraction and so on can be carried out
  between Vectors, which for accurate results should be of equal length.

- Indexing is a feature which can locate a particular element within a Vector.

- It is also possible to create a new Vector by copying or modifying the contents
  of another Vector.

PART 3 DATAFRAMES

INTRODUCTION

Vectors, the first type of data structure that was looked at, are actually quite closely
linked to the next type of data structure that will be discussed.

This tutorial will deal with Dataframes.

So, here's what will be covered in this tutorial:

- Understanding Dataframes
- Creating Dataframes in R
- Different functions related to Dataframes

WHAT IS A DATAFRAME

Shown here is a table with columns like Name, Age and Score.

Each column is in fact a Vector. So, Name constitutes one Vector, Age another and
Score another. So, a Dataframe is nothing but a collection of Vectors of equal length.

CREATING A DATAFRAME IN R

Here is a table with some data. This data needs to be converted into a Dataframe
called Records.

The table has 4 columns which individually become 4 Vectors in the Dataframe.

So, the first step in creating the Dataframe is to create the four Vectors.

To create the four Vectors in R Studio, viz, Name, Gender, Age and Income, enter the
code

Name<- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender<- c("M", "M", "F", "M", "M", "F")
Age<- c(20,21,24,26,26,23)
Income<- c(20000,30000,35000,40000,41000,50000)

Select the code entered and press Control + Enter.

The next step is to actually create a Dataframe around this called Records. Enter the
code

Records<- data.frame(Name, Gender, Age,Income)

and press Control + Enter.

The order in which the Vectors are entered is important. If Name is entered first, it
will be the first Vector displayed in the Dataframe. Likewise if Gender is entered first
it will be the first Vector displayed in the Dataframe.

On executing this statement, the Console shows that the code has been executed.

In Workspace the name of the Dataframe Records is displayed together with its
Vectors, their data types and the number of values in each.

When we double click on Records we can see the entire Dataframe displayed.
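
The two steps described above can be put together in one sketch. It assumes that the names and genders are entered within quotes, as character values. The functions dim and str are shown as one possible way of checking the result from the Console.

```r
# Step 1: create the four Vectors, all of length 6
Name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age    <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# Step 2: combine them into a Dataframe, in the order the columns should appear
Records <- data.frame(Name, Gender, Age, Income)

dim(Records)     # 6 4, that is 6 rows and 4 columns
names(Records)   # "Name" "Gender" "Age" "Income"
str(Records)     # each column with its data type and first few values
```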

FUNCTIONS THAT CAN BE CARRIED OUT WITH DATAFRAMES

How to print the names of variables

To find out the names of the variables in a Dataframe, use the function names with the
name of the Dataframe, whose variables need to be determined, within brackets.

So, to find out the names of the variables in the Dataframe Records, enter the code

names(Records)

and press Control + Enter.

In the Console all the variables of the Dataframe Records are displayed - which are
Name, Gender, Age and Income.

This is a useful function when working with a Dataframe that contains a large number
of variables.

In this tutorial, a simple Dataframe with just four variables has been created. There
could be a situation where a large Excel table with lots of variables is imported and
the names of the variables used in this table need to be determined.

How to find a particular element

To find a particular element in a Dataframe, a function called Indexing is used. When
the Dataframe Records is opened from the Workspace section, we can see that it has
4 columns with 6 rows.

In the Workspace section, 6 observations indicates 6 rows and 4 variables means 4
columns.

Let us assume that we want to find out the gender of a particular person, say Gopal,
who is listed in the table. The value Gopal can be found in the second row and
Gender is the second column in the Dataframe Records.

Enter the code

Records[2,2]

and press Control + Enter.

In the code, the first 2 indicates the second row in the Dataframe where the value
Gopal is displayed. The second 2 indicates the column Gender.

When this code is executed, M is displayed in the Console indicating that the gender
of Gopal is male.

How to view the elements in a row

It is also possible to view all the elements in any row of a Dataframe. For example, to
view the elements of only the first row in the Dataframe Records, enter the code

Records[1, ]

and press Control + Enter.

The space left after the comma indicates that all the elements in the row need to be
fetched. When this code is executed, all the elements of row 1 of Records are
displayed in the Console.

To view the elements of more than one row, say, for example, the first 4 rows of the
Dataframe Records, enter the code

Records[c(1:4),]

and press Control + Enter.

On executing this code the elements of the 4 rows are displayed in the Console.

Given below is also the code to access only rows 3 and 4, and the resulting output in
the Console.
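
The row selections discussed in this section can be sketched as follows. The Dataframe is rebuilt here so that the example runs on its own. Records[3:4, ] is one way of accessing only rows 3 and 4, and the selection with c(1, 6) is an extra illustration that is not part of the original example.

```r
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

Records[2, 2]        # row 2, column 2: the gender of Gopal
Records[1, ]         # all the columns of row 1
Records[3:4, ]       # only rows 3 and 4
Records[c(1, 6), ]   # rows that are not next to each other, listed with c()
```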

How to view the elements in a column

There are three ways to find out the contents of a particular column in a Dataframe.
Let us look at each of these with the help of an example.

Let us find out the contents of the column Name in the Dataframe Records.

1. Enter the code

Records$Name

and press Control + Enter.

On executing this code we can see all the values under the Name column displayed.

2. Enter the code

Records[,"Name"]

and press Control + Enter.

Here the first field is empty because it relates to rows and the second field is the
name of the column whose contents are to be retrieved.

On executing this code the contents of the column are displayed in the Console.

3. Enter the code

Records[,1]

and press Control + Enter.

This method works if the column number is known beforehand.

In this case we know that Name is the first column in Records. On executing this
code, the contents of Name are displayed in the Console.

Of the three ways to find the values in a column, two work with the name or the label
of the column and the third requires the number of the column.
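
The three ways can be checked against each other in a short sketch. The Dataframe is rebuilt here so that the example runs on its own, and the names by_dollar, by_name and by_position are hypothetical.

```r
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

by_dollar   <- Records$Name       # 1. Dataframe, dollar sign, column name
by_name     <- Records[, "Name"]  # 2. column name in quotes after the comma
by_position <- Records[, 1]       # 3. column number after the comma

identical(by_dollar, by_name)       # TRUE
identical(by_dollar, by_position)   # TRUE
```

Working with the column name is usually the safer choice, as the code keeps working even if the order of the columns changes.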

How to add columns to a Dataframe

To add a column to a Dataframe, the only pre-requisite is that the new column to be
added should be the same size as the other columns in the Dataframe.

In the Dataframe Records, there are 6 rows and 4 columns. To add a fifth column to
this Dataframe enter the code

Records$newc<-(100:106)

and press Control + Enter.


Entering 100:106 generates the values 100, 101, 102, 103, 104, 105 and 106, that is,
7 values.

When this code is executed, an error is displayed in the Console as 100:106 adds up
to seven rows and not six.

So, the code needs to be changed as follows

Records$newc<-(100:105)

and press Control + Enter.

On executing this code, the number of columns in Records shown in the Workspace is
now 5.

Also, when Records is opened, the column newc is displayed with values from 100 to
105.

How to remove a column from a Dataframe

It is also possible to remove a column from a Dataframe. For example, to remove the
column newc from the Dataframe Records enter the code

Records$newc<-NULL

and press Control + Enter.

When this code is executed the data in the Workspace is updated to show only 4
columns.

When Records is opened, the column newc is no longer visible.
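
Adding and removing a column can be sketched together as follows. The Dataframe is rebuilt here so that the example runs on its own, and the column is called newc. The use of tryCatch is only an illustration of how the error for a column of the wrong size can be observed without stopping the code.

```r
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

Records$newc <- 100:105   # 6 values for 6 rows, so the column is added
ncol(Records)             # 5

Records$newc <- NULL      # assigning NULL removes the column again
ncol(Records)             # 4

# 100:106 has 7 values for 6 rows, so R refuses to add the column
outcome <- tryCatch({
  Records$newc <- 100:106
  "added"
}, error = function(e) "error")
outcome                   # "error"
```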

SUMMARY

In this tutorial we have looked at another important data structure called Dataframe.

To summarize:

- A Dataframe is a data structure made up of vectors of equal length.

- To create a Dataframe in R, first vectors need to be created.

- There are various types of functions that can be carried out with Dataframes.
  These are:
  a. Printing the names of the variables in a Dataframe
  b. Indexing or locating specific values in a Dataframe
  c. Finding out the values of a row in a Dataframe
  d. Finding out the values of a column in a Dataframe
  e. Adding a column with values to a Dataframe
  f. Removing a column from a Dataframe

PART 4 LIST & MATRIX

INTRODUCTION

Data structures can also be in the form of a list or a matrix.

This tutorial will deal with two more types of data structures: List and Matrix.

So, here's what will be covered in this tutorial:

- Understanding Lists
- Understanding a Matrix
- Creating Lists in R
- Creating a Matrix in R
- Different functions related to List and Matrix

WHAT IS A LIST

Just like a Dataframe, a List is also made up of Vectors. But unlike the Dataframe, the
Vectors in a List can be of equal or unequal length.

However, each Vector in a List should itself comprise elements of the same data type.

For example, in the table below, n, s and b are 3 Vectors of different data types:
numeric, character and logical. They are not all of the same length, as n has only 3
values while s and b have 5 each. A combination of these Vectors can make up a List.

n    s     b
2    aa    TRUE
3    bb    FALSE
5    cc    TRUE
     dd    FALSE
     ee    FALSE

WHAT IS A MATRIX

A Matrix is a collection of data elements arranged in a 2 dimensional rectangular
layout.

To create a Matrix in R of 10 elements arranged in 5 columns and 2 rows, the syntax
to be used is shown below. In this example, a matrix called my.matrix will be
created.

So, enter the code

my.matrix<-matrix(c(1:10), ncol=5, nrow=2)

and press Control + Enter.

The result is shown below.

Unlike a Dataframe where each column stores different elements like Name or Age,
in a Matrix all the columns need to have the same type of elements - either only
numbers or only characters and so on.

A Matrix cannot have one column with character data types, one column with
integers and so on.

CREATING A LIST IN R

A List is created in R in two steps.

1. Creating the Vectors for the List

To understand how to create a List in R, the table shown earlier will be converted into
a List.

In that table, column n has only numeric data, s has only character data and b
contains logical data (only TRUE or FALSE). Each of these columns is a Vector.

To create the Vectors, enter the code

n = c(2,3,5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)

and press Control + Enter.

The output is displayed in the Console.

In the Workspace section, each Vector n, s and b is displayed together with its data
type and the number of values in it.

2. Creating a List around the Vectors

To create a List X with the three Vectors n, s and b and a fourth element, the number
3, enter the code

X = list(n,s,b,3)

and press Control + Enter.

On executing this statement, in the Workspace section X with a value List against it is
visible. It also indicates that the List has 4 elements.

To view the contents of the List, enter the name of the List and press Control + Enter.
The values in X will be displayed in the Console.
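
Creating the List and looking inside it can be sketched as follows. The use of double square brackets to pick out one element of a List is an addition to what is described above.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

X <- list(n, s, b, 3)   # three Vectors of different lengths plus the number 3

length(X)       # 4 elements in the List
X[[2]]          # the second element, which is the Vector s
X[[2]][1]       # "aa", the first value within the second element
class(X[[3]])   # "logical"
```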

CREATING A MATRIX IN R

Let us create a matrix in R, called my.matrix with 5 columns and 2 rows. This Matrix
needs to store 10 elements.

To create this Matrix, enter the code

my.matrix<-matrix(c(1:10), ncol=5, nrow=2)

and press Control + Enter.

Here, the first argument supplies the elements to be stored in the Matrix (the numbers
1 to 10), the second argument relates to the number of columns in the Matrix and the
third argument relates to the number of rows in the Matrix.

On executing this code the workspace indicates that a Matrix has been created.

On double clicking the name of the Matrix, a 2x5 matrix with 10 elements is visible.
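
A short sketch of how the 10 elements are placed in the Matrix is given here. By default R fills a Matrix column by column. The second Matrix, by.row, is a hypothetical addition which uses the byrow argument of matrix to fill row by row instead.

```r
my.matrix <- matrix(c(1:10), ncol = 5, nrow = 2)

dim(my.matrix)    # 2 5, that is 2 rows and 5 columns
my.matrix[1, ]    # 1 3 5 7 9, the first row
my.matrix[2, 3]   # 6, the element in row 2 and column 3

# Hypothetical variant: the same elements filled row by row
by.row <- matrix(c(1:10), ncol = 5, nrow = 2, byrow = TRUE)
by.row[1, ]       # 1 2 3 4 5
by.row[2, 3]      # 8
```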

CREATING A MATRIX OUT OF A DATAFRAME

A Dataframe can also be converted into a Matrix. Let us understand this with the help
of an example. First the Dataframe needs to be created, with some sample elements.
To create a Dataframe called data_frame, enter the code

data_frame<-data.frame(a=c("1","2","3"), b=c(1,2,3))

and press Control + Enter.

To convert this Dataframe to a Matrix (let us call it next.matrix), use the function

next.matrix<-as.matrix(data_frame)

and press Control + Enter.

The output is displayed in the Console and the details of the Matrix in the Workspace
section. On opening the Matrix, a matrix with 3 rows and 2 columns is displayed. The
column names a and b are carried over from the Dataframe.

Now let us find out the data type of the second column in the Dataframe data_frame.
The data type of the second column is numeric, and this can be confirmed by using
the code

class(data_frame$b)

On executing this statement, numeric is displayed in the Console.

Now let us find out the data type of the second column in the Matrix next.matrix.
The second column in next.matrix is b. To find out the data type of b, enter the code

class(next.matrix[,2])

and press Control + Enter.

In the Console, character is displayed.

So, if the same elements in the Dataframe were used to create the Matrix, why does
the data type of the column differ? Column a or the first column in the dataframe
that was used to create next.matrix has elements of the character data type. So as a
Matrix needs to have elements of the same data type, every element in the Matrix
including the elements in the second column b have been converted to character
data type. This is why the data type of the second column of the Matrix is character.

This underlines the key difference between a Dataframe and a Matrix i.e., all the
elements in a Matrix need to be of the same data type.
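
The conversion can be sketched as follows. The sketch assumes that the values of column a are entered within quotes, which makes them character values. The second Dataframe, numeric_frame, is a hypothetical addition to show that no conversion to character takes place when all columns are numeric.

```r
# Assumption: column a holds character values, column b holds numeric values
data_frame <- data.frame(a = c("1", "2", "3"), b = c(1, 2, 3))
class(data_frame$b)        # "numeric"

next.matrix <- as.matrix(data_frame)
dim(next.matrix)           # 3 2
colnames(next.matrix)      # "a" "b"
class(next.matrix[, 2])    # "character"

# Hypothetical Dataframe with numeric columns only
numeric_frame <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
class(as.matrix(numeric_frame)[, 2])   # "numeric"
```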

SUMMARY

In this tutorial we have covered two more data structures: List and Matrix.

To summarize:

- A List is a combination of vectors, either of equal or unequal length.

- A Matrix is a collection of data elements, where all the elements need to be of
  the same data type.

- A List is created in R in two steps. The first is generating the Vectors for the
  List individually. The second is combining the Vectors into a consolidated List.

- A Matrix can be created using code where the elements and the number of columns
  and rows are specified.

- A Dataframe can also be converted to a Matrix. If the Dataframe has different
  data types, only one data type will be stored when converted to a Matrix.

PART 5 FACTORS

INTRODUCTION

An important data type used in data structures is the factor. Factor, as already
mentioned, refers to data of a categorical nature.

So, here's what will be covered in this tutorial:

- Understanding factors
- Creating a factor in R

WHAT IS A FACTOR

In R, let us assume that a Vector called fac_list has been created with names of cities
like city1, city2, city3 etc.

fac_list<-c("city1", "city2", "city3", "city4", "city2")

The names of these cities are categories in themselves. So each city, which is
originally a character data type, can be converted into a factor or a separate
category in R.

Let us take another example. In a Vector like gender, there are invariably two values,
male and female, each of which is a category in its own right.

So, the utility of the factor data type is to convert values into categories.

Vectors are the base from which factors are generated.

HOW TO CREATE A FACTOR IN R
To use factors to create categories out of values, let us assume that the values in the
Vector fac_list are to be converted to categories or factors.

First create the Vector fac_list using the code mentioned above.

Now enter the code

fact1<-as.factor(fac_list)

and press Control + Enter.

In the Workspace section, fact1 is now visible as a factor, along with its number of
levels (the distinct categories).

To find out the data type of fact1, use the code

class(fact1)

and press Control + Enter.

On executing this code, factor is displayed in the console area.

To view the values in fact1, enter the code

summary(fact1)

and press Control + Enter.

The Console shows each of the values in fac_list displayed as a category.

The value under each category indicates the number of times it appears in the Vector
fac_list.
For example, if city1 appears only once in the Vector, the value under it is 1, and if
city2 appears twice, the value under it is 2. Likewise the number of times the other
categories appear is also indicated.
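The steps above can be tried end to end with a short sketch. The city names here are made up for illustration, and city2 is repeated on purpose so that the counts differ.

```r
# Hypothetical city names; "city2" is deliberately entered twice.
fac_list <- c("city1", "city2", "city2", "city3", "city4")

# Convert the character Vector into a factor.
fact1 <- as.factor(fac_list)

class(fact1)     # data type of fact1
levels(fact1)    # the distinct categories
summary(fact1)   # number of times each category appears
```

The levels() function is not used in the text above; it is included only because it lists the categories without the counts.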

SUMMARY
In this tutorial the factor data type was explained in some detail.

To summarize:

Factors are a data type which converts values into categories. For example, names
of cities become categories of city, and male and female become categories of gender.

In R, factors can be created out of the values in a Vector.

It is also possible to view the number of times a category appears in a Vector.

Section 3
Data Handling

PART 1 PACKAGES

INTRODUCTION
One section of the R Studio GUI is dedicated to Packages. They allow many
important functions to be carried out.

So, here is what will be covered in this tutorial:

- Understanding packages

- Installing and loading packages

- Importing data into R

- Exporting data from R

WHAT IS A PACKAGE
Packages are collections of R functions, data and compiled code put together in a
well defined format. They can be thought of as prepared routines that are available
in R.

Packages are like a bundle of everything that is needed to carry out a specific
function in R.

Let us understand the importance of packages through an example.

Suppose we want to carry out a linear regression in R to create a linear model. One
way to do this is to write all the logic and code to carry out a linear regression and
then execute it. Another way is to access a ready-made linear regression function from
a package, pass your data through it and execute it. A bundle of such pre-made
functions is what is referred to as a package in R.

As we have already discussed, R has a huge community of contributors. These
contributors create pre-made functions and bundle them into packages which can then be
used by all users of R. So, if a user needs to forecast something in R, all they need to
do is look for a forecasting package in R and use it. Packages make R a much more
user friendly tool.

By using the right package in R, one can save time and effort in carrying out a
particular function.

INSTALLING AND LOADING A PACKAGE


When talking about packages there are two common terminologies that are used.
The first is installing a package and the second is loading a package.
To understand these terminologies, let us look at an example. Using this example will
not just show us how to use a package, but also demonstrate how to import data into
R.

Let us assume that an Excel file called Excel_import, stored on one of the drives of
the system being used, is to be imported into R. In R the function to import an Excel
file is read.xlsx.

But if we were to execute this code straight away, it would not work. This is because
read.xlsx is a function present within a package called xlsx, and it will only work if
this package is installed and loaded. So certain functions in R are linked to packages
and will only work if those packages are installed in R.

To execute certain functions, it is important to install and load a package in R.

Installing a Package

Let us now look at how to install a package. The option Packages is available in R
Studio on the right hand side.

Click on Packages, then Install packages.

In the field Packages enter the name of the package that needs to be installed. In
the example being used, the package to be installed is called xlsx. So, enter xlsx.

Make sure that when installing a package like xlsx you are connected to the internet,
as R will need to download the package from a server. In the case of xlsx it will be
downloaded from the package repository (CRAN).

After entering the name of the package to be installed, click on the Install button.

Once the package has been installed, look for it by entering its name in the search
field in the Packages section. If it appears in the search results, we know that the
package has been installed.

To find out if a package is already installed in R, look to see if it comes up in a Search.

Loading a package

Installing a package adds it to your system, but after that the package needs to be
loaded. Loading makes the functions in the package available in the current R session.
To load a package in R, the code that is used is library followed by the name
of the package within brackets. So, enter the code

library(xlsx)

and press Control + Enter.

In the console the text in red indicates that the package has been loaded in R.
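The install and load steps can also be written as code rather than carried out through the menu. The sketch below is one possible way of doing so; the install line is left commented out so that nothing is downloaded unless you choose to, and require() is used here (instead of library()) only because it reports success or failure as TRUE or FALSE.

```r
pkg <- "xlsx"   # the package used in this tutorial

# Step 1: install (once per system, needs an internet connection).
# Remove the leading # from the next line if the package is missing:
# install.packages(pkg)

# Step 2: load (once per R session). require() returns TRUE or FALSE
# instead of stopping with an error when the package cannot be loaded.
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

if (loaded) {
  message(pkg, " is loaded and its functions can now be used.")
} else {
  message(pkg, " is not available yet; install it first.")
}
```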

IMPORTING AN EXCEL FILE INTO R

Now let us import an Excel file into R, as the package to import the file is installed and
loaded in R.

To do this the code to import the Excel file needs to be entered. A breakdown of this
code is mentioned here:

read.xlsx(file="file path.xlsx", sheetIndex=1)

- file path.xlsx is the file path or the location of the file to be imported
- sheetIndex=1 indicates that only the first sheet in the Excel file needs to be
imported (So, if 2 is entered instead of 1, then the second sheet will be
imported from the Excel file)

The path of the file to be imported can be found under Properties.

Let us assume that the name of the sample Excel file to be imported is Excel_import.
To find out the file path of this Excel file, right click on the file and look under
Properties.

In the space left for the file path, paste the file path of the Excel file within double
quotes. When pasting or writing a Windows file path, make sure that each back slash is
entered twice, as a single back slash has a special meaning inside quotes in R.

After the file path enter the name of the file, which is Excel_import. Then the sheet
to be imported needs to be mentioned. We can either enter 1 or sheetIndex=1.

The imported Excel sheet needs to be stored in a Dataframe in R. So we will create a
new Dataframe data1. Let us execute this code by pressing Control + Enter.

In the Console the presence of the red dot means that the code is still being executed.

In the Workspace section, a new Dataframe data1 is created which has 99
observations or rows and 4 variables or columns. On opening data1 we can see the
data that has been imported. Check to see if the correct data has been imported by
opening the Excel file that has been imported.
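A minimal sketch of the import is given here. The folder C:\data is a hypothetical location chosen only for illustration, so replace it with the path shown under Properties on your own system. The import runs only if the xlsx package can be loaded and the file exists, which keeps the sketch safe to execute anywhere.

```r
# Hypothetical Windows location; two ways of writing the same path in R.
path_backslash <- "C:\\data\\Excel_import.xlsx"   # each back slash entered twice
path_forward   <- "C:/data/Excel_import.xlsx"     # forward slashes also work

# Import the first sheet into the Dataframe data1 (only if package and file are present).
if (requireNamespace("xlsx", quietly = TRUE) && file.exists(path_forward)) {
  data1 <- xlsx::read.xlsx(file = path_forward, sheetIndex = 1)
  str(data1)   # shows the number of observations and variables
}
```

Writing xlsx::read.xlsx calls the function straight from the package; after library(xlsx) has been run, read.xlsx on its own works in the same way.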

IMPORTING A CSV FILE INTO R

To import a comma separated value file or a CSV file, the code read.csv is to be used.
Importing a CSV file does not require any package to be installed as this function is
built into R.

The code to import a CSV file is shown here:

read.csv(file="file path.csv", sep=",")

- file path.csv is the file path or the location of the file to be imported
- sep="," indicates that the file to be imported is a comma separated value
file

Assume that a CSV file called CSV_Import is to be imported. Copy the file path of this
file which can be found under Properties.

In R, enter the code to import the file by first entering the name of the Dataframe
where the imported file will be stored, which is data2. Then enter the code read.csv
followed by the file path which has been copied earlier. Then the name of the file to
be imported is entered, which is CSV_Import, followed by the file type, which is .csv.

Remember to enter each back slash in the file path twice, just as we did in the case of
the Excel file import. The last part in the code is the separator which is a comma.

Press Control + Enter to import the file.

In Workspace, data2 has been created with 99 observations and 4 variables.

An easier way to import a CSV file is to create a Dataframe (like a) and use the code:

a<-read.csv(file.choose())

file.choose() opens a file selection window so that the file can be picked from a menu
on the system. This option is menu driven and removes the need to copy and paste the
file path into the code.

After pressing Control + Enter, in the Select File window which appears look for the
CSV file to be imported (which in our case is CSV_Import).

On selection, a Dataframe a is created in the Workspace which has the same
observations and variables as the Dataframe created earlier.
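To try read.csv without needing the sample file, the sketch below first writes a tiny CSV file to a temporary location and then imports it. The names and ages in it are made up for illustration.

```r
# Create a small, made-up CSV file in a temporary folder.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Asha,31", "Vikram,28"), csv_path)

# Import it; sep = "," is already the default for read.csv but is spelled out here.
data2 <- read.csv(file = csv_path, sep = ",")

nrow(data2)   # number of observations
ncol(data2)   # number of variables

# Menu-driven alternative (works only in an interactive session):
# a <- read.csv(file.choose())
```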

SUMMARY
In this tutorial the utility of Packages was explained in some detail. It also covered the
importing of Excel and CSV data into R.

To summarize:

Packages are a bundle of pre defined functions. They help in executing certain
processes in R with ease.

Certain functions in R are only available through packages.

To use a package in R, first install it and then load it.

To import an Excel file in R, install and load the package xlsx.

To import a CSV file in R, use the code read.csv.

PART 2 EXPORTING AND READING DATA IN R

INTRODUCTION

Just as data can be imported into R, whether in Excel or CSV format, it can also be
exported from R.

So, here is what will be covered in this tutorial:

- Exporting data from R to Excel

- Exporting data from R to CSV

- Reading a file in R

EXPORTING TO EXCEL

In an earlier section, a Dataframe called a was created when a CSV file
was imported into R. Let us assume that the contents of this Dataframe will now be
exported to an Excel sheet.

To do this, enter the following code (write.xlsx belongs to the xlsx package, so the
package must be loaded first):

write.xlsx(data, file="file path")

So, if the contents of Dataframe a are to be exported to a sample Excel file called
abc, a is entered in place of data and the file path ends with the file name abc.

Let us deconstruct this code.

The data to be exported is specified as a and the location to which it is to be exported
is mentioned after file. Also mentioned is the name of the Excel file where the
data is to be stored, which in this case is abc.

When this code is executed, the data is exported to the location specified. You can
always check this by going to the location where the Excel file is saved, and checking
its contents.
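A possible version of the export is sketched here. The Dataframe a is a made-up stand-in for the one created earlier, and a temporary folder is assumed as the destination so that the code is safe to run; in practice the file path would point to a folder of your choice. The export is skipped if the xlsx package cannot be loaded.

```r
# Stand-in for the Dataframe a created earlier (made-up values).
a <- data.frame(Name = c("Asha", "Vikram"), Age = c(31, 28))

# Assumed destination: a file called abc.xlsx in R's temporary folder.
out_path <- file.path(tempdir(), "abc.xlsx")

if (requireNamespace("xlsx", quietly = TRUE)) {
  # row.names = FALSE leaves out the extra column of row numbers.
  xlsx::write.xlsx(a, file = out_path, row.names = FALSE)
  print(file.exists(out_path))
} else {
  message("Install the xlsx package first: install.packages(\"xlsx\")")
}
```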

EXPORTING TO CSV

Exporting to a CSV file is similar to exporting to an Excel file. The code for this
needs three pieces of information: a, the Dataframe whose contents are to be exported;
filepath, the location where the CSV file is to be saved; and a comma against the
separator (sep), which indicates that the data has to be exported in CSV format.

After executing the code, go to the location on your system where the CSV file has
been stored (the desktop in this example). Verify that the contents of the Dataframe a
have been exported in CSV format.
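The text does not preserve the exact function used, so the sketch below assumes write.table, which accepts a sep argument as described above. The Dataframe and the temporary destination are again made up for illustration.

```r
# Stand-in for the Dataframe a (made-up values).
a <- data.frame(Name = c("Asha", "Vikram"), Age = c(31, 28))

# Assumed destination: a file called abc.csv in R's temporary folder.
out_path <- file.path(tempdir(), "abc.csv")

# sep = "," writes the values separated by commas;
# row.names = FALSE leaves out the extra column of row numbers.
write.table(a, file = out_path, sep = ",", row.names = FALSE)

# Read the file back to verify what was written.
check <- read.csv(out_path)
print(check)
```

Base R also provides write.csv(a, file = out_path, row.names = FALSE), which has the comma separator built in.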

READING A FILE IN R
Like in the case of exporting data, reading data is also carried out with the help of
code, which in this case is read.table. Let us look at how to read a sample text file
in R.

Assume that on the desktop of your system, a text file called Consultants is
available whose contents are to be read through R. Assume that this file contains a
set of email ids all separated by commas. When this data is read in R, we want to
make sure that each email id is an element in itself.

Let us now deconstruct the code to read data.

The Dataframe where the contents of the text file are to be stored is mentioned first,
which in this example is a. The location of the text file is given next. A comma
is written against the separator (sep) as all values in the text file Consultants are
separated by commas. When we execute this code, a red dot appears in the Console
indicating that the code is being executed.

In the Workspace section, a Dataframe called a is visible. On opening this Dataframe,
all the email ids in the file Consultants are stored as separate elements.

Assume that on the desktop of your system there is another text file called Backup
codes. Here the elements are separated by a space. To read the contents of this file,
using the same code used earlier, replace the comma with a space.

On executing this code, a Dataframe b is created in the Workspace section. On
opening this Dataframe, the contents of the text file are displayed as separate
elements. So as the contents were separated in the text file with a space, a space was
used in the code as a separator.
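Both cases can be sketched with read.table as follows. The two temporary files and their contents (example email ids and codes) are made up to stand in for Consultants and Backup codes.

```r
# Made-up stand-in for the Consultants file: email ids separated by commas.
consultants_path <- tempfile(fileext = ".txt")
writeLines("asha@example.com,vikram@example.com,meera@example.com", consultants_path)

# Made-up stand-in for the Backup codes file: values separated by spaces.
backup_path <- tempfile(fileext = ".txt")
writeLines("1234 5678 9012", backup_path)

# The separator passed to sep matches the one used inside each file.
a <- read.table(file = consultants_path, sep = ",", stringsAsFactors = FALSE)
b <- read.table(file = backup_path, sep = " ")

ncol(a)   # each email id sits in its own column
ncol(b)   # each code sits in its own column
```

As the files have no header row, R names the columns V1, V2, V3 and so on.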

SUMMARY
In this section, exporting data from R, whether in Excel or CSV format, was
covered. The code to read text files in R was also looked at.

To summarize:

Data can be exported in Excel format using the code write.xlsx.

Data can also be exported in CSV format.

To read a text file in R, use the code read.table.

To read elements separated by a comma, use a comma as the separator (sep=",").

To read elements separated by a space, use a space as the separator (sep=" ").

Section 4
Logical Operations and Conditions

PART 1 LOGICAL OPERATORS AND IF CONDITION

INTRODUCTION

Locating values in R is fairly simple with the use of logical operators and conditions.

So, here is what will be covered in this tutorial:

- Understanding logical operators

- Common types of logical operators

- Executing logical operators in R

- Understanding IF condition

- Executing IF condition in R

WHAT IS A LOGICAL OPERATOR


Logical operators are used to locate specific elements in a data structure. Here are
examples of logical operators in R: Greater than, Less than, Equal to, AND, OR.

An example will be used to understand each of these terms better. Assume there is a
table that lists a few names along with certain particulars related to those names like
gender, age and income.

The utility of a logical operator in a table such as this or in a Dataframe in R is to
identify or isolate a specific element or elements, or a certain row or rows.

So, if in this sample table, one wants to identify all those names where the age is
Greater than 23, then the logical operator Greater than is used. Here is the result of
using this operator on the sample table. Looking at the table we can identify 3 names
where the age is greater than 23.

Name    Gender  Age  Income
Aryan   M       20   20000
Gopal   M       21   30000
Zubin   F       24   35000
Ravi    M       26   40000
Umesh   M       26   41000
Anita   F       23   50000

Age Greater than 23

Let us take a look at another example. Suppose we want to identify all those names
whose gender is male and whose income is greater than 40000. Here we need to use
3 logical operators to identify these names. These are gender Equal to male,
followed by the logical operator AND, followed by income Greater than 40000.
Here is the result of applying these logical operators.

Name    Gender  Age  Income
Aryan   M       20   20000
Gopal   M       21   30000
Zubin   F       24   35000
Ravi    M       26   40000
Umesh   M       26   41000
Anita   F       23   50000

Gender Equal to male AND income Greater than 40000

So from these examples we can see that logical operators are very useful in
extracting particular information from a Dataframe or a table.

HOW TO EXECUTE A LOGICAL OPERATOR IN R

We will now look at how to work with these logical operators in R through a simple
exercise. In an earlier section, a Dataframe called Records was created using the
information mentioned in the sample table above. But for purposes of this exercise,
we will create this Dataframe again.

To create the Dataframe Records again, first create the vectors Name, Gender, Age and
Income before creating Records.

After the Dataframe has been created, the following three tasks will be carried out:

1. The vectors in Records will be printed
2. The elements where the age is less than 23 will be identified
3. The elements where gender is male and age greater than 21 will be identified

Finding out the rows where the age is less than 23

Let us begin with finding out the elements or rows where the age is less than 23.
From the table, we know that there are 2 rows where the age is less than 23. These
can be found against the names Aryan and Gopal.

Name    Gender  Age  Income
Aryan   M       20   20000
Gopal   M       21   30000
Zubin   F       24   35000
Ravi    M       26   40000
Umesh   M       26   41000
Anita   F       23   50000

Age Less than 23

So how do we get the same result in R?

To create the Dataframe Records in R, enter the code that builds it from these vectors
and press Control plus Enter.

The Dataframe is created and information relating to this is displayed under
Workspace. On double clicking this Dataframe Records, the information shown in
the table is present in the Dataframe.

Let us now find out the elements or rows in this Dataframe where the age is less than
23. When discussing data structures we touched upon the code to find out the
number of rows. The first rule to remember is to use square brackets after the name
of the Dataframe, and the second rule is that the first argument within the brackets
relates to rows and the second argument relates to columns.

So as we need to find out the rows where the age is less than 23, the logical
statement is mentioned in the first argument and the second argument is left
blank, that is, nothing is mentioned after the comma.

The code to find the rows where the age is less than 23 is:

Records[Records$Age<23, ]

So now let us deconstruct this code.

First mention the Dataframe name which is Records followed by the dollar sign ($)
and the name of the column from where data needs to be identified. In our example
this would be Age. Then enter the logical operator less than ( < ) followed by 23.

On pressing Control plus Enter, we can see in the console section two rows. The rows
displayed tally with the results that we arrived at when we looked at the data
displayed in the table.

Remember to identify rows, enter only the first argument in the code, as the second
argument relates to columns.

Let us now assume that we want to count the number of rows where the age is less
than 23. The details of the rows where the age is less than 23 are displayed, but we
now need a count of these rows.

First we need to create a dataset data1 and assign to it the result of the code we used
earlier. Note the comma before the closing bracket, which leaves the column argument
blank.

data1<-Records[Records$Age<23, ]

As you recall from earlier sections, this has the effect of storing the results of the
code in the dataset data1. So the two rows that we saw in the Console now belong
to the dataset data1.

Now to find out the number of rows in data1, enter the code

nrow(data1)

On pressing Control plus Enter, 2 is displayed in the Console.
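The whole exercise so far can be put together as one short sketch. The values are taken from the sample table; the way the Dataframe is built here is one possible approach and may differ from the code used originally.

```r
# Rebuild the sample table as the Dataframe Records.
Name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age    <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)
Records <- data.frame(Name, Gender, Age, Income)

# Rows where Age is below 23; the blank after the comma keeps every column.
data1 <- Records[Records$Age < 23, ]
print(data1)

# Count of the matching rows.
nrow(data1)
```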

Finding out the rows where gender is male and age is greater than 21

Going back to the table, we can see that there are two records which meet the
conditions of gender being male and age being over 21. These are found against the
names Ravi and Umesh.

Name    Gender  Age  Income
Aryan   M       20   20000
Gopal   M       21   30000
Zubin   F       24   35000
Ravi    M       26   40000
Umesh   M       26   41000
Anita   F       23   50000

Gender Equal to male AND age over 21

So how do we get the same results in R?

In R, enter the code

Records[Records$Gender=="M" & Records$Age>21, ]

So what we have effectively stated in this code is to find in the Dataframe Records, all
rows with gender Equal to M and with age Greater than 21.

On pressing Control plus Enter, the rows with Ravi and Umesh are displayed in the
Console. This, as we have seen, exactly matches the requirement of all rows with
gender male and age greater than 21.

Remember that when entering character data types in R, the values need to be
entered within double quotes.
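The two conditions can be checked in a self-contained sketch that rebuilds Records from the sample table and then applies both conditions joined by &.

```r
# Rebuild the sample table as the Dataframe Records.
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# == tests for equality, & requires both conditions to be TRUE for a row.
males_over_21 <- Records[Records$Gender == "M" & Records$Age > 21, ]
print(males_over_21)
```

Note that a double equals sign (==) is used for comparison; a single equals sign would be read by R as an assignment.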

WHAT IS IF CONDITION IN R

To understand the conditional statement IF in R, let us use an example.

Let us begin by opening the Dataframe Records. In this Dataframe, let us assume we
want to add another variable called Gender_dummy.
The values to be displayed against this variable are 1 against all those rows where M
(male) is displayed and 0 against all those rows where F (female) is displayed.

The code to execute this is shown here:

Records$Gender_dummy<-ifelse(Records$Gender=="M",1,0)

ifelse indicates that IF the value in the column Gender is M display 1, ELSE display 0.

Remember when entering the code to precede the name of the column with a dollar
symbol.

Press Control plus Enter. In the Workspace section, the number of variables has
increased to 5 (where it was earlier 4).

On opening the Dataframe we can see that a new variable Gender_dummy
has been created. In this column 1 has been added against all those
elements where the gender is M or male and 0 against all those elements where the
gender is F or female.

So let us run through the code one more time. IF the statement gender is equal to
male is true, display 1, ELSE display 0 (which means that if the statement gender is
equal to male is not true, then display 0).
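The following sketch rebuilds Records from the sample table and adds the Gender_dummy column, so that the effect of ifelse can be seen on all six rows at once.

```r
# Rebuild the sample table as the Dataframe Records.
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# ifelse checks every row: 1 where Gender is "M", 0 otherwise.
Records$Gender_dummy <- ifelse(Records$Gender == "M", 1, 0)

print(Records)
ncol(Records)   # one more variable than before
```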

SUMMARY
In this section, the use of logical operators like Greater than, Less than, Equal to
and AND has been covered in some detail. The use of the conditional statement ifelse
has also been covered.

To summarize:

Logical operators are used to identify certain elements in a data structure. Examples
are greater than, less than, equal to etc.

When executing a logical operator in R, mention the name of the data
structure and the name of the column which contains the desired variables.

Symbols are used in R for logical operators, like ==, <, > and &.

The IF condition checks whether a certain condition is met before carrying out a
specific function.

PART 2 MERGING DATA

INTRODUCTION
Data in different tables can be merged in R.

So, here is what will be covered in this tutorial:

- Understanding merging of data

- Different ways in which to merge data

- Executing a merge in R

WHAT IS MERGING OF DATA

Let us assume that an organization has prepared a database of its employee
information. One table which we will refer to as Table 1 stores details related to
Employee ID (shown as EmpID), Name and Income. Employee ID is a unique key or
identity.

A second table which we will refer to as Table 2, stores Employee ID, Address and
Nationality.

Assume that the organization wants to combine the information in these two tables
into a single table. To do this, Merge will be used. So, Merge is an operation which
helps in combining data which are present in different tables.

WHAT ARE THE DIFFERENT WAYS TO MERGE DATA
A merge can be carried out in different ways, or in simple terms there is more than
one way to merge data. Again let us look at an example.

Shown here are two datasets. The first dataset has three columns: k1, k2 and data.

The second dataset has two columns: k1 and k3.

In order to merge these 2 datasets, an important condition needs to be met: there
needs to be at least one column in common between the two.

In our example, the column which is common between the two datasets is k1. So it is
possible to merge these two datasets as k1 is common between the 2.

Full merge

The first type of merge possible is called the Full merge. In our example, we have three
columns in the first dataset and two in the second dataset. Of these, one column, k1, is
common. After a full merge one dataset with four columns will be created: k1, k2,
k3 and data. So a full merge is a concatenated table with all the unique columns and
data present in the tables that were merged.

So let us look at the merged table that has been created after a full merge of the 2
datasets.

As you can see, in the column k1 seven elements are visible. 1 is the common
element in k1 in both datasets. All other unique values in k1 and the other columns
are present in the new merged dataset.

Inner merge

The second type of merge is called Inner merge. In this type of merge only the rows
with matching elements in the common column of the datasets to be merged are
brought together.

In our example, the only column that is common between the two datasets is k1.
Within k1, the only common element between the 2 columns is the number 1. So
when an Inner merge is carried out only the row which has the common element is
merged. The figure shown on your screen indicates the result of an Inner merge
between the two datasets.

Left outer & Right outer merge

The third type of merge is called the left outer merge. In this type of merge a
consolidated table is created which keeps every row of the dataset on the left.
Elements from the dataset on the right are added only for the rows that have a match
in the common column; where there is no match, the columns from the right dataset
are left empty (shown as NA in R).

The figure shown on your screen displays the results of a left outer merge.

The inverse of a left outer merge is a right outer merge, where every row of the
dataset on the right is kept and elements from the left dataset are added only where
there is a match.

HOW TO CARRY OUT A MERGE IN R

The first thing we will do is create two dataframes x and y which will contain the
elements of the datasets that we used as an example earlier.

The dataset x will comprise the columns k1, k2 and data, whereas the dataset y will
comprise the columns k1 and k3.

Enter the code to create the dataframes x and y, and press Control plus Enter to
create them. Remember that R is case sensitive, so the names must be typed the same
way every time.

Carrying out a full merge

Let us now look at the syntax or code to carry out a full merge of both the datasets
x and y. Enter the code:

merge(x, y, by.x = "k1", by.y = "k1", all = TRUE)

x and y are the datasets that are going to be merged. k1 is the column that is
common between the datasets x and y, and its name is entered within double quotes.

To indicate that a full merge needs to be carried out, all = TRUE is specified. On
pressing Control plus Enter, a fully merged dataset is displayed in the Console.
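The original values of x and y are not reproduced in the text, so the sketch below uses made-up values. They are chosen so that 1 is the only element of k1 present in both dataframes.

```r
# Hypothetical data: 1 is the only value of k1 common to both dataframes.
x <- data.frame(k1 = c(1, 2, 3, 4),
                k2 = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40))
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"))

# Full merge: every row of x and every row of y is kept.
full <- merge(x, y, by.x = "k1", by.y = "k1", all = TRUE)
print(full)

nrow(full)   # rows in the merged dataset
ncol(full)   # columns in the merged dataset
```

Where a row exists in only one of the two dataframes, the columns coming from the other dataframe are filled with NA.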

Carrying out an inner merge

To carry out an inner merge, enter the code:

merge(x, y, by.x = "k1", by.y = "k1")

Press Control plus Enter. In the Console, the results of an inner merge are shown,
wherein only the rows with common elements in the common column are merged.
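An inner merge can be tried with the same kind of made-up data, in which 1 is the only value of k1 shared by x and y.

```r
# Hypothetical data: 1 is the only value of k1 common to both dataframes.
x <- data.frame(k1 = c(1, 2, 3, 4),
                k2 = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40))
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"))

# Inner merge: without any all argument, only matching rows are kept.
inner <- merge(x, y, by.x = "k1", by.y = "k1")
print(inner)
```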

Carrying out a left outer merge

To carry out a left outer merge, enter the code:

merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE)

all.x is mentioned as the dataset x is to the left. On pressing Control plus Enter, the
results are shown in the Console.

Carrying out a right outer merge

The code to carry out a right outer merge is shown here:

merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE)

Here all.y is specified, as y is the dataset to the right. On pressing Control plus Enter,
the results of the right outer merge are shown in the Console.
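The left and right outer merges can be compared side by side. The data below is made up for illustration, with 1 as the only value of k1 shared by x and y.

```r
# Hypothetical data: 1 is the only value of k1 common to both dataframes.
x <- data.frame(k1 = c(1, 2, 3, 4),
                k2 = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40))
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"))

# Left outer merge: every row of x is kept; k3 is NA where y has no match.
left <- merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE)
print(left)

# Right outer merge: every row of y is kept; k2 and data are NA where x has no match.
right <- merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE)
print(right)
```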

For datasets to be merged there has to be at least one column in common between
them.

SUMMARY
In this section, the different ways in which two data structures can be merged have
been looked at in some detail.

To summarize:

The merge operation brings together elements of different datasets or tables
into a single consolidated table.

To carry out a merge, at least one of the columns in each of the datasets to be
merged must be common.

There is more than one way to merge data.

A full merge combines all of the elements in the datasets into a consolidated
dataset.

An inner merge combines only the rows which have elements in
common (within the common column).

A left outer merge keeps all the rows of the table or dataset to the left and adds
matching elements from the right. A right outer merge keeps all the rows of the table
or dataset to the right and adds matching elements from the left.

Section 5
Text Analytics and Word Cloud

PART 1 UNDERSTANDING TEXT ANALYTICS

INTRODUCTION

Analyzing text is extremely powerful and is an integral part of our social media and
web activity.

So, here is what will be covered in this tutorial:

- Understanding textual or text analytics

- Importance of text analytics for organizations

- Common terms in text analytics

- Understanding the framework to create a Word Cloud

WHAT IS TEXT ANALYTICS

To understand this let us look at the type of information that is available to
organizations today. In today's competitive environment information is power. A lot
of this information or data is present on the web in the form of text or videos. Very
rarely is this information available to organizations in a structured format which can
be stored in a database. In fact organizations need to take data that is available out
there and structure it so that it is useful for them. But this can be a daunting task
especially where most of the information is in text format.

This is where textual or text analytics plays a key role.

So if we were to define text analytics we could say that it is the process of deriving
high quality information from unstructured text. Simply put, it is making sense or
giving structure to data or information which is not structured.

HOW IS TEXT ANALYTICS USEFUL
Let us suppose that you have been searching on the web for anything related to
computer games. On your search results page you will often find ads and
recommended pages related to computer games. A lot of what you see is dependent
on your search history or the keywords that you have been using.

Likewise, when you are on the Newsfeed page of Facebook, you can see posts on
Suggested pages or Ads displayed on the right hand side of your page. Maybe you
have been looking for something specific on Facebook or have been spending time on
a certain company page. Those suggested pages or ads could be very similar to the
pages that you have been looking for or spending time on in Facebook.

If you have a Gmail account, you will find in your Spam folder a lot of mail that you
yourself did not actually send to Spam. All of the examples that we have cited
are a result of using text analytics. Take the example of Spam filtering. There have
been instances when you have flagged mail from a certain sender as Spam. Your
mail service provider will now automatically look for the words used in that mail and
send any mail with that text to Spam. Likewise, on Facebook what you search for or
write is being analysed to come up with suggested pages and display ads.

Text analytics is an exciting and useful part of analytics. To understand this concept
better, in the sections ahead we are going to focus on two aspects:
1. Understanding the common terms used in text analytics; and
2. Completing a text analytics project using data from a popular social medium,
   Twitter.
The project will focus on the framework to create a Word Cloud out of a set of
tweets on Big data, R and analytics.

You will need to execute this project in R using the concepts that will be
discussed in this tutorial. We will of course be guiding you along the way.

IMPORTANT TERMS IN TEXT ANALYTICS

Corpus

The first term that we will look at is Corpus. A Corpus is the data structure that is
used to manage the text that is being analyzed. Strictly speaking, it is a collection
of text documents; in our project, each tweet will be one document in the Corpus.

A simple way to look at a Corpus is to think of a dictionary.

It holds the relevant words of a piece of text in a structured format. Let us assume
that we are analyzing a blog on democracy. The Corpus will hold all the relevant words
in the blog related to democracy. So, just as in a dictionary, where you look up the
term democracy and find all the words associated with it listed in one place, the
Corpus keeps all the relevant words from the blog in a single place.

An important point to remember about a Corpus is that just like in the case of a
dataset, it needs to be cleaned up.

Cleaning up a Corpus

Stopwords

So what do we typically clean from a Corpus? Firstly, words which do not really carry
meaning by themselves need to be removed. For example, if the blog that we are
analyzing uses the words "the", "or", "of", "am", "is", "are" and "was" quite
frequently, these words carry little or no meaning or value and hence need to be
removed from the Corpus. These types of words are referred to as stopwords. There are
around 196 stopwords that have been identified.

You need not worry about identifying these words by yourself, because in R we will
be using the Text Mining or tm package, which will help you in identifying and
removing stopwords from your Corpus.

In addition to the 196 stopwords identified, you can also add your own stopwords
based on what you think is useful or not. For example, if you think that the names of
people in the text you are analyzing are not useful, these can be added to the list of
stopwords to be cleaned from the Corpus.

Numbers

Secondly, we can also remove numbers from the Corpus. So, if numbers have been
used to demarcate points like 1, 2, 3 and so on, these can be removed from the
Corpus as they have no meaning by themselves.

Punctuation

Thirdly, we can also remove punctuation marks like commas, semicolons, colons, full
stops etc. from the Corpus.

Treatment of case

Fourthly, we can decide whether the same words used in a text need to begin with
upper case or lower case.

For example, if democracy is spelt in one place with lower case but in another sentence
begins with upper case, then we need to decide if in the Corpus democracy should
always start with upper case or lower case.

Stemming

The next type of clean up that can be done is through a process called Stemming.
To understand Stemming, let us assume that the blog we are analyzing uses the word
"participate" in different forms like "participated", "participating", "participatory"
etc. across the blog. All these words relate to the same root word, which is
"participate".

The process of Stemming strips common endings so that these forms can be counted as
one term, no matter the tense used. Note that the stem produced is not always a
dictionary word, and a stemmer that works by stripping endings may leave irregular
forms untouched. For a verb like "fly", a form such as "flying" is handled through its
ending, but "flew" or "flown" will not necessarily be reduced to "fly".
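To see what a stemmer actually does with such words, here is a minimal sketch. It assumes the SnowballC package is installed and uses its wordStem function; the word list is made up for illustration.

```r
library(SnowballC)  # provides wordStem(), a suffix-stripping stemmer

words <- c("participate", "participated", "participating", "flying", "flew")
stems <- wordStem(words, language = "english")

# Show each word next to its stem to inspect the result
data.frame(word = words, stem = stems)
```

The regular forms of participate end up with a shared stem, which is what allows their counts to be added together later.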

Framework

So, the framework to start analyzing text begins with creating a Corpus, which is a
data structure to store text. This is then followed by the process of cleaning up the
Corpus, wherein the following is carried out:

- Stopwords are removed
- Numbers are removed
- Punctuation is removed
- Treatment of case is decided
- Stemming is done

Another important term is Tokenize. In this process a sentence is broken down into
individual tokens so that each word in that sentence is a separate entity. So the
sentence "Parliament is the seat of democracy", when tokenized, would be:
"Parliament", "is", "the", "seat", "of", "democracy".
This method is also used in search engines like Google when they look at keywords.
For example, if the keywords "analytics jobs" are entered, they would first be broken
down into 2 tokens: "analytics" and "jobs".
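Tokenizing can be tried out in base R without any extra package. The sketch below simply splits the sentence on white space; real tokenizers handle punctuation and other cases more carefully.

```r
sentence <- "Parliament is the seat of democracy"

# strsplit() returns a list with one element per input string;
# [[1]] picks the tokens of our single sentence
tokens <- strsplit(sentence, "\\s+")[[1]]
tokens
```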

WHAT IS NEEDED TO CREATE A WORD CLOUD

TDM

Having arrived at a clean Corpus, we now need to decide what to do with it.
Remember that a Corpus in itself is not an output, but a dictionary to be used to
create something else. So if our final objective is to create a Word Cloud out of a
Corpus, the Corpus needs to be converted into a format which enables a Word Cloud
to be created from it.

To understand this better, we need to know what is required to create a Word Cloud.
Two very simple components make up a Word Cloud: words, and the number of
times or frequency with which those words appear. For example, look at the table
shown below:

Words        Frequency
People       20
Democracy    35
Freedom      40

The numbers next to those words represent the number of times they appear in a
piece of text like a blog or an article.

When a Word Cloud is created, the frequency will determine the size of the word
within the Word Cloud.

For example in the image shown below, the larger the size of the words, the more
frequently they would have recurred in the content or the text from which this Word
Cloud was created.

So, the structure which has been described above is derived from a Term Document
Matrix or TDM. To take a Corpus and make it Word Cloud ready, we need to create
a TDM. A TDM is made up of rows and columns. The rows represent the words (terms),
the columns represent the documents, and each cell holds the number of times a word
occurs in a document. Adding up a row gives the frequency of that word.
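The pairing of words and frequencies can be sketched in base R with the table function. The tokens below are made up and stand in for the words of a cleaned piece of text.

```r
# Hypothetical tokens from a cleaned piece of text
tokens <- c("freedom", "people", "democracy", "freedom", "democracy", "freedom")

# Count how often each word occurs and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)

freq_df <- data.frame(word = names(freq),
                      frequency = as.vector(freq),
                      stringsAsFactors = FALSE)
freq_df
```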

So let's stop for a while and ask ourselves a question: "I have a blog and I want to
create a Word Cloud out of it. How can I do it?"
Well, everything we have discussed so far should answer our question. Quite simply:
1. Create a Corpus
2. Clean up the Corpus
3. Create a TDM or Term Document Matrix out of it
4. Create your Word Cloud

Installing the TM package in R

Creating a Word Cloud in R is possible through a package called the tm or Text
Mining package.

For example, to help with Step 2, which is cleaning up the Corpus, the tm package
provides a function called tm_map. To carry out various types of processes like
removing Stopwords, the appropriate transformation needs to be passed as an argument
to tm_map.

The TM package comes with some really good documentation which you need to go
through to understand how to execute each of the steps we have talked about.
Remember to also use the Help feature in R for specific queries.

Before we move on to our project of creating a Word Cloud out of a set of tweets,
let's make sure we do the following.

1. Download the file comprising the tweets that we need to convert into a Word
   Cloud. You can find this in the Download section of this tutorial.

   When opening this file, remember to right click and select Open with R Studio.

2. When the file is opened in R Studio, it will be visible in the Workspace section
   along with the number of tweets in it, which is 320.

3. Install and load the list of packages that are mentioned in the Download
   section of this tutorial. The packages are:
   a) twitteR: This package is needed to read the tweets that have been
      downloaded
   b) wordcloud: This is required to create the Word Cloud
   c) tm: As mentioned earlier, the tm or Text Mining package is needed to
      create the Corpus, clean up the Corpus and create the TDM
   d) SnowballC: This package is required to enable Stemming to be carried out.
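Before starting, you may want to check which of these packages are already installed. The sketch below assumes the package names as they are spelt on CRAN (twitteR, wordcloud, tm, SnowballC); their availability may have changed since this tutorial was written, so the install line is left commented out.

```r
pkgs <- c("twitteR", "wordcloud", "tm", "SnowballC")

# requireNamespace() returns TRUE when a package is installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```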

SUMMARY
In this section, the meaning and importance of Text Analytics were covered. Some
important terms in Text Analytics and the framework to create a Word Cloud have also
been explained.

To summarize:

- Text Analytics gives structure to unstructured textual data.

- In R, Text Analytics is done with the help of the tm or Text Mining package.

- In Text Analytics, the first step is to create the Corpus.

- The next step is to clean the Corpus by removing Stopwords, numbers and
  punctuation, deciding the treatment of case, stemming etc.

- To convert a Corpus to a Word Cloud, a TDM or Term Document Matrix needs
  to be created.

PART 2 UNDERSTANDING TEXT ANALYTICS

INTRODUCTION

Word Clouds are a product of text analytics. They are not so difficult to create.

So, here is what will be covered in this tutorial:

- Understanding how to create a Word Cloud from twitter data

- The syntax used to carry out some important steps

- Understanding how to use a few packages in R

HOW TO CREATE A WORD CLOUD

STEP 1: CREATING A DATAFRAME


Open the dataset of sample tweets in R Studio.

Before we begin, you should have downloaded the sample set of tweets and
opened it in R Studio. You should have also installed and loaded the list of packages
that were specified in the previous section.

As you can see in the workspace section, a list of 320 tweets is displayed.

Double click on this list and you will see a list displayed in Notepad.

From this list it is pretty evident that there are no details of the actual content of the
tweets. So what we have is essentially unstructured text.

To convert this unstructured text into a Dataframe, enter the code shown below:

library(twitteR)

df <- do.call(rbind, lapply(tweets, as.data.frame))

Let us now deconstruct this code.

df = the Twitter data is going to be stored in a dataframe called df

do.call = a function which calls another function with a list of arguments. In the
case of our code the function that is being called is rbind or row bind, and the list
of arguments is the list of one-row dataframes produced by lapply.

For a detailed explanation of do.call, go to the Help section in R Studio.
Type do.call in the Search field and press Enter. As you can see, a detailed
explanation of the function do.call is shown in Help. You can go through this
explanation to understand this function better.

rbind = an action to bind or combine together the 320 rows of tweets

lapply = a function which applies another function to each element of a list. Here it
applies as.data.frame to each tweet, converting every tweet into a one-row dataframe
before the rows are combined
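The same pattern can be tried on a small made-up list in base R. In the sketch below, plain lists stand in for the tweets (the real tweet objects come from the twitteR package), and the field names are hypothetical.

```r
# Three hypothetical "tweets", each stored as a list of fields
tweets <- list(
  list(text = "Big data with R",       retweetCount = 3),
  list(text = "Text analytics basics", retweetCount = 0),
  list(text = "Word clouds in R",      retweetCount = 5)
)

# lapply() turns every tweet into a one-row dataframe ...
rows <- lapply(tweets, as.data.frame, stringsAsFactors = FALSE)

# ... and do.call() passes all of those rows to a single rbind() call
df <- do.call(rbind, rows)
dim(df)
```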

Let us now execute the code mentioned above. Press Control plus Enter. As you can
see in the Workspace section, a dataframe df with 320 observations is visible. These
320 observations are nothing but the Twitter data which has been converted into a
dataframe.

Let us open the dataframe. As you can see each row is numbered, with the first row
relating to the first tweet, the second row the second tweet and so on. This
dataframe will run into 320 rows which corresponds to the 320 tweets in our original
data structure.

The dataframe has 14 variables. For the purposes of our exercise we will focus on the
text column of the dataframe.

In order to view the dimension of the dataframe, enter the code

dim(df)

Press Control plus Enter and you can see in the Console the numbers 320 and 14
displayed.

STEP 2: INSTALLING THE TM PACKAGE
Once the tm package has been loaded, we can find out how to use it from the Help
section. Go to Help and enter tm in the Search field. Press Enter. Click on the link tm.

You can see two links shown here: a Description file, and Overview of user guides
and package vignettes.

We will click on the second option which is the Overview of user guides and package
vignettes.

Once we do that, a PDF file on the Introduction to the tm package will open.

As we scroll through this document, you will see the main tasks that are
possible with the tm package listed.

It covers everything from how to eliminate stopwords, to how to carry out stemming, to
how to create a Term Document Matrix. The Term Document Matrix or TDM, as we know, is
essential to creating a Word Cloud as it records the set of words along with the
frequency or the number of times they appear in a given text.

STEP 3: UNDERSTANDING TDM
There are actually two types of matrix that can be created out of a Corpus. The first is
a Term Document Matrix where the terms or the words are rows. The second is
Document Term Matrix where the documents are rows.

As you can see in the image below, an example of a Document Term Matrix has been
shown. Here the documents are mentioned in rows.

Let us interpret this matrix. Listed under Docs are the names or numbers of the texts
that have been analysed. Listed as columns are the words which appear in these
documents. So, if the number 10 appeared against Doc 127 and under the word "able",
it would mean that the word "able" has appeared 10 times in document 127.
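The difference between the two types of matrix can be checked on a tiny example. This sketch assumes the tm package is installed; the three short documents are made up.

```r
library(tm)

docs <- c("big data analytics", "data analytics with text", "text mining")
corpus <- Corpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)  # terms are rows, documents are columns
dtm <- DocumentTermMatrix(corpus)  # documents are rows, terms are columns

dim(tdm)      # number of terms, number of documents
dim(dtm)      # number of documents, number of terms
inspect(dtm)  # prints the counts for each document and term
```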

STEP 4: CREATING A CORPUS


The next step in our project is to convert the dataframe df that we have created into
a Corpus. A Corpus if you recall is the data structure to store all the text that will be
analyzed.

In order to do this, use the syntax shown below.

myCorpus <- Corpus(VectorSource(df$text))

Let us deconstruct this syntax.

The name of the Corpus to be created is myCorpus. As we know, a vector is a column,
so the term VectorSource refers to the column in the dataframe whose data is to
be copied to the Corpus. Since we are only interested in the text portion of the
dataframe, we need to indicate in the syntax that only the text column needs to be
added to the Corpus.

If we open the dataframe df, you can see that the column which contains the contents
of the tweets is referred to as text.

So in the syntax we mention df$text. Press Control plus Enter to create the Corpus.
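To try this step without the downloaded tweets, the sketch below builds a small made-up dataframe with a text column and converts that column into a Corpus. It assumes the tm package is installed.

```r
library(tm)

# A hypothetical stand-in for the dataframe of tweets
df <- data.frame(
  text = c("Big data with R", "Text analytics basics", "Word clouds in R"),
  retweetCount = c(3, 0, 5),
  stringsAsFactors = FALSE
)

# Only the text column is passed on to the Corpus
myCorpus <- Corpus(VectorSource(df$text))

length(myCorpus)             # one document per tweet
as.character(myCorpus[[1]])  # text of the first document
```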

STEP 5: CLEANING THE CORPUS
Now that we have created a Corpus called myCorpus, which contains all the text to be
converted to a Word Cloud, we need to proceed to the next step, which is the
cleaning up of the Corpus. The function in the tm package which will help in the
clean up of the Corpus is tm_map. If you refer to the documentation
on the tm package, which we looked at earlier, all the information needed to transform
the Corpus has been specified in detail. Cleaning up the Corpus is one kind of
transformation of the Corpus.

Within the document, you will find the code required to carry out various processes,
from eliminating white space or blank space from the Corpus, to conversion to lower
case, to removal of stop words.

In the Help document, in the code shown to transform the Corpus, a sample Corpus
named reuters has been used. For the purposes of our project we need to use the
same code but replace reuters with the name myCorpus.

Shown here in R Studio is a list of commands to clean up myCorpus.

In R Studio, against each line of code or syntax mentioned we will hit Control plus
Enter and start cleaning up the Corpus. We will first convert to lower case, then
remove punctuation and then remove numbers from the Corpus.
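The three clean up steps named above can be sketched as follows. This assumes a version of the tm package that provides content_transformer (older versions allowed tolower to be passed to tm_map directly); the two sample texts are made up.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c("Big Data, R & Analytics in 2014!",
                                  "Text Mining: 3 quick tips.")))

# 1. Convert to lower case (tolower is a base R function, so it is wrapped)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# 2. Remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# 3. Remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)

as.character(myCorpus[[1]])
```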

Removing URLS

It is also possible to remove URLs from the Corpus with the help of a user defined
function which is shown below.

Let us deconstruct this function. The name of the function is removeurl. By
indicating http in the code followed by alnum followed by two double quotes, we are
stating that wherever any URL starting with http is present, it needs to be blanked
out. x in the code is the placeholder for the text being cleaned, which in our case
comes from myCorpus. In the next line we can see the code which calls the removeurl
function.

To find out the meaning of gsub, which is used in the function, type
gsub in the search field of the Help section. As you can see from the text displayed,
gsub is a function which replaces every match of a pattern in a piece of text.

So in the removeurl function gsub is replacing any text starting with http with a blank.
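The function itself is not reproduced in this text, so the sketch below is a reconstruction from the description above (http, followed by alnum, replaced by an empty string) and may differ from the original listing. Note that this pattern only matches letters and digits, so it assumes that punctuation has already been removed from the tweets.

```r
# Reconstructed from the description: blank out "http" plus any
# letters or digits that follow it
removeurl <- function(x) gsub("http[[:alnum:]]*", "", x)

# Made-up tweets from which punctuation has already been removed
tweets <- c("great r tutorial httptcoabc123", "no link here")
removeurl(tweets)

# With tm loaded, the function could be applied to the Corpus like this
# (an assumption about how the call looks in current versions of tm):
# myCorpus <- tm_map(myCorpus, content_transformer(removeurl))
```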

Removing stop words

We will now look at how to remove stop words from the Corpus. As you can see in
the Console there are a number of words which by themselves do not make sense.
Some of these words are once, why, each, in, to, etc.

There are around 196 Stop Words that have been identified, but you can include
more as well.

In addition, we also want to include some other words which for the purposes of our
project are of no value or utility. These words are English, available and via.

Now using the code shown below we will go ahead and remove the stop words
including the ones we have added from the Corpus.

Press Control plus Enter.
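A minimal sketch of this step is given here, assuming the tm package is installed. The name myStopwords and the sample texts are made up; the extra words are written in lower case because the Corpus has already been converted to lower case.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c("big data is available via r",
                                  "why english text needs cleaning")))

# Standard English stop words plus the three words added for this project
myStopwords <- c(stopwords("english"), "english", "available", "via")

myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
as.character(myCorpus[[1]])
```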

Stemming

Now that we have removed stop words, we will move on to another important
process in the clean up of the Corpus, which is called Stemming. To do this we first
create a copy of the Corpus by using the code shown.

Stemming is meant to convert words like eating, eaten etc. to one root word, eat,
although a stemmer that works by stripping endings may leave irregular forms such as
eaten unchanged.

In order to carry out stemming we need to install a package in R called SnowballC. To
do this, go to Packages, Install Packages and write out the name of the package, which
is SnowballC.

Since we have already installed this package we will click on Cancel, but in case you
have not, then click on Install instead.

To carry out stemming, we need to use a function called stemDocument. This function
belongs to the tm package and relies on the SnowballC package to do the stemming.

Press Control plus Enter to start stemming.
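The two steps, copying the Corpus and then stemming it, can be sketched as below. This assumes the tm and SnowballC packages are installed; the name myCorpusCopy and the sample texts are made up.

```r
library(tm)         # stemDocument() is part of tm ...
library(SnowballC)  # ... and uses SnowballC to do the stemming

myCorpus <- Corpus(VectorSource(c("participated participating",
                                  "mining mined")))

# Keep an unstemmed copy of the Corpus before changing it
myCorpusCopy <- myCorpus

myCorpus <- tm_map(myCorpus, stemDocument)

as.character(myCorpus[[1]])      # stemmed text
as.character(myCorpusCopy[[1]])  # original text, untouched
```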

STEP 6: CREATING THE TERM DOCUMENT MATRIX


Let us pause for a while and try to recollect the next step in the framework to convert
a Corpus to a Word Cloud. After cleaning up the Corpus, the next step is to
convert the Corpus into a Term Document Matrix.

Shown here is the code to convert the Corpus into a Term Document Matrix.

Let us deconstruct this code.

The code indicates that any word with a frequency from one to infinity needs to be
added to the Term Document Matrix.

This need not be mentioned in the code, because by default words with all types of
frequencies will be added to the term document matrix.

Press Control plus Enter. The Term Document Matrix has been created.
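The code for this step is not reproduced in this text, so the following is only one plausible form of it, assuming the tm package is installed. The control option used here, wordLengths, is an assumption: it sets the minimum and maximum length of the words to keep, and in tm it defaults to c(3, Inf), which would drop a one-letter term like r unless the lower limit is reduced to 1.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c("r for big data",
                                  "big data analytics",
                                  "text mining in r")))

# Keep words of every length, from 1 character upwards (assumed option)
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(wordLengths = c(1, Inf)))
myTdm  # prints a summary, including the sparsity
```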

A Term Document Matrix is the transpose of the Document Term Matrix: the terms are
rows and the documents are columns. Each value indicates the number of times the word
has appeared in a document. If we look at the Console we can see the term Sparsity
followed by 99%. This means that 99% of the cells in the matrix are zero; in other
words, most words do not appear in most of the documents.

To view the contents of the Term Document Matrix, go to the Workspace section
where you can see the value myTdm displayed. However, the Term Document
Matrix is in the form of a List, whereas we would like to see it in the form of a matrix.

In order to do this, we create a matrix called m and convert the List into
this matrix using the code shown below.

In the Workspace section, a matrix m is displayed. Double click on this and our Term
Document Matrix opens up! So let us break this down.
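The conversion can be sketched with as.matrix, assuming the tm package is installed. The three short documents are made up so that the resulting matrix is small enough to read.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c("big data", "big text", "text mining")))
myTdm <- TermDocumentMatrix(myCorpus)

# Convert the Term Document Matrix into an ordinary matrix
m <- as.matrix(myTdm)

dim(m)      # terms in rows, documents in columns
m["big", ]  # counts of the word "big" in each document
```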

The first column, row names, lists the words that are contained in the 320
tweets (remember all stop words have been removed, so these are the actual usable
words). The columns which are numbered 1, 2, 3 and so on are the tweets,
which we know will run to 320. The numbers indicate the number of times these
words appear in each of these 320 tweets. In most cases the number is zero,
indicating that the word has not appeared in that tweet. To find out the frequency or
the cumulative number of times a word appears across the 320 tweets, we will need
to look at the sum of each row. So for example, to find out the frequency of the word
big, we will need to add up all the numbers under each of the 320 columns against
the row big.

STEP 7: CALCULATING FREQUENCIES
In order to create a Word Cloud we will need to plot each word against its frequency.
The code to calculate the frequency of words and sort it in descending order is shown
here.

Let us deconstruct this code. The term rowSums with m within brackets indicates
that the summation of each row in the Term Document Matrix m will be carried out.
decreasing = TRUE means that the sums will be arranged in descending
order. Press Control plus Enter. The result will be stored in wordfreq.

To view the frequencies that have been calculated, select wordfreq and press Control
plus Enter. The results are displayed in the Console as a long List of named values.
So an easier alternative would be to convert the List wordfreq into a matrix
wordfreq1 using the code that is shown on the screen.

In the Workspace double click on the matrix wordfreq1. Shown on the screen is a
matrix of all the words in the Corpus myCorpus along with their frequencies or the
cumulative number of times they appear in the 320 tweets. Also, the frequencies
have been arranged in descending order from the highest to the lowest.
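The calculation can be followed on a small hand-made matrix in base R. The matrix m below is made up and stands in for the real Term Document Matrix of 320 tweets.

```r
# A made-up Term Document Matrix: 3 words (rows) across 3 tweets (columns)
m <- matrix(c(1, 1, 1,
              1, 0, 0,
              0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("big", "data", "text"),
                            Docs = c("1", "2", "3")))

# Sum each row to get the frequency of each word, highest first
wordfreq <- sort(rowSums(m), decreasing = TRUE)

# Convert the named values into a one-column matrix for easier viewing
wordfreq1 <- as.matrix(wordfreq)
wordfreq1
```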

STEP 8: CREATING THE WORD CLOUD
Now all that is left to be done is to generate the Word Cloud. Let us go to the Help
section in R Studio and enter the word wordcloud. Click on the link which appears.
As you can see, the arguments necessary to create a Word Cloud are listed.

The first requirement shown is words, followed by their frequencies (freq). There are
many other options listed so that one can create a Word Cloud based on different
conditions. But a Word Cloud can be generated with just 2 pieces of information:
words and their frequencies.

The frequency matrix that we will be using to generate the Word Cloud is
called wordfreq1. To generate the Word Cloud enter the following code:
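The original listing is not reproduced in this text, so here is a minimal sketch of what such a call can look like. It assumes the wordcloud package is installed, and the frequencies are made up. min.freq is set to 1 so that every word is drawn.

```r
library(wordcloud)

# A made-up frequency matrix with the words as row names
wordfreq1 <- as.matrix(c(r = 40, analysis = 25, research = 18,
                         example = 12, data = 6, mining = 1))

# words come from the row names, frequencies from the single column
wordcloud(words = rownames(wordfreq1), freq = wordfreq1[, 1], min.freq = 1)
```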

Press Control plus Enter. The Word Cloud is being created in the Plots section of R
studio.

The Word Cloud creates the words with the highest frequencies first. So, words like
r, analysis, research and example have high frequencies and hence are displayed
quite prominently in the Word Cloud.

In the matrix, there were many words with a frequency of 1. We can choose not to
show those words in the Word Cloud. To exclude these words from the Word Cloud,
enter in the code an option to include only those words with a frequency of say 5 and
above. The code to execute this is shown below:
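A sketch of this option is given here, assuming the wordcloud package is installed and using made-up frequencies. The argument min.freq leaves out every word whose frequency is below the value given.

```r
library(wordcloud)

wordfreq <- c(r = 40, analysis = 25, research = 18,
              example = 12, data = 6, mining = 1)

# Only words with a frequency of 5 and above are drawn
wordcloud(words = names(wordfreq), freq = wordfreq, min.freq = 5)

# The words that meet the cut-off
kept <- names(wordfreq)[wordfreq >= 5]
kept
```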

This code can also be found in the Help section.

Press Control plus Enter. You can see that fewer words are being added to the Word
Cloud.

Another thing to remember is that each time a Word Cloud is generated the position
of the words will change. As we can see, r which was earlier vertical is now horizontal
and is located in a different place. In order to ensure that the position of a word does
not change each time the Word Cloud is generated, we can call the function
set.seed before generating the Word Cloud.

In order to limit the number of words to be shown in the Word Cloud, we can use the
argument max.words. We can also determine the colour of the Word Cloud by using
the argument colors, for example colors = "red".
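These three options can be combined in one call, as in the sketch below. It assumes the wordcloud package is installed; the frequencies and the seed value 123 are arbitrary choices for illustration.

```r
library(wordcloud)

wordfreq <- c(r = 40, analysis = 25, research = 18,
              example = 12, data = 6, mining = 1)

# Fixing the seed keeps the layout the same from one run to the next
set.seed(123)

wordcloud(words = names(wordfreq), freq = wordfreq,
          min.freq = 5,    # leave out words that appear fewer than 5 times
          max.words = 4,   # draw at most 4 words
          colors = "red")  # draw the words in red
```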

As you can see the words are now being displayed in red.

So we have completed the objective of this project which was to generate a Word
Cloud. To generate a Word Cloud all that is needed are words and their frequencies.
Other parameters can also be defined. Modifications can also be done on the Word
Cloud like minimum frequency, maximum number of words to be displayed and
colour.

Do try out the other parameters available by referring to the content in the Help
section under Word Cloud.

SUMMARY
Creating a Word Cloud in R is a matter of using the right packages with the right set
of text or words.

To summarize:

- Unstructured text can be converted into a structured format like a Dataframe
  in R using the correct syntax.

- The tm package in R, which is needed to carry out text analysis, comes with
  detailed documentation.

- While converting a Dataframe to a Corpus, the name of the vector/column
  which contains the text needs to be indicated.

- The tm package documentation displays the code to clean up a Corpus.

- Apart from cleaning up stop words, numbers and punctuation, URLs can also be
  removed through a user defined function.

- A TDM is first created as a List and then converted to a matrix in R.

- To calculate the frequency of words in a TDM, the values in the row against each
  word in the matrix need to be summed up.

- A Word Cloud can be created once words and their frequencies are mapped
  out.

- Modifications to a Word Cloud include defining minimum frequency,
  maximum number of words to be displayed and colour.

By downloading this material and signing up for our course, you have supported us in
our mission to help individuals and organizations take smarter decisions every day.

We hope to keep upgrading this material by focusing on improving quality and
providing additional lectures and material on this subject.

To send us feedback on how to improve this course, do write to us at
help@analyticstraining.in with the subject line R Handbook.
