
+

Data Objects in R
ANLY 510 Late Fall
Dr. Stephen Penn
December 10, 2015

+
Working Directory

While you're in R, you may want to know your present working directory.

> getwd()

Or, you may want to change directories.

> setwd("the path")

And, you may want to see what is in the current folder.

> list.files()
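
As a rough sketch, the three commands can be tried together in a throwaway folder. The folder name wd_demo and the file example.txt are made up for illustration only.

```r
# Remember where we started so we can come back
old_dir <- getwd()

# Create a temporary folder (hypothetical name) and move into it
demo_dir <- tempfile("wd_demo")
dir.create(demo_dir)
setwd(demo_dir)

# Create one empty file, then list what is in the current folder
invisible(file.create("example.txt"))
files_here <- list.files()
print(files_here)

# Go back to the original working directory
setwd(old_dir)
```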

+
Loading Data

> Dataset1 <- read.csv("C:/mystatsdata/mysurvey.csv")

My preferred method is to use RStudio's Tools menu

Tools -> Import Dataset -> From Text File

This creates the path for you, which I usually misspell
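
Here is a minimal sketch of read.csv() that runs anywhere. It writes a tiny made-up CSV file to a temporary location first, so the file contents and column names below are assumptions standing in for your own survey file.

```r
# Write a small, made-up CSV file to a temporary path
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,answer",
             "1,34,yes",
             "2,51,no",
             "3,28,yes"),
           csv_path)

# Read it in; the first row becomes the column names
Dataset1 <- read.csv(csv_path)
print(Dataset1)
```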

+
Other read functions

To read in a table of numeric values

> read.table(

You can also read MS Excel spreadsheets

You can issue SQL commands to pull data

However, we will just use CSV files in this class

+
Data Sets

Once you've loaded a data set, you will want to look at it.

> str(dataset)

This tells us that we have a data frame with 95 rows of data and 45 columns.

Also, use head() and tail()
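
A short sketch of these inspection commands, using the built-in iris data set in place of the class data set (which is not available here):

```r
# Structure: object type, number of rows and columns, and column types
str(iris)

# Just the dimensions: rows, then columns
dims <- dim(iris)
print(dims)

# First and last three rows
first_rows <- head(iris, 3)
last_rows <- tail(iris, 3)
print(first_rows)
print(last_rows)
```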

+
So, what makes up a data frame?

A data frame is an object with many components

A single value in R is called a scalar (R stores it as a vector of length 1)

Such as 3

> x <- 3

The above command places the value 3 into the variable x

We know that x is numeric, but we don't know how it is stored

> typeof(x)

> class(x)

> mode(x)
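
To see what these three commands report, here is a small sketch. The variable y is an addition for comparison and is not part of the slide.

```r
x <- 3
x_type <- typeof(x)   # "double": how the value is stored
x_class <- class(x)   # "numeric"
x_mode <- mode(x)     # "numeric"
print(c(x_type, x_class, x_mode))

# Adding L asks R to store a true integer instead
y <- 3L
y_type <- typeof(y)   # "integer"
print(y_type)
```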

+
Current list of variables

If you want to see the current list of variables that exist in your R environment, issue the following command

> ls()

You can also see data sets that come with base R, such as the iris data set

> data()

+
Vectors

A vector is a series of values of the same type, most often numeric

Order is important in a vector

> x <- c(1,3,5,7,9)

> x <- c(5:55)

You can access each element in a vector by its index (its position, counting from 1)

> x[1]

> x[3]
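
A quick sketch of what these index commands return. The vector y and the multi-element lookup are additions for illustration.

```r
x <- c(1, 3, 5, 7, 9)
first <- x[1]          # 1: positions start at 1, not 0
third <- x[3]          # 5
pair <- x[c(2, 4)]     # 3 7: an index can itself be a vector
print(first)
print(third)
print(pair)

# The colon operator builds a sequence; 5:55 has 51 elements
y <- c(5:55)
print(length(y))
```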

+
Functions on Vectors

You can perform functions on vectors

mean(x), var(x), sd(x), max(x), min(x), length(x)

Vectors can hold integers, doubles, complex numbers, characters, and logicals

While a vector holds many values, it can only hold one type of data

> length(x)

> typeof(x[1])

If you create a vector with numbers and characters, everything will be characters
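
The sketch below runs the summary functions on a small vector and then shows the conversion to characters. The vector named mixed is made up for illustration.

```r
x <- c(1, 3, 5, 7, 9)
x_mean <- mean(x)      # 5
x_var <- var(x)        # 10 (sample variance)
x_sd <- sd(x)          # square root of the variance
x_len <- length(x)     # 5
print(c(mean = x_mean, var = x_var, sd = x_sd, max = max(x), min = min(x)))

# Mixing numbers and characters converts everything to character
mixed <- c(1, "two", 3)
mixed_type <- typeof(mixed)   # "character"
print(mixed)                  # "1" "two" "3"
```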

+
Arrays

An array is a vector with more than one dimension

Again, order is important in an array, just like vectors

> x <- c(1:20)

> x <- array(x, dim=c(2,10))

But, you really shouldn't think of it as multiple dimensions, because the linear sequence is retained

> x <- array(x, dim=c(10,2))

+
Arrays

By setting the number of rows equal to 10 and number of columns equal to 2, we put the 20 values into a 10x2 matrix.
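
To make the "linear sequence is retained" point concrete, here is a small sketch. The names a and b are illustrative; note that the argument to array() is spelled dim.

```r
x <- c(1:20)

# 2 rows by 10 columns; values fill down each column first
a <- array(x, dim = c(2, 10))
print(a)
print(a[2, 1])   # 2: second value lands in row 2, column 1
print(a[1, 2])   # 3: third value starts column 2

# Same 20 values, now 10 rows by 2 columns
b <- array(x, dim = c(10, 2))
print(b[1, 2])   # 11: column 2 starts with the 11th value

# A single index still walks the original linear sequence
print(a[5])      # 5
print(b[5])      # 5
```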

+
Matrices

A matrix is an array, restricted to two dimensions

> x <- c(1:20)

> x <- array(x, dim=c(2,10))

That's it for matrices

Well, actually, you can access rows and columns in matrices and arrays

> ncol(x) # or nrow(x), to get the number of columns or rows

> mean(x[1,]) # or mean(x[,1]), to get the mean of a row or column
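
A brief sketch of row and column access. It uses matrix() rather than array() to build the same 2x10 object, and the name m is illustrative.

```r
# matrix() builds the same 2 x 10 layout as array(1:20, dim = c(2, 10))
m <- matrix(1:20, nrow = 2)
print(is.matrix(m))

n_cols <- ncol(m)            # 10
n_rows <- nrow(m)            # 2

# Row index comes first, column index second
row1_mean <- mean(m[1, ])    # row 1 holds 1, 3, ..., 19, so the mean is 10
col1_mean <- mean(m[, 1])    # column 1 holds 1 and 2, so the mean is 1.5
print(c(n_rows, n_cols, row1_mean, col1_mean))
```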

+
Factors

A factor stores categorical values

The distinct categories are called levels, and R keeps them in a fixed order

Examples of factors are the state you live in, or your college major, or eye color

To create a factor, use the factor function

> x <- factor(c("AZ","TX","DE","GA","MO","NY","WV","DE","AZ","MD"))

To see the levels,

> levels(x)
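
Here is a sketch of what the factor looks like once it is built. The calls to table() and as.integer() are additions to show how the levels are used.

```r
x <- factor(c("AZ", "TX", "DE", "GA", "MO", "NY", "WV", "DE", "AZ", "MD"))

# Levels are the distinct values, sorted alphabetically by default
x_levels <- levels(x)
print(x_levels)   # "AZ" "DE" "GA" "MD" "MO" "NY" "TX" "WV"

# Counts per level
x_counts <- table(x)
print(x_counts)

# Internally each value is stored as the position of its level
x_codes <- as.integer(x)
print(x_codes[1:3])   # 1 7 2 for AZ, TX, DE
```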

+
Data Frames

The MOST useful object in R!

Remember that a matrix only holds one type of data, usually numeric!

Thus, matrices are great for linear algebra

But what about factors? Can't put factors into matrices!

A data frame is a list of vectors

> typeof(iris)

> class(iris)

Each vector holds its own type of data

Every vector must have the same length

The terms data set and data frame are practically synonyms.
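
As an illustration of "a list of vectors", the sketch below builds a small data frame by hand. The data frame people and its columns are made up for this example.

```r
# Three vectors of the same length, each with its own type
people <- data.frame(
  name  = c("Ann", "Bob", "Cy"),
  age   = c(31, 45, 27),
  state = factor(c("AZ", "TX", "AZ")),
  stringsAsFactors = FALSE   # keep name as character in every R version
)

people_type <- typeof(people)        # "list": how it is stored
people_class <- class(people)        # "data.frame": how it is used
col_classes <- sapply(people, class) # one class per column
print(people_type)
print(people_class)
print(col_classes)

# Every column has the same length
print(sapply(people, length))
```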

+
Data Frames

When you read in a CSV file, it becomes a data frame.

The top row of data typically becomes the names of the columns, such as Petal.Length in the iris data frame.

Now, you're ready to apply lots of functions like,

sapply(dataset, function, na.rm=T)

Calculates values for all variables according to the chosen function

I think of sapply as going acrosS

tapply(numeric variable, categorical variable, function)

Applies the chosen function to a single column split into buckets

I think of tapply as a Tower of different floors
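
A sketch of both functions on the iris data frame. Only the four numeric columns are passed to sapply(), since taking the mean of the Species factor would produce a warning.

```r
# sapply: one mean per column, going across the numeric columns
col_means <- sapply(iris[, 1:4], mean, na.rm = TRUE)
print(col_means)

# tapply: one mean of Petal.Length per Species bucket
petal_by_species <- tapply(iris$Petal.Length, iris$Species, mean)
print(petal_by_species)
```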

+
Data Editor

Instead of manipulating your data in MS Excel, or LibreOffice, or whatever

You can edit your data frame in R using the fix() function

Depending on your operating system, you might get a text editor or a very simple spreadsheet

I don't recommend using fix() because of its lack of functionality, compared to MS Excel or Notepad++ or whatever else you use

+
Attributes

Everything in R is an object: vectors, functions, everything!

Objects can have attributes

Plain vectors have no attributes by default

Pretty much everything else does
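
The attributes() function shows what is attached to an object. This is a small sketch comparing a plain vector, a factor, and a data frame; the variable names are illustrative.

```r
# A plain vector carries no attributes
plain <- c(1, 3, 5)
plain_attr <- attributes(plain)
print(plain_attr)   # NULL

# A factor carries its levels and its class
f <- factor(c("AZ", "TX", "AZ"))
factor_attr <- attributes(f)
print(factor_attr)

# A data frame carries column names, row names, and its class
iris_attr_names <- names(attributes(iris))
print(iris_attr_names)
```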

+
Attributes

Attributes are important, especially when you use functions to create more objects.

For example, let's create a correlation matrix using the iris data set

+
Attributes

Note how we reference certain columns in this command

+
Referencing Columns

Takes the vector as a list of columns.

Makes a correlation matrix.

Takes the numbers 1 through 4 as a vector.

Pulls the first 4 columns of iris.
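
The command these notes describe was shown as an image on the slide and is not in this text. A command that matches the four notes, offered here as an assumption rather than the exact original, is cor(iris[, c(1:4)]):

```r
# c(1:4) takes the numbers 1 through 4 as a vector;
# iris[, ...] uses that vector as a list of columns to pull
first_four <- iris[, c(1:4)]

# cor() makes a correlation matrix from those columns
cor_matrix <- cor(first_four)
print(round(cor_matrix, 2))

# The result is a matrix, so it carries attributes
cor_attr <- attributes(cor_matrix)
print(cor_attr)

# Referencing the same columns by name gives the same result
by_name <- cor(iris[, c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width")])
same_result <- identical(cor_matrix, by_name)
print(same_result)
```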

+
Classes, Modes, and Types

The mode of a variable explains how it is stored in memory

The class of a variable explains how it can be used

In other words, the computer thinks in terms of modes and users think in terms of classes.

A data frame is actually a list in terms of how it is stored, but the list is wrapped in the class data.frame, which adds methods and attributes.

Types and modes are usually the same. But don't trust R to be consistent.

Sometimes the type is different and the mode and class are the same, such as numeric vectors, where the type is double but the mode and class are both numeric.
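
To compare the three side by side, here is a rough sketch. The helper function describe() is hypothetical and written only for this example.

```r
# Hypothetical helper: report type, mode, and (first) class of an object
describe <- function(obj) {
  c(typeof = typeof(obj), mode = mode(obj), class = class(obj)[1])
}

comparison <- rbind(
  number     = describe(3),      # double / numeric / numeric
  integer    = describe(3L),     # integer / numeric / integer
  text       = describe("a"),    # character in all three
  data_frame = describe(iris),   # list / list / data.frame
  func       = describe(sum)     # builtin / function / function
)
print(comparison)
```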

+
Missing Values

Sometimes in a data frame you will see the value NA.

This is a missing value. NA is an old statistics standard that has stayed with us.

The equivalent in relational databases is NULL.

Even though R reports missing values as NA, don't ever test

> x == NA

The comparison itself returns NA, never TRUE or FALSE.

Instead use the is.na() function
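
A short sketch of why the comparison fails and what to use instead. The vector x is made up for illustration.

```r
x <- c(4, NA, 7)

# Comparing with NA gives NA for every element, even the missing one
bad_test <- x == NA
print(bad_test)    # NA NA NA

# is.na() gives a usable TRUE/FALSE answer
good_test <- is.na(x)
print(good_test)   # FALSE TRUE FALSE

# Most summary functions return NA unless told to drop missing values
print(mean(x))                  # NA
clean_mean <- mean(x, na.rm = TRUE)
print(clean_mean)               # 5.5
```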

+
Homework Assignment

Install the package HSAUR2

After installing the package, load the package using the library command

After the library command, pull up the data set CHFLS using the data command

If you're unsure how to do this, try help()

What are the names of the columns of the CHFLS data set?

What is the class of CHFLS?

+
Homework Page 2

What is the average R_income for each R_edu?

Create a table of counts of rows by the values in the R_health column and the R_happy column. In other words, how many rows are in the data frame for each combination of distinct values in these two columns?

What is your working directory?

What files exist in your working directory?

Save the data frame CHFLS as a CSV file in your working directory. Show that CHFLS exists as a CSV file.
