
Leap into the 21st century

Artificial Intelligence

Recap

Basic data structures

Variables

- A named storage that a program can manipulate.
- Integer variable
- Numeric variable
- Double
- Character / String
- Boolean variable
- TRUE, FALSE

Vectors

- Sequence of data elements of the same type.
- Members of a vector are called components of a vector.
- Vector arithmetic
- Vector index
- Numeric index
- Logical index
- Name vector members
- Combining vectors

Operators

- Arithmetic operators: +, -, *, /, ^, %%, %/%
- Relational operators: <, >, <=, >=, ==, !=
- Logical operators: !, &, |, &&, ||
- Assignment operators: =, <-, ->, <<-, ->>
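
The recap items above can be tried out in a few lines of base R. This is a minimal sketch; all variable names and values are made up for illustration.

```r
# Variables of the basic types
n_int <- 5L         # integer
n_dbl <- 2.5        # numeric (double)
label <- "sample"   # character / string
flag  <- TRUE       # logical (Boolean)

# Vectors: all components share one type; c() creates and combines them
v <- c(10, 20, 30)
w <- c(v, 40)                    # combine vectors
names(v) <- c("a", "b", "c")     # name vector members

# Vector arithmetic works element by element
doubled <- v * 2

# Indexing: numeric, logical, and by name
first_two <- v[1:2]
big       <- v[v > 15]
by_name   <- v["b"]

# Arithmetic, relational, and logical operators
remainder <- 7 %% 2                         # modulo
quotient  <- 7 %/% 2                        # integer division
power     <- 2 ^ 3
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)  # vectorised AND
scalar_and  <- TRUE && FALSE                   # single-value AND
```

Note that `&` and `|` compare vectors element by element, while `&&` and `||` are meant for single TRUE/FALSE values, for example inside `if` conditions.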

Basic data structures

Matrix

- Collection of data elements arranged in a two-dimensional layout.
- Data elements must be of the same type.
- Create matrix
- Dimension of matrix
- Name rows and columns
- Access elements of matrix
- Modify matrix
- Add row or column
- Matrix arithmetic

Lists

- R object which contains elements of different types, like numbers, strings, vectors and another list inside it.
- It can also contain a matrix or a function as its elements.
- Create list
- Name list elements
- Type and class of list
- Access elements of list
- Modify list
- Data manipulation

Data frame

- Table or a two-dimensional array-like structure.
- Each column contains values of one variable.
- Each row contains one set of values from each column.
- Create data frame
- Dimension of data frame
- Name rows and columns
- Access elements of data frame
- Add column or instance
- Modify data frame
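
The three structures above can be sketched in base R as follows. The object names and contents are invented for illustration only.

```r
# Matrix: same-type elements in a two-dimensional layout (filled column by column)
m <- matrix(1:6, nrow = 2, ncol = 3)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))  # name rows and columns
m_dim <- dim(m)                 # dimension of the matrix
cell  <- m["r2", "c3"]          # access one element
m2    <- rbind(m, r3 = 7:9)     # add a row
m_sum <- m + m                  # element-wise matrix arithmetic

# List: elements may have different types, including another list
lst <- list(id = 1L, tags = c("x", "y"), inner = list(ok = TRUE))
lst$score <- 9.5                # modify the list by adding an element
tags <- lst[["tags"]]           # access an element by name

# Data frame: columns are variables, rows are instances
df <- data.frame(name = c("Ann", "Bob"), age = c(31, 45),
                 stringsAsFactors = FALSE)
df$senior <- df$age > 40        # add a column
df <- rbind(df, data.frame(name = "Cy", age = 28, senior = FALSE,
                           stringsAsFactors = FALSE))  # add an instance (row)
df_dim <- dim(df)
```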
Data sources
Loading data into R can be quite frustrating. Almost every single type of file that
you want to get into R seems to require its own function, and even then you
might get lost in the functions’ arguments. In short, it can be fairly easy to mix
up things from time to time, whether you are a beginner or a more advanced R
user.

Flat Files

- Tabular Data (.txt)
- CSV
- TSV

Excel Files

- Excel data (.xlsx)
- Sheets within data

Web

- HTML & CSS
- Auto-web-scrape tool

Databases

- Connection
- Query
Import Data

Checklist to make sure correct import

● If you work with spreadsheets, the first row is usually reserved for the header, while the first column is used to identify the sampling unit;

● Avoid names, values or fields with blank spaces, otherwise each word will be interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set;

● If you want to concatenate words, insert a . between the two words instead of a space;

● Short names are preferred over longer names;

● Try to avoid using names that contain symbols such as ?, $, %, ^, &, *, (, ), -, #, the comma, <, >, /, |, \, [, ], {, and };

● Delete any comments that you have made in your Excel file to avoid extra columns or NAs being added to your file; and

● Make sure that any missing values in your data set are indicated with NA.
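
When the source file cannot be cleaned by hand, part of this checklist can be applied at import time. The sketch below uses base R only; the column names, the file contents, and the "missing" marker are invented for illustration.

```r
# make.names() turns arbitrary labels into syntactically valid R names,
# replacing spaces and symbols with "."
raw_names   <- c("Sample ID", "weight (kg)", "height")
clean_names <- make.names(raw_names)

# A tiny CSV with a space in a header and a non-standard missing-value marker
tmp <- tempfile(fileext = ".csv")
writeLines(c("Sample ID,height", "s1,170", "s2,missing"), tmp)

# read.csv() repairs header names by default (check.names = TRUE);
# na.strings lists every marker that should be read as NA
dat <- read.csv(tmp, na.strings = c("", "NA", "missing"))
```

After the import, `names(dat)` shows the repaired header and `is.na(dat$height)` flags the value that was marked as missing.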
Flat Files


Path setting: file.path("FOLDER", "FileName.xyz") builds the path to a file. The first argument is the directory and the second is the file name.

read.table(): Use for any type of tabular data. Specify the separator (delimiter) with sep; by default, any whitespace separates fields.

read.csv(): Specifically for CSV files; a comma is the default separator.

read.delim(): Specifically for TSV files; a tab is the default separator.

Other parameter settings (argument names are case-sensitive):

● header
● fileEncoding
● stringsAsFactors
● col.names
● row.names
● fill
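
The three readers can be compared side by side on small files written to a temporary directory. This is a minimal sketch; the file names and contents are made up for illustration.

```r
# Build paths with file.path(): directory first, file name second
dir_path <- tempdir()
csv_path <- file.path(dir_path, "people.csv")
tsv_path <- file.path(dir_path, "people.tsv")
txt_path <- file.path(dir_path, "people.txt")

# Write the same small table in three flat-file layouts
writeLines(c("name,age", "Ann,31", "Bob,45"), csv_path)
writeLines(c("name\tage", "Ann\t31", "Bob\t45"), tsv_path)
writeLines(c("Ann;31", "Bob;45"), txt_path)   # no header, ";" as separator

# read.csv(): comma separator and header = TRUE by default
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.delim(): tab separator and header = TRUE by default
from_tsv <- read.delim(tsv_path, stringsAsFactors = FALSE)

# read.table(): general reader, so separator, header, and names are spelled out
from_txt <- read.table(txt_path, sep = ";", header = FALSE,
                       col.names = c("name", "age"),
                       stringsAsFactors = FALSE)
```

All three calls should yield a data frame with the columns name and age.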
Excel Data

Excel files

readxl: Library for reading Excel files.

read_excel(): The only required argument is the file path. To read a specific sheet, pass the sheet argument, which can be a sheet number (integer) or a sheet name (character).

● No external dependency on, e.g., Java or Perl.
● Re-encodes non-ASCII characters to UTF-8.
● Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
● Discovers the minimal data rectangle and returns that, by default. The user can exert more control with range, skip, and n_max.
● Column names and types are determined from the data in the sheet, by default. The user can also supply them via col_names and col_types and control name repair via .name_repair.
● Returns a tibble, i.e. a data frame with an additional tbl_df class. Among other things, this provides nicer printing.
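
A short sketch of these calls, assuming the readxl package is installed. It reads the example workbook that ships with readxl (located with readxl_example()), so no file of your own is needed; replace the path with your own .xlsx file in practice.

```r
library(readxl)

# Path to a workbook bundled with readxl for demonstration purposes
path <- readxl_example("datasets.xlsx")

# List the sheets contained in the workbook
sheet_names <- excel_sheets(path)

# Only the path is required; the first sheet is read by default
first_sheet <- read_excel(path)

# Select a sheet by name (character) or by position (integer)
by_name <- read_excel(path, sheet = "mtcars")
by_pos  <- read_excel(path, sheet = 2)

# Restrict the import to a cell range: header row plus five data rows
part <- read_excel(path, sheet = "mtcars", range = "A1:C6")
```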
Web Scraping

Web

Data scraping, also known as web scraping, is the process of importing information from a website into a spreadsheet or local
file saved on your computer. It's one of the most efficient ways to get data from the web.

Scraping data from the web requires knowledge of HTML and CSS; the alternative is an auto-web-scrape tool such as import.io.

Web scraping is almost always unique from use case to use case, because every website is different, updates occur, and things can change.

In R:

library(rvest)
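
A minimal sketch of how rvest uses CSS selectors, assuming rvest 1.0 or later is installed. To stay self-contained it parses an inline HTML snippet (invented for illustration) instead of a live site; for a real page, read_html() with a URL would take the place of minimal_html().

```r
library(rvest)

# A small HTML document defined inline, standing in for a downloaded page
page <- minimal_html('
  <h1 class="title">Fruit prices</h1>
  <table id="prices">
    <tr><th>fruit</th><th>price</th></tr>
    <tr><td>apple</td><td>1.2</td></tr>
    <tr><td>pear</td><td>0.9</td></tr>
  </table>
')

# CSS selectors pick out elements: tag.class for the heading, #id for the table
title  <- html_text2(html_element(page, "h1.title"))
prices <- html_table(html_element(page, "#prices"))
```

The selectors are the part that has to be rewritten for each website, which is why scraping code rarely transfers from one site to another.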
Databases


You only need a database connection when:

● Your data is already in a database.
● You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.

(If your data fits in memory, there is no advantage to putting it in a database; it will only be slower and more frustrating.)

If you are using R to do data analysis inside a company, most of the data you need probably already lives in a database.

To connect from R, you need to install a specific backend package for the database that you want to connect to.

Five commonly used backends are:

● RMySQL connects to MySQL and MariaDB.
● RPostgreSQL connects to Postgres and Redshift.
● RSQLite embeds a SQLite database.
● odbc connects to many commercial databases via the open database connectivity protocol.
● bigrquery connects to Google’s BigQuery.
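
The connect-then-query pattern can be tried without a database server by using RSQLite with an in-memory database. This is a sketch that assumes the DBI and RSQLite packages are installed; the table and its contents are invented for illustration.

```r
library(DBI)

# Connection: an in-memory SQLite database that disappears on disconnect
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Put a small table into the database so there is something to query
people <- data.frame(name = c("Ann", "Bob", "Cy"), age = c(31, 45, 28),
                     stringsAsFactors = FALSE)
dbWriteTable(con, "people", people)
tables <- dbListTables(con)

# Query: send SQL and get the result back as a data frame
older <- dbGetQuery(con, "SELECT name, age FROM people WHERE age > 30 ORDER BY age")

# Always release the connection when finished
dbDisconnect(con)
```

With another backend, only the dbConnect() call should need to change; the remaining DBI calls stay the same.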
Connect to Oracle database

There are packages that either connect via ODBC but do not provide support for DBI, or offer DBI support but connect via
JDBC. The odbc package, in combination with a driver, satisfies both requirements.

Another package that provides both ODBC connectivity and DBI support is ROracle. The current version of dbplyr in CRAN
does not yet fully support a connection coming from ROracle, but we are working on it.

Connection Settings

There are six settings needed to make a connection:

Driver - See the Drivers section for more setup information

Host - A network path to the database server

SVC - The name of the schema

UID - The user’s network ID or server local account

PWD - The account’s password

Port - Should be set to 1521
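
The six settings map onto arguments of DBI::dbConnect() when the odbc package is used. The sketch below wraps the call in a hypothetical helper, connect_oracle(), so that nothing is executed until real values are supplied. It assumes DBI, odbc, and an Oracle ODBC driver are installed; which keywords are accepted ultimately depends on that driver.

```r
# Hypothetical helper: collects the six settings and opens the connection.
# Nothing connects until the function is called with real values.
connect_oracle <- function(driver, host, svc, uid, pwd, port = 1521) {
  DBI::dbConnect(
    odbc::odbc(),
    Driver = driver,  # name of the installed ODBC driver
    Host   = host,    # network path to the database server
    SVC    = svc,     # name of the schema
    UID    = uid,     # user's network ID or server local account
    PWD    = pwd,     # the account's password
    Port   = port     # 1521 unless your administrator says otherwise
  )
}

# Example call with placeholder values (not real settings):
# con <- connect_oracle("<driver name>", "<server path>", "<schema name>",
#                       Sys.getenv("DB_UID"), Sys.getenv("DB_PWD"))
# DBI::dbDisconnect(con)
```

Reading the user ID and password from environment variables, as in the commented call, keeps credentials out of the script; the variable names DB_UID and DB_PWD are assumptions.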


Connect to Oracle database

Install the package (from a shell, if you use conda): conda install -c conda-forge r-rodbc

Load it in R: library(RODBC)

In Windows, you create the DSN (data source name) using the ODBC Data Source Administrator. This tool can be found in the Control Panel.

In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources.

To set one up, click Add. In the dialog box that appears, select the appropriate driver (Oracle in OraDB12Home1) and click the Finish button.

A Driver Configuration box opens. For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional.

“TNS Service Name” is the name that you (or your company's database administrator) assigned when configuring the Oracle database, and “User ID” is the ID that you use with the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID
already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection
Successful” message. If not, check the Service Name, User ID, and Password.
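
Once the DSN test succeeds, the same name can be used from R. The sketch below assumes the RODBC package is installed and wraps the calls in a hypothetical helper, query_via_dsn(), so that nothing runs until a real DSN and credentials are supplied.

```r
# Hypothetical helper: open a connection through a DSN, run one query, close.
query_via_dsn <- function(dsn, uid, pwd, sql) {
  ch <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect() returns -1 instead of a connection object on failure
  if (!inherits(ch, "RODBC")) {
    stop("Connection failed: check the DSN, user ID, and password.")
  }
  # Close the connection when the function exits, even after an error
  on.exit(RODBC::odbcClose(ch), add = TRUE)
  RODBC::sqlQuery(ch, sql)
}

# Example call with placeholder values (the DSN name is whatever you typed
# into the "Data Source Name" field):
# rows <- query_via_dsn("<data source name>", "<user id>", "<password>",
#                       "SELECT 1 FROM dual")
```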
Lab 5: Import Data

Lab 4:

Flat Files

- Tabular data
- CSV
- TSV

Excel Files

- Sheets in Excel
- Data chunks

Web

- Auto-web-scrape
- rvest

Databases

- Connecting
- Querying
