
Introduction to Statistical Analysis

Xuhu Wan*

March 4, 2018

Contents

1 Introduction

1.1 Big Data

1.2 Importance of Big Data

1.3 Big Data Changing Financial Trading

1.4 Learning Statistics to Become a Data Scientist

1.5 Why Python

2.1 Import Data

2.2 Slicing DataFrame

2.2.1 Selection by label

2.2.2 Selection by position

2.2.3 Price difference and return

2.2.4 Moving average

2.3 Advanced Topic: Download Historical Stock Prices

3.1 Trend-following using moving average

3.2 Making the strategy better with parameter tuning

3.3 Training set and test set

4.1 Can the stock market be predicted

4.2 Build our first prediction model

4.3 Evaluating your model is more important

5.1 Population and Sample

5.2 Parameter and Statistics

5.3 Identify Population and Sample

6.1 Variables, Cases and DataFrame

6.1.1 Kinds of Variables

6.2 Exploratory Data Analysis

6.2.1 Numerical Descriptive Statistics

6.2.2 Data Visualization

6.2.3 Relative Frequency, Distribution and Shape of Distribution

9 Sampling Distribution


Part I

What Data Scientists Do


1 Introduction

1.1 Big Data

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it's not the amount of data that's important. It's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

While the term "big data" is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. In the early 2000s, industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:

• Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would've been a problem – but new technologies (such as Hadoop) have eased the burden.

• Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.

• Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.


Two additional dimensions reveal the value of big data:

• Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage – even more so with unstructured data.

• Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it's necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.

1.2 Importance of Big Data

The importance of big data doesn't revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable cost reductions, time reductions, new product development and optimized offerings, and smarter decision making.

When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

• Generating coupons at the point of sale based on the customer's buying habits.

With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it's important to understand customers and boost their satisfaction, it's equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.

When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big data, governments must also address issues of transparency and privacy.


Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.

1.3 Big Data Changing Financial Trading

Big data analytics can be used in predictive models to estimate the rates of return and probable outcomes on investments. Increasing access to big data results in more precise predictions, and thus the ability to more effectively mitigate the inherent risks associated with financial trading.

High frequency trading (HFT) has been used quite successfully up until now, with machines trading independently of human input at rapid speeds and frequencies that people cannot match. The business archetype incorporates the best possible prices, traded at specific times, and reduces manual errors that arise from behavioural influences. Here is a video link to high frequency trading.

Real-time analytics has the potential to improve the investing power of HFT firms and individuals alike, as the insights gleaned from algorithmic analysis have levelled the playing field, providing everyone with access to powerful information.

The power of algorithmic trading lies in its almost limitless capabilities. Both structured and unstructured data can be used, so social media, stock market information and news analysis can all feed into intuitive judgements. This situational sentiment analysis is highly valuable, as the stock market is an easily influenced archetype.

The full potential of this technology hasn't yet been realized, and the prospects for the application of these innovations are immeasurable. Machine learning enables computers to actually learn and make decisions based on new information, by learning from past mistakes and employing logic. In this way, these techniques can deliver supremely accurate perceptions. Although the technology is still developing, the possibilities are promising. This particular avenue of research removes the human emotional response from the model and makes decisions based on information without bias.


1.4 Learning Statistics to Become a Data Scientist

Data Science itself is a combination of three fields: statistics, computer science and domain knowledge.

In getting a broader perspective, you gain the ability not only to implement the models but to understand how they connect and relate to the deeper logic behind them.

The statistics that are immediately useful to data science typically fall into one of two categories:

• statistical inference

• model fitting.

In regards to inference:

• Parameter Estimation

• Hypothesis Testing

• Bayesian Analysis

• Identifying the best estimator

In regards to model fitting:

• Linear Regression

• Non-linear Regression

• Categorical Data Analysis / Classification

• Time Series / Longitudinal Analysis

• Machine Learning

1.5 Why Python

Before we start, I'd like to tell you why I use Python for data analytics. I will try to convince you that Python is really the best tool for most of the tasks involved.

Ideally, I would like to learn only one language that is suited for all kinds of work: number crunching, application building, web development, interfacing with APIs, etc. This language would be easy to learn, its code would be compact and clear, and it would run on any platform. It would enable me to work interactively, letting the code evolve as I write it, and it would be at least free as in speech. Most importantly, I care much more about my own time than the CPU time of my PC, so number-crunching performance is less important to me than my own productivity.

If we match different languages with positions on employment websites: (comparison figure not reproduced here)


Python, like most open source software, has one specific characteristic: it can be challenging for a beginner to find their way around thousands of libraries and tools. This guide will help you get everything you need in your quant toolbox, hopefully without any problems. Fortunately, there are several distributions containing most of the required packages, making installation a breeze. The best distribution, in my opinion, is Anaconda from Continuum Analytics.

The Anaconda distribution includes:


• Jupyter notebook: interactive shell & notebook

• numpy & scipy: scientific computing tools, similar to Matlab

• pandas: data structures library

• many more scientific and utility packages.


2.1 Import Data

In this section, we will learn to load data into a Jupyter notebook.

The folder "data" contains historical stock data for Facebook. To work on this data, we first need to load it into the notebook and store it in a special format, "DataFrame", which is an important and advanced data type defined in the Python module pandas.

To use pandas we need to import it.

import pandas as pd

"pd" is a short name which can be self-defined. Next we will import the data, "facebook.csv". csv is an abbreviation of "Comma-Separated Values", which means that the different values saved in "facebook.csv" are separated by commas.

fb = pd.read_csv('data/facebook.csv', index_col=0, parse_dates=True)

We read the data from "facebook.csv" using the pandas function read_csv. (Older pandas versions offered pd.DataFrame.from_csv, which has since been removed; read_csv with index_col=0 and parse_dates=True is the equivalent.) All methods in Python are followed by "()". The data is saved with a special data type/format/class, "DataFrame", and we give this DataFrame the name "fb". To check the type of "fb", we can

print(type(fb))

<class 'pandas.core.frame.DataFrame'>

A class such as DataFrame:

• has its own specialized methods, e.g. DataFrame has the methods head(), tail(), describe(), which are followed by "( )"

• has its own features (attributes), e.g. DataFrame has the features index, columns, shape, which are not followed by "( )"
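The distinction is easy to see on a tiny hand-made DataFrame (a toy example, not the Facebook data):

```python
import pandas as pd

# A tiny DataFrame built by hand, just to contrast methods and features
toy = pd.DataFrame({'Open': [1.0, 2.0, 3.0], 'Close': [1.5, 2.5, 3.5]})

print(toy.shape)          # feature: no "( )", gives (3, 2)
print(list(toy.columns))  # feature: ['Open', 'Close']
print(toy.head(2))        # method: called with "( )", returns the first 2 rows
```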

fb.head()


                 Open       High        Low      Close  Adj Close   Volume
Date
2014-12-31  20.400000  20.510000  19.990000  20.049999  19.459270  4157500
2015-01-02  20.129999  20.280001  19.809999  20.129999  19.536913  2842000
2015-01-05  20.129999  20.190001  19.700001  19.790001  19.206934  4948800
2015-01-06  19.820000  19.840000  19.170000  19.190001  18.624611  4944100
2015-01-07  19.330000  19.500000  19.080000  19.139999  18.576082  8045200

Using head(), we can check the columns, the index, and the starting date if it is time series data. Now we know that our Facebook data starts on Dec 31, 2014, with 6 columns/variables: "Open", "High", "Low", "Close", "Adj Close", "Volume".

Python uses 0-based indexing: "Open" is column 0, "High" is column 1.

The "column" before column 0 is not a variable; it is the index of the DataFrame "fb".

We can also use tail() to check the bottom of the DataFrame.

fb.tail()

                  Open        High         Low       Close   Adj Close    Volume
Date
2018-01-30  241.110001  246.419998  238.410004  242.720001  242.720001  14270800
2018-01-31  245.770004  249.270004  244.449997  245.800003  245.800003  11964400
2018-02-01  238.520004  246.899994  238.059998  240.500000  240.500000  12980600
2018-02-02  237.000000  237.970001  231.169998  233.520004  233.520004  17961600
2018-02-05  227.000000  233.229996  205.000000  213.699997  213.699997  28869000

Now we know that the end date is Feb 05, 2018. Each row is the information of one trading day. If we want to know how many trading days there are in total, we can check

fb.shape

(780, 6)

"shape" is a feature of DataFrame. It is a tuple: the first element states how many rows, and the second element states how many columns/variables. We can also print the column names:

fb.columns


fb.index

DatetimeIndex(['2014-12-31', '2015-01-02', '2015-01-05', '2015-01-06',
               '2015-01-07', '2015-01-08', '2015-01-09', '2015-01-12',
               '2015-01-13', '2015-01-14',
               ...
               '2018-01-23', '2018-01-24', '2018-01-25', '2018-01-26',
               '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
               '2018-02-02', '2018-02-05'],
              dtype='datetime64[ns]', name='Date', length=780, freq=None)

2.2 Slicing DataFrame

Slicing data is an extremely important skill if you want to work on real and big data. It has two tasks: indexing and selecting data.

The axis labeling information in a DataFrame serves many purposes:

• It identifies data (i.e. provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display.

• It enables automatic and explicit data alignment.

• It allows intuitive getting and setting of subsets of the data set.

In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of a pandas DataFrame. The primary focus will be on DataFrame and Series (a column of a DataFrame), as they have received more development attention in this area.

We have two major approaches to selecting data: selection by label and selection by position.

2.2.1 Selection by label

The indexer .loc[] is primarily label based, and will raise a KeyError when the items are not found.

price = fb.loc['2014-12-31':'2015-1-21', 'Close']
price


Date

2014-12-31 20.049999

2015-01-02 20.129999

2015-01-05 19.790001

2015-01-06 19.190001

2015-01-07 19.139999

2015-01-08 19.860001

2015-01-09 19.940001

2015-01-12 19.690001

2015-01-13 19.660000

2015-01-14 19.740000

2015-01-15 19.600000

2015-01-16 19.959999

2015-01-20 20.020000

2015-01-21 20.299999

Name: Close, dtype: float64

print(type(price))

<class 'pandas.core.series.Series'>

price.loc['2015-01-21']

20.299999

2.2.2 Selection by position

The DataFrame indexer .iloc[] is primarily integer position based (from 0 to length-1 of the axis).

fb.iloc[0:5, 0:2]

Open High

Date

2014-12-31 20.400000 20.510000

2015-01-02 20.129999 20.280001

2015-01-05 20.129999 20.190001

2015-01-06 19.820000 19.840000

2015-01-07 19.330000 19.500000


import matplotlib.pyplot as plt
import seaborn as sb

fb2015 = fb.loc['2015-01-01':'2015-12-31', 'Close']
fb2015.plot()
plt.show()

If we only want to get a column or multiple columns, we can do the following:

a_column = fb['Close']
multiple_columns = fb[['Open', 'Close']]

2.2.3 Price difference and return

We can add a new column for the price difference.

fb['PriceDiff'] = fb['Close'].shift(-1) - fb['Close']


shift() is a method of DataFrame and Series. shift(-1) gives the value of a variable one day ahead; shift(1) gives the value one day ago.

In the computation above, we can see that the difference of two columns is computed as the difference of the pairs of numbers in each row. This is a very nice property of DataFrame: element-wise operation.
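The element-wise behaviour of shift() and subtraction can be checked on a toy Series (hypothetical numbers, not the Facebook prices):

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 13.0, 12.0])

# shift(-1) aligns each row with the next day's value; the subtraction
# is then performed element-wise, row by row
diff = s.shift(-1) - s
print(diff.tolist())  # [1.0, 2.0, -1.0, nan] -- the last day has no "tomorrow"
```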

fb.head()

                 Open       High        Low      Close  Adj Close   Volume  \
Date
2014-12-31  20.400000  20.510000  19.990000  20.049999  19.459270  4157500
2015-01-02  20.129999  20.280001  19.809999  20.129999  19.536913  2842000
2015-01-05  20.129999  20.190001  19.700001  19.790001  19.206934  4948800
2015-01-06  19.820000  19.840000  19.170000  19.190001  18.624611  4944100
2015-01-07  19.330000  19.500000  19.080000  19.139999  18.576082  8045200

            PriceDiff
Date
2014-12-31   0.080000
2015-01-02  -0.339998
2015-01-05  -0.600000
2015-01-06  -0.050002
2015-01-07   0.720002

The daily return is defined as

r_t = (S_{t+1} − S_t) / S_t

fb['Return'] = (fb['Close'].shift(-1) - fb['Close']) / fb['Close']
fb.head()


                 Open       High        Low      Close  Adj Close   Volume  \
Date
2014-12-31  20.400000  20.510000  19.990000  20.049999  19.459270  4157500
2015-01-02  20.129999  20.280001  19.809999  20.129999  19.536913  2842000
2015-01-05  20.129999  20.190001  19.700001  19.790001  19.206934  4948800
2015-01-06  19.820000  19.840000  19.170000  19.190001  18.624611  4944100
2015-01-07  19.330000  19.500000  19.080000  19.139999  18.576082  8045200

            PriceDiff    Return
Date
2014-12-31   0.080000  0.003990
2015-01-02  -0.339998 -0.016890
2015-01-05  -0.600000 -0.030318
2015-01-06  -0.050002 -0.002606
2015-01-07   0.720002  0.037618

2.2.4 Moving average

In statistics, a moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set.

For example, we can calculate the average prices of the last 40 days and the last 200 days.

fb['MA40'] = fb['Close'].rolling(40).mean()
fb['MA200'] = fb['Close'].rolling(200).mean()
fb.head()

                 Open       High        Low      Close  Adj Close   Volume  \
Date
2014-12-31  20.400000  20.510000  19.990000  20.049999  19.459270  4157500
2015-01-02  20.129999  20.280001  19.809999  20.129999  19.536913  2842000
2015-01-05  20.129999  20.190001  19.700001  19.790001  19.206934  4948800
2015-01-06  19.820000  19.840000  19.170000  19.190001  18.624611  4944100
2015-01-07  19.330000  19.500000  19.080000  19.139999  18.576082  8045200

            PriceDiff    Return  MA40  MA200
Date
2014-12-31   0.080000  0.003990   NaN    NaN
2015-01-02  -0.339998 -0.016890   NaN    NaN
2015-01-05  -0.600000 -0.030318   NaN    NaN
2015-01-06  -0.050002 -0.002606   NaN    NaN
2015-01-07   0.720002  0.037618   NaN    NaN


fb['MA40'].plot()
fb['MA200'].plot()
plt.show()
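The mechanics of rolling() are easy to verify on a toy Series (hypothetical numbers): a window of size 3 only produces a value once 3 observations are available, which is why the first rows of "MA40" and "MA200" above are NaN.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
ma3 = s.rolling(3).mean()
print(ma3.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```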

2.3 Advanced Topic: Download Historical Stock Prices

In this section, we will demonstrate how to download stock price or stock index data from Yahoo or Google Finance. You need the package "pandas_datareader". If you have not installed it, you need to install it on your laptop before importing it. (Installation only needs to be done once; "import" has to be done every time you open a notebook.)

!pip install pandas_datareader

import datetime
import pandas_datareader.data as web

# datetime.datetime is a data type within the datetime module
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2018, 2, 5)


df = web.DataReader("msft", 'yahoo', start, end)

Then we can save it into a data file stored locally on your laptop:

df.to_csv("data/microsoft.csv")


3.1 Trend-following using moving average

We need the following modules and headers for this section.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn

We import the stock history and compute moving averages with long and short windows.

ms = pd.read_csv('data/microsoft.csv', index_col=0, parse_dates=True)
ms['MA10'] = ms['Close'].rolling(10).mean()
ms['MA50'] = ms['Close'].rolling(50).mean()

plt.figure(figsize=(10, 5))  # to define the size of the graph
ms['Close'].plot(legend=True, color='yellow')
ms['MA10'].plot(legend=True, color='red')    # fast signal
ms['MA50'].plot(legend=True, color='green')  # slow signal
plt.show()

We will go long (buy) when "MA10" is above "MA50" and do nothing otherwise. Each time we will hold 1 share of Microsoft. We need to define a variable (column) with the name "Shares". We will use a list comprehension to do this job; list comprehension is an extremely useful technique for data analysis. We define a new list:


newlist = [1 if ms.loc[ei, 'MA10'] > ms.loc[ei, 'MA50'] else 0 for ei in ms.index]

The list comprehension computes the shares by iterating through the whole index of 'ms'. Then we can define a column of 'ms' using this list. List comprehensions are very useful when preprocessing data.

For example,

alist = [1, 2, 3, 4, 5]
squarelist = [x**2 for x in alist]
squarelist

[1, 4, 9, 16, 25]

More than that, we can also transform the values in alist following a more complicated rule: if x > 3, 'Pass'; else, 'Fail'.

pflist = ['Pass' if x > 3 else 'Fail' for x in alist]
pflist

['Fail', 'Fail', 'Fail', 'Pass', 'Pass']

ms['Shares'] = newlist

Equivalently, in a single line:

ms['Shares'] = [1 if ms.loc[ei, 'MA10'] > ms.loc[ei, 'MA50'] else 0 for ei in ms.index]

Since we use the close price to compute the signal, this is equivalent to saying that we evaluate our signals just before the market closes and then decide whether to buy or sell. If ms["Shares"] is 1, we buy one share of stock, and the profit is tomorrow's close price minus today's close price. Otherwise we short sell one share of stock, and the profit is today's close price minus tomorrow's close price. We need to define a new variable, "Profit".

ms['Close1'] = ms['Close'].shift(-1)


ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]
ms['Profit'].plot()
plt.axhline(y=0, color='red')
plt.show()

It is not clear whether we make money or lose money. Hence we need to compute the cumulative wealth.

ms['Profit'].cumsum().plot()


It seems that we make some money, but you may go bankrupt before you get rich. Later we will see that this strategy is not good, because the maximum drawdown risk is high.

3.2 Making the strategy better with parameter tuning

We can change the window sizes of the signals to see if we can get better results. To make this process convenient, we copy and paste the code above into a single cell.

fast = 40
slow = 50
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]
ms['Profit'].cumsum().plot()


3.3 Training set and test set

We get much better results by tuning the two key parameters of this strategy. But there is a big trap here: the dataset used to adjust the parameters should be different from the dataset you use to evaluate the performance. The first dataset is called the training set, and the second one is called the test set.

fast = 40
slow = 50
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]

Train = ms['Profit'].iloc[:-200]
Test = ms['Profit'].iloc[-200:]
Train.cumsum().plot()


You can only determine your best parameters on the training set. After the final choice of parameters is made, use them on the test set. This mimics the real trading process: we train our models on historical data and apply the best model tomorrow.
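A minimal sketch of tuning on the training rows only might look like this. It is run here on a synthetic stand-in for the Microsoft DataFrame (the real `ms` comes from "microsoft.csv"), and the candidate window lists are arbitrary choices:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ms DataFrame built above (random-walk prices)
rng = np.random.default_rng(0)
ms = pd.DataFrame({'Close': 50 + rng.standard_normal(400).cumsum()})
ms['Close1'] = ms['Close'].shift(-1)

best = None
for fast in [10, 40, 140]:
    for slow in [50, 160]:
        if fast >= slow:
            continue
        maf = ms['Close'].rolling(fast).mean()
        mas = ms['Close'].rolling(slow).mean()
        # long one share when the fast signal is above the slow one, else flat
        profit = (ms['Close1'] - ms['Close']).where(maf > mas, 0.0)
        score = profit.iloc[:-200].sum()   # score on the training rows ONLY
        if best is None or score > best[0]:
            best = (score, fast, slow)

print(best)  # (training profit, fast, slow) of the winning pair
```

The final (fast, slow) pair is then evaluated once on the held-out test rows.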

Test.cumsum().plot()


If you use only the training set to adjust the parameters, you may choose a different set of numbers.

fast = 140
slow = 160
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]

Train = ms['Profit'].iloc[:-200]
Test = ms['Profit'].iloc[-200:]
Train.cumsum().plot()

Figure 14: Performance in train data with parameters tuned in train data.


Figure 15: Performance in test data with parameters tuned in train data

Hence the best parameters (model) on the training data may not be the best parameters for the test data.


import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn

4.1 Can the stock market be predicted

The efficient-market hypothesis (EMH) is a theory in financial economics which states that asset prices fully reflect all available information. A direct implication is that it is impossible to predict the price changes of stocks.

We should not conclude that the market is efficient because of the EMH; the EMH is just a definition of market efficiency. The market is not efficient at any time scale.

But predicting the stock market is extremely hard, because we are trying to build a prediction system that is robust against many impacts from outside the market which change stochastically across time – in other words, external variables.

Many prediction models are based only on price paths, and these are doomed to failure. A successful model should:

• use all information available: price, volume, etc.

• use a simple model structure, robust to external variables, known or unknown

• ...

4.2 Build our first prediction model

We will build a prediction model for the price change of Microsoft.

ms = pd.read_csv('data/microsoft.csv', index_col=0, parse_dates=True)
ms.head()

                 Open       High        Low      Close  Adj Close    Volume
Date
2014-12-31  46.730000  47.439999  46.450001  46.450001  42.848763  21552500
2015-01-02  46.660000  47.419998  46.540001  46.759998  43.134731  27913900
2015-01-05  46.369999  46.730000  46.250000  46.330002  42.738068  39673900
2015-01-06  46.380001  46.750000  45.540001  45.650002  42.110783  36447900
2015-01-07  45.980000  46.459999  45.490002  46.230000  42.645817  29114100

Our target is tomorrow's price change:

Y = Close_{t+1} − Close_t

ms['Y'] = ms['Close'].shift(-1) - ms['Close']

We will use the following predictors (independent variables):

• X1, X2: change of close price, today and yesterday

• X3, X4: difference between high and low, today and yesterday

• X5, X6: difference between high and close, today and yesterday

• X7, X8: change of volume, today and yesterday


ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)

First we need to delete the NaN values.

ms = ms.dropna(axis=0)
predictors = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']

from sklearn import linear_model
from sklearn.metrics import mean_squared_error

myModel = linear_model.LinearRegression()
myModel.fit(ms[predictors], ms['Y'])
ms['Y_predict'] = myModel.predict(ms[predictors])

ms.head()


                 Open       High        Low      Close  Adj Close    Volume  \
Date
2015-01-05  46.369999  46.730000  46.250000  46.330002  42.738068  39673900
2015-01-06  46.380001  46.750000  45.540001  45.650002  42.110783  36447900
2015-01-07  45.980000  46.459999  45.490002  46.230000  42.645817  29114100
2015-01-08  46.750000  47.750000  46.720001  47.590000  43.900375  29645200
2015-01-09  47.610001  47.820000  46.900002  47.189999  43.531395  23942800

Y X1 X2 X3 X4 X5 \

Date

2015-01-05 -0.680000 -0.429996 0.309997 0.480000 0.879997 0.399998

2015-01-06 0.579998 -0.680000 -0.429996 1.209999 0.480000 1.099998

2015-01-07 1.360000 0.579998 -0.680000 0.969997 1.209999 0.229999

2015-01-08 -0.400001 1.360000 0.579998 1.029999 0.969997 0.160000

2015-01-09 -0.590001 -0.400001 1.360000 0.919998 1.029999 0.630001

X6 X7 X8 Y_predict

Date

2015-01-05 0.660000 11760000.0 6361400.0 0.183903

2015-01-06 0.399998 -3226000.0 11760000.0 0.233266

2015-01-07 1.099998 -7333800.0 -3226000.0 0.022806

2015-01-08 0.229999 531100.0 -7333800.0 -0.009937

2015-01-09 0.160000 -5702400.0 531100.0 0.032701

Y = β0 + β1 X1 + β2 X2 + ... + β7 X7 + β8 X8 + e

myModel.intercept_

0.13597169031644135

myModel.coef_

-6.67068059e-02, 4.04741435e-01, 8.95326239e-02,

3.03771492e-09, 2.64468840e-09])
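Note that `mean_squared_error`, imported above, has not been used yet; it measures how well the fitted model matches the data. A self-contained sketch on synthetic data (hypothetical numbers, not the Microsoft predictors):

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data: y depends linearly on two predictors plus a little noise
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * rng.standard_normal(200)

model = linear_model.LinearRegression()
model.fit(X, y)

# In-sample MSE: computed on the same data the model was fit on,
# so it flatters the model
mse = mean_squared_error(y, model.predict(X))
print(mse)
```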

Can we generate a profit using our prediction model?

• If Y_predict > 0, we buy today and sell tomorrow.

• Otherwise, we remain unchanged.


ms['Profit'] = [ms.loc[t, 'Y'] if ms.loc[t, 'Y_predict'] > 0 else 0 for t in ms.index]

The total profit is

ms['Profit'].sum()

66.28998699999993

Bingo: we make money using our model, and more money than with the trend-following strategy.

plt.plot(ms['Profit'])

If we plot the profit, we see that we in fact lose money on some days, but the number of days we win is greater than the number of days we lose.

plt.plot(ms['Profit'].cumsum())


(0.08531529858429848, 0.6727636361993468)

Hence our average daily profit is about 8 cents. If the transaction cost is less than 8 cents, will we make money for sure, or not?


4.3 Evaluating your model is more important

We notice that building the model is straightforward, with only one line of code. However, we spend more time constructing and selecting the predictors included in the model.

The most important step is to evaluate the performance of your model correctly. In the example above, the signal ('Y_predict') is too good to be true.

ms.shape

(777, 17)

We use 777 days to build a linear regression model and then apply this model to the same 777 days. This is not right. We should separate the data into train and test, as described in the last section: build the model on the training data and validate the model and the strategy on the test data. I will leave this as your first assignment.


Suppose we use all historical data to build a regression model, and the average daily profit in the historical data is, for example, $1 per day. We do not know whether the future performance of the model will be similar. In other words, we cannot evaluate the performance of the model correctly.

Instead, we separate all historical data into train and test. For example, suppose we have 100 days in total. Then we use the first 80 days as train data, which is for model building. After getting a model (which describes the relationship between your target Y and predictors X1, X2, . . .) from the training data, we use this model to make predictions in both the train and test data. We can either evaluate the accuracy or fit of the model (we will cover the details in Part III), or evaluate the return if you make decisions based on your model, in both train and test.
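The procedure above can be sketched in code. This is a minimal sketch with random stand-in data (the real ms DataFrame would be used in the assignment); the cut is chronological, not random, because the rows are ordered in time:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Stand-in data: 100 "days" of 8 predictors and a target, like the ms DataFrame.
rng = np.random.RandomState(0)
cols = ['X%d' % i for i in range(1, 9)]
ms = pd.DataFrame(rng.randn(100, 8), columns=cols)
ms['Y'] = rng.randn(100)

# Chronological split: the first 80% of rows are train, the rest are test.
cut = int(len(ms) * 0.8)
train, test = ms.iloc[:cut], ms.iloc[cut:]

# Fit on the training data only.
myModel = linear_model.LinearRegression()
myModel.fit(train[cols], train['Y'])

# Predict on both sets; comparing the two performances reveals overfitting.
train_pred = myModel.predict(train[cols])
test_pred = myModel.predict(test[cols])
print(len(train), len(test))
```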

We claim that the performance of the model is consistent if the return or the fit of the model is similar in both the train and test data.

Otherwise, the model is said to be overfitting: its performance differs significantly between train and test.

That is why we need to study statistics in a more systematic way, in order to know:

• Random Variables and Distributions
• Association of Two Variables
• Hypothesis testing and significance levels
• Evaluation of Linear Regression Models


First, load all the necessary modules:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
from sklearn import linear_model
import statsmodels.api as smf
import warnings
warnings.filterwarnings('ignore')

The last two lines hide warning messages when we run the code.

Background

In this chapter, we have built a regression model for the change of the close price of Microsoft:

ms = pd.DataFrame.from_csv('microsoft.csv')
ms['Y'] = ms['Close'].shift(-1) - ms['Close']
ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)
ms = ms.dropna(axis=0)

Please notice: to read the data successfully, you need to put "microsoft.csv" and the assignment notebook in the same folder.

Problem 1.

Divide the data into training data (80%) and test data (20%).

Problem 2.

Build a linear regression model for Y using X1, . . . , X8 with the training data. Make predictions in both the training and test data.

Problem 3.

Compute the average daily profit in the training data and in the test data using the following signal-based strategy:

• if Y_predict >0, buy today and sell it tomorrow


• else, do nothing.

Is the performance of your prediction model consistent? (Consistency means that it has similar

performance in train and test.)

Problem 4.

At the end of Section 4.1, we claimed that, if the average daily profit is higher than the transaction cost, the strategy can be implemented. (Now we know that consistency of the model is also a necessary condition.)

But on some days we do not trade, and those days are also counted in the average daily profit. To evaluate implementability more precisely, we need to compute the average daily profit over only the days we trade. Could you compute this adjusted average profit for the training and test periods?


Appendix

Launching Jupyter notebook

The Jupyter Notebook App can be launched by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu (Windows) or by typing in a terminal (cmd on Windows):

jupyter notebook

This will launch a new browser window (or a new tab) showing the Notebook Dashboard, a sort of control panel that allows you (among other things) to select which notebook to open.

When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folders). If you store your notebook documents in a subfolder of your user folder, no configuration is necessary. Otherwise, you need to choose a folder which will contain all the notebooks and set this folder as the Jupyter Notebook App start-up folder.

Basics of Python

Python, unlike C++, Java, Scala, etc., is a high-level programming language: it does not require you to manage the hardware.

Python is considered one of the easiest programming languages to learn. Compared to R, another high-level language for data analytics, it is also considered to be more friendly.

For beginners, it is crucial to know the different classes/types of objects (data), and to understand that your data can be loaded or packed in different types.

Core Python consists of all the data types, with their methods and attributes, that are available without importing any additional modules. It has the following data types:


• integer

• float

• string

• bool

• list

• tuple

• dictionary

• set

For this book, we will use integer, float, string, bool, list and tuple.

Integer, float

a = 10
b = 10.1

a.bit_length()

b.real

10.1

a and b have different data types/classes, hence they have different methods and attributes. bit_length() is a built-in method for integers and .real is a built-in attribute for float numbers.

String

A string is content enclosed by double quotes or single quotes.

c = "I am handsome"
d = "123"

d + 3


--------------------------------------------------------------------------

TypeError: must be str, not int

You get an error message. We can use str(3) to change the integer 3 into the string "3". Then we have

d + str(3)

and the output is '1233'. Hence the addition of strings concatenates two strings. Strings are very useful in natural language processing. For this book, strings are used for columns' (variables') names.

List

A list is one of the major data types we will use. It is a collection of elements of all possible data types.

aList = [10, 12.06, "Tiger", False]

Lists use square brackets. Lists have many built-in methods; the most often used one is "append".

aList

• You have a list of four elements. A list is an ordered set of elements enclosed in square brackets.
• Python is 0-indexed, so the first element of any non-empty list, e.g. aList, is always aList[0].
• The last element of this four-element list is aList[3], because lists are always zero-based.

aList[-1]

False

• A negative index accesses elements from the end of the list, counting backwards. The last element of any non-empty list, e.g. aList, is always aList[-1].

aList[2:5]


• You can get a subset of a list, called a "slice", by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with the first slice index (in this case aList[2]), up to but not including the second slice index (in this case aList[5]). A slice index past the end of the list is allowed; the slice simply stops at the last element.
• Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don't want. The return value is everything in between.
• Lists are zero-based, so aList[0:3] returns the first three elements of the list, starting at aList[0], up to but not including aList[3].

aList[0:3]
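To make these indexing and slicing rules concrete, here is a short sketch using the four-element aList defined above:

```python
aList = [10, 12.06, "Tiger", False]

print(aList[0])     # first element: 10
print(aList[-1])    # last element: False
print(aList[2:5])   # ['Tiger', False] -- a slice may run past the end
print(aList[0:3])   # the first three elements: [10, 12.06, 'Tiger']

aList.append("Lily")  # append adds a new element at the end
print(aList)          # [10, 12.06, 'Tiger', False, 'Lily']
```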

A tuple is similar to a list, in the sense that it is a collection of elements of all kinds of types/classes:

aTuple = (10, 12.06, "Tiger", False)

but tuples use parentheses, whereas lists use square brackets. A tuple cannot be changed after it is defined.

aTuple[0] = 20

---------------------------------------------------------------------------
TypeError: 'tuple' object does not support item assignment

Lists, in contrast, can be changed after they are defined:

aList[0] = 10000


Part II

Variables, Samples and Statistical Inferences


We need to import the following modules for this section:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The field of inferential statistics enables you to make educated guesses about the numerical characteristics of large groups. The logic of sampling gives you a way to test conclusions about such groups using only a small portion of their members.

A population is a group of phenomena that have something in common. The term often refers

to a group of people, as in the following examples:

• All registered voters in Thailand

• All members of US Institute of Mathematical Statistics

• All Hong Kong citizens who played golf at least once in the past year

But populations can refer to things as well as people:

• All neurons from your brain

Often, researchers want to know things about populations but do not have data for every person or thing in the population. If a company's customer service division wanted to learn whether its customers were satisfied, it would not be practical (or perhaps even possible) to contact every individual who purchased a product. Instead, the company might select a sample of the population.

A sample is a smaller group of members of a population selected to represent the population.

In order to use statistics to learn things about the population, the sample must be random.

A random sample is one in which every member of a population has an equal chance of being

selected. The most commonly used sample is a simple random sample. It requires that every

possible sample of the selected size has an equal chance of being used.


There are two kinds of random samples often used:

• When a population element can be selected more than one time, we are sampling with replacement.
• When a population element can be selected only one time, we are sampling without replacement.

For example, we consider a collection of scores of the first assignment as a population.

data = pd.DataFrame()
data['Population'] = [47, 85, 41, 3, 15, 46, 35, 43, 92,
                      45, 59, 35, 20, 81, 30, 33, 6, 12,
                      38, 10, 11, 48, 4, 99, 62, 72, 15,
                      8, 31, 37, 21, 72, 90, 51, 97, 66,
                      5, 22, 73, 59, 57, 93, 53, 31, 20,
                      82, 20, 39, 82, 22, 28, 56, 94, 73,
                      95, 59, 53, 11, 71, 85, 20, 57, 88]

a_sample_without_replacement = data['Population'].sample(10)
a_sample_without_replacement

10 59

54 95

28 31

5 46

45 82

20 11

16 6

61 57

8 92

62 88

Name: Population, dtype: int64

a_sample_with_replacement = data['Population'].sample(10, replace=True)

A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Statistical inference enables you to make an educated guess about a population parameter based on a statistic computed from a sample randomly drawn from that population.


For example, say you want to know the mean income of the subscribers to a particular magazine (a parameter of a population). You draw a random sample of 100 subscribers and determine that their mean income is $27,500 (a statistic). You conclude that the population mean income µ is likely to be close to $27,500 as well. This example is one of statistical inference.

Different symbols are used to denote statistics and parameters, as the following table shows.

                       Sample statistic    Population parameter
Mean                   x̄                   µ
Variance               s²                  σ²
Standard deviation     s                   σ

print("Population mean is", data['Population'].mean())
print("Population variance", data['Population'].var(ddof=0))
print("Population standard deviation is",
      data['Population'].std(ddof=0))
print("Population size is", data['Population'].shape[0])

Population variance 812.5069286974048

Population standard deviation is 28.504507164611788

Population size is 63

The values of these population quantities do not change from sample to sample. However:

a_sample = data['Population'].sample(10, replace=True)
print("Sample mean is", a_sample.mean())
print("Sample variance", a_sample.var(ddof=1))
print("Sample standard deviation is", a_sample.std(ddof=1))
print("Sample size is", a_sample.shape[0])


Sample variance 632.9

Sample standard deviation is 25.157503850740042

Sample size is 10

Notice that we use different values of the "ddof" parameter when we calculate the standard deviation and variance for the population and for a sample. First, we have the following formulas for parameters and statistics, where x1, x2, . . . , xN are all the items in the population, and x1, x2, . . . , xn are a random collection from the population which makes up a sample. n is the sample size and N is the population size. We have

x̄ = (x1 + x2 + . . . + xn)/n ,    µ = (x1 + x2 + . . . + xN)/N

s² = [(x1 − x̄)² + (x2 − x̄)² + . . . + (xn − x̄)²]/(n − 1) ,    σ² = [(x1 − µ)² + (x2 − µ)² + . . . + (xN − µ)²]/N

These formulas look very lengthy. Statisticians use Σ to denote summation, as follows:

x̄ = (1/n) ∑_{i=1}^{n} x_i ,    µ = (1/N) ∑_{i=1}^{N} x_i

s² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)² ,    σ² = (1/N) ∑_{i=1}^{N} (x_i − µ)²
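The ddof argument seen earlier corresponds directly to these denominators, which a quick numerical check confirms (the toy numbers here are illustrative):

```python
import pandas as pd

x = pd.Series([47, 85, 41, 3, 15])       # a small toy data set
n = len(x)
xbar = x.sum() / n                       # sample mean, here 38.2

s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance, denominator n - 1
sigma2 = ((x - xbar) ** 2).sum() / n     # population variance, denominator N

# pandas implements exactly these denominators via the ddof argument
print(s2, x.var(ddof=1))        # both about 1013.2
print(sigma2, x.var(ddof=0))    # both about 810.56
```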

An often-asked question from beginners is why the sample variance is divided by n − 1 instead of n. We first need to explain what estimators and unbiased estimators are.

An estimator is a statistic of a sample intended to approximate a population parameter.

An unbiased estimator is an estimator whose average over all samples is identical to the population parameter. Otherwise the estimator is called biased.

For example, if we take samples with replacement from the population, compute the sample variance with ddof=1 (denominator n − 1) for each sample, and then take the average, we can check whether the sample variance is an unbiased estimator of the population variance.

sample_variance_collection = []   # run this initialization once

# run the following cell repeatedly: draw a sample, record its variance
a_sample = data['Population'].sample(10, replace=True)
sample_variance_collection.append(a_sample.var(ddof=1))

We run the sampling cell 200 times, so that "sample_variance_collection" collects 200 sample variances computed from 200 samples.

print("totally, we get", len(sample_variance_collection), "sample variances")
# len is applied to compute the size of a list


sample_variance_collection1 = pd.DataFrame(sample_variance_collection)
print("Average of sample variance is",
      sample_variance_collection1[0].mean())
print("Population variance is", data['Population'].var(ddof=0))

Population variance is 812.5069286974048

We can see that the average of the sample variances is very close to the population variance. We can expect that, as we continue to take samples, the average will approach the true population variance. Readers can check that the average of the sample variances with ddof=0 will not approach the population variance.

Hence the sample variance with denominator n − 1 is an unbiased estimator.

More intuitively: when we take a sample, the sample mean is the center of the sample data, which might be quite different from the population mean. Hence the deviations of the sample items from the sample mean are smaller than their deviations from the population mean. That is why we divide by n − 1, a smaller number than n, to approximate the population variance.
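Instead of re-running a cell 200 times by hand, the whole experiment can be written as a loop. A minimal sketch, with a smaller toy population standing in for the assignment scores, and fixed seeds so the run is reproducible:

```python
import pandas as pd

data = pd.DataFrame()
data['Population'] = [47, 85, 41, 3, 15, 46, 35, 43, 92, 45]  # toy population

var_ddof1, var_ddof0 = [], []
for i in range(2000):
    # draw a with-replacement sample of size 5
    a_sample = data['Population'].sample(5, replace=True, random_state=i)
    var_ddof1.append(a_sample.var(ddof=1))  # denominator n - 1
    var_ddof0.append(a_sample.var(ddof=0))  # denominator n

pop_var = data['Population'].var(ddof=0)   # about 659.76 for this toy population
avg1 = sum(var_ddof1) / len(var_ddof1)
avg0 = sum(var_ddof0) / len(var_ddof0)
print(pop_var, avg1, avg0)  # avg1 is close to pop_var; avg0 is systematically smaller
```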

rb.head()

                         AskPrice  BidPrice  AskVolume  BidVolume   Price
NdateTime
2018-02-06 21:00:00.500    3948.0    3945.0      199.0        4.0  3945.0
2018-02-06 21:00:01.000    3948.0    3946.0      352.0      132.0  3946.0
2018-02-06 21:00:01.500    3947.0    3946.0       16.0        6.0  3946.0
2018-02-06 21:00:02.000    3947.0    3946.0        5.0       95.0  3946.0
2018-02-06 21:00:02.500    3948.0    3946.0      191.0       37.0  3948.0

                         Volume   Turnover
NdateTime
2018-02-06 21:00:00.500   762.0  3007738.0
2018-02-06 21:00:01.000  2450.0  9668594.0
2018-02-06 21:00:01.500  1106.0  4363998.0
2018-02-06 21:00:02.000   540.0  2130974.0
2018-02-06 21:00:02.500   424.0  1673534.0


"rb" is an abbreviation of "reinforcing steel bar", or rebar, which is traded on the Shanghai Futures Exchange. The details of rb contracts can be found on the exchange's website, but only in Chinese. It is traded from 21:00–23:00, 9:00–10:15, 10:30–11:30 and 13:30–15:00 on workdays.

Every 500 milliseconds, the exchange releases trading information, including:

• AskPrice: the lowest price to sell
• BidPrice: the highest price to buy
• AskVolume: the number of shares waiting to sell at the AskPrice
• BidVolume: the number of shares waiting to buy at the BidPrice
• Volume: the number of shares traded during the 500 milliseconds
• Turnover: the amount of money spent in trading during the 500 milliseconds
• Price: the price of the last trade.

Return_t = (Price_t − Price_{t−1}) / Price_{t−1}
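This return formula is exactly what pandas' pct_change method computes, as a quick sketch on a toy price series confirms:

```python
import pandas as pd

price = pd.Series([3945.0, 3946.0, 3946.0, 3948.0])  # toy prices

ret_builtin = price.pct_change()                      # (P_t - P_{t-1}) / P_{t-1}
ret_manual = (price - price.shift(1)) / price.shift(1)

# The first return is NaN (there is no previous price); the rest agree
# up to floating-point rounding.
diff = (ret_builtin - ret_manual).abs().max()
print(diff < 1e-12)
```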

rb.head()


                         AskPrice  BidPrice  AskVolume  BidVolume   Price
NdateTime
2018-02-06 21:00:00.500    3948.0    3945.0      199.0        4.0  3945.0
2018-02-06 21:00:01.000    3948.0    3946.0      352.0      132.0  3946.0
2018-02-06 21:00:01.500    3947.0    3946.0       16.0        6.0  3946.0
2018-02-06 21:00:02.000    3947.0    3946.0        5.0       95.0  3946.0
2018-02-06 21:00:02.500    3948.0    3946.0      191.0       37.0  3948.0

                         Volume   Turnover    Return
NdateTime
2018-02-06 21:00:00.500   762.0  3007738.0       NaN
2018-02-06 21:00:01.000  2450.0  9668594.0  0.000253
2018-02-06 21:00:01.500  1106.0  4363998.0  0.000000
2018-02-06 21:00:02.000   540.0  2130974.0  0.000000
2018-02-06 21:00:02.500   424.0  1673534.0  0.000507

rb.shape[0]

414020

The sample size is huge, but it is still a sample. What is the population in this example? It is the collection of infinitely many returns (rates) that our sample possibly comes from. We are not sure what the mean, variance or proportions of all possible values in the population are. But with this large sample, we can make inferences (meaningful guesses) about them. To do that, we need to explore our sample to get more knowledge about this set of data.


import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

numpy is another important module; it can generate arrays of random variables with a given distribution, perform column-wise scientific computation, etc.

6.1 Variables, Cases and DataFrame

The figure above shows a small collection of sea shells gathered on a beach. All the shells in the collection are similar: small disk-shaped shells with a hole in the center. But the shells also differ from one another in overall size and weight, in color, in smoothness, in the size of the hole, etc.

Any data set is something like the shell collection. It consists of cases (observations): the objects in the collected sample.

Each case has one or more attributes or qualities, called variables. This word "variable" emphasizes that it is differences or variation that is often of primary interest. Usually, there are many possible variables. The researcher chooses those that are of interest, often drawing on detailed knowledge of the system that is under study.

The researcher measures or observes the value of each variable for each case. The result is a


table, also known as a data frame: a sort of spreadsheet. Within the data frame, each row refers to

one case, each column to one variable.

4.3, 12.0, 3.8 are called observed values of the variable "diameter".

Most people tend to think of data as numeric, but variables can also be descriptions, as the sea shell collection illustrates. The two basic types of data are:

• Numerical: a number, for instance height, weight, age, and so on. Numerical variables can be further classified as discrete and continuous.
• Categorical: a description that can be put simply into words or categories, for instance male versus female or red vs green vs yellow, and so on.

Some categorical variables have values that have a natural order. For example, a categorical variable for temperature might have values such as "cold," "warm," "hot," "very hot." Variables like this are called ordinal. Opinion surveys often ask for a choice from an ordered set such as this: strongly disagree, disagree, no opinion, agree, strongly agree.
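In pandas, such ordinal variables can be represented with an ordered Categorical, so that comparisons and sorting follow the declared order rather than alphabetical order. A sketch with hypothetical survey answers:

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'no opinion', 'agree', 'strongly agree']
answers = pd.Series(
    pd.Categorical(['agree', 'disagree', 'strongly agree', 'agree'],
                   categories=levels, ordered=True))

# Ordered categoricals support order comparisons against a category label.
mask = answers > 'no opinion'
print(mask.tolist())   # [True, False, True, True]

# Sorting follows the declared order, not alphabetical order.
print(answers.sort_values().tolist())  # ['disagree', 'agree', 'agree', 'strongly agree']
```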

rb = pd.DataFrame.from_csv("data/Rb2018.csv")
rb.head()


                     AskPrice  BidPrice  AskVolume  BidVolume    Price
NdateTime
2018-02-06 21:00:00   39480.0   39450.0      199.0        4.0  39450.0
2018-02-06 21:00:30   39460.0   39450.0      184.0      224.0  39460.0
2018-02-06 21:01:00   39450.0   39440.0        8.0       83.0  39440.0
2018-02-06 21:01:30   39510.0   39500.0        5.0      730.0  39510.0
2018-02-06 21:02:00   39480.0   39460.0      236.0      315.0  39480.0

                     Volume   Turnover
NdateTime
2018-02-06 21:00:00   762.0  3007738.0
2018-02-06 21:00:30    56.0   220954.0
2018-02-06 21:01:00   290.0  1144022.0
2018-02-06 21:01:30   468.0  1849118.0
2018-02-06 21:02:00    38.0   149992.0

Next we compute VAP: volume adjusted price.

VAP = (AskPrice × BidVolume + BidPrice × AskVolume) / (AskVolume + BidVolume)

rb['VAP'] = (rb['BidPrice']*rb['AskVolume'] + rb['AskPrice']*rb['BidVolume']) / (rb['AskVolume'] + rb['BidVolume'])

rb['Return'] = rb['VAP'].pct_change()
rb['Direction'] = ['Up' if x > 0 else 'Down' for x in rb['Return']]

rb.head()

                     AskPrice  BidPrice  AskVolume  BidVolume    Price
NdateTime
2018-02-06 21:00:30   39460.0   39450.0      184.0      224.0  39460.0
2018-02-06 21:01:00   39450.0   39440.0        8.0       83.0  39440.0
2018-02-06 21:01:30   39510.0   39500.0        5.0      730.0  39510.0
2018-02-06 21:02:00   39480.0   39460.0      236.0      315.0  39480.0
2018-02-06 21:02:30   39490.0   39480.0      141.0      598.0  39490.0

                     Volume   Turnover           VAP    Return Direction
NdateTime
2018-02-06 21:00:30    56.0   220954.0  39455.490196  0.000124        Up
2018-02-06 21:01:00   290.0  1144022.0  39449.120879 -0.000161      Down
2018-02-06 21:01:30   468.0  1849118.0  39509.931973  0.001542        Up
2018-02-06 21:02:00    38.0   149992.0  39471.433757 -0.000974      Down
2018-02-06 21:02:30    24.0    94772.0  39488.092016  0.000422        Up


"Direction" is a categorical variable, and the others are numerical variables.

"Data", or a "data set", as we often say, is a data frame consisting of multiple cases of one or more variables.

A first step in data analysis is so-called exploratory data analysis.

We can compute numerical descriptive measures from three aspects:

• Central tendency: mean, median, mode
• Variation: variance, standard deviation, range, interquartile range
• Distribution: tails, skewness, Q-Q plot, extreme events and heavy-tailed distributions

We will explain distribution after we learn the histogram, which visualizes the distribution.

rb.mode()


     AskPrice  BidPrice  AskVolume  BidVolume    Price  Volume  Turnover
0     39230.0   39350.0        1.0        1.0  39210.0     0.0       0.0
1     39360.0       NaN        NaN        NaN  39230.0     NaN       NaN
...       ...       ...        ...        ...      ...     ...       ...

mode() returns one row per mode, so a column with fewer modes than the longest column is padded with NaN; rows 2 through 6919 are entirely NaN in these columns.

Introduction to Statistical Analysis Xuhu Wan

rb.mean()

AskPrice 39369.937942

BidPrice 39359.812383

AskVolume 391.124116

BidVolume 383.348102

Price 39364.801559

Volume 65.310723

Turnover 257289.964497

VAP 39364.851415

Return 0.000003

dtype: float64

rb.median()

AskPrice 3.930000e+04

BidPrice 3.929000e+04

AskVolume 3.130000e+02

BidVolume 3.130000e+02

Price 3.929000e+04

Volume 8.000000e+00

Turnover 3.155200e+04

VAP 3.929283e+04

Return 5.855534e-07

dtype: float64

The mean can be affected by extreme values (extremely large or small values). However, the mode and median are not affected.
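A quick sketch of this on toy numbers: one extreme value moves the mean substantially but leaves the median unchanged.

```python
import pandas as pd

scores = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 12, 1000])  # one extreme value added

print(scores.mean(), scores.median())              # 11.6 12.0
print(with_outlier.mean(), with_outlier.median())  # mean jumps above 176, median stays 12.0
```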

AskPrice 4.800383e+02

BidPrice 4.800329e+02

AskVolume 3.520719e+02

BidVolume 3.450304e+02

Price 4.799575e+02

Volume 6.468859e+02

Turnover 2.568355e+06

VAP 4.800255e+02

Return 3.469254e-04

dtype: float64


Variation measures are also affected by extreme values. In order to evaluate the variation of the data robustly, we can use the interquartile range. First let us compute quantiles.

rb.quantile(0.5)

AskPrice 3.930000e+04

BidPrice 3.929000e+04

AskVolume 3.130000e+02

BidVolume 3.130000e+02

Price 3.929000e+04

Volume 8.000000e+00

Turnover 3.155200e+04

VAP 3.929283e+04

Return 5.855534e-07

Name: 0.5, dtype: float64

The interquartile range (IR) is the difference between the 75% quantile and the 25% quantile, and it is used to describe the variation of the data. Unlike the standard deviation or variance, extreme values have no impact on the IR.

IR = rb.quantile(0.75) - rb.quantile(0.25)
print(IR)

AskPrice 360.000000

BidPrice 360.000000

AskVolume 360.000000

BidVolume 349.000000

Price 350.000000

Volume 34.000000

Turnover 133352.000000

VAP 352.470772

Return 0.000229

dtype: float64

plt.figure(figsize=(15, 5))
plt.hist(rb['Return'], bins=100)
plt.axvline(x=rb['Return'].quantile(0.25), color='r')
plt.axvline(x=rb['Return'].quantile(0.75), color='r')
plt.show()


Bar charts and pie charts are the two classic plotting methods for categorical variables. The following are several examples.

rb['Direction'].value_counts().plot(kind='pie')
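A bar chart of the same frequency table can be produced with kind='bar'. A minimal sketch, using a toy stand-in series for rb['Direction'] and the non-interactive Agg backend so it runs as a plain script (in a notebook, %matplotlib inline suffices):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting
import matplotlib.pyplot as plt
import pandas as pd

# toy stand-in for rb['Direction']
direction = pd.Series(['Up', 'Down', 'Up', 'Up', 'Down', 'Up'])

# one bar per category, with height equal to the category's count
ax = direction.value_counts().plot(kind='bar')
plt.tight_layout()
plt.savefig('direction_bar.png')
```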


rb['Return'].plot(kind='hist', bins=100)

plt . show ()

The boxplot is an important visualization method for numerical data.

• In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles.
• Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the names box-and-whisker plot and box-and-whisker diagram.

rb['Return'].plot(kind='box')
plt.show()


The values that appear in a sample (or population) occur in certain proportions. In a sample, these proportions are called frequencies or relative frequencies (in percentages). The proportions of the different values of a variable in the population are called its distribution.

We can apply a histogram to visualize the distribution, or value_counts() (for discrete or categorical variables) to get a frequency table for any variable.

n, bins, patch = plt.hist(rb['Return'], bins=10)
plt.show()


bins

0.00183297, 0.00341738, 0.0050018 , 0.00658622, 0.00817063,

0.00975505])

n

6.03700000e+03, 8.56000000e+02, 6.00000000e+00,

2.00000000e+00, 0.00000000e+00, 0.00000000e+00,

1.00000000e+00])

As the sample size increases, the relative frequencies will converge to the relative frequencies in the population, which we call the "distribution".


9 Sampling Distribution


Part III

Prediction with Multiple Linear Regression


List of Figures

1 Maps of energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Growth of unstructured data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 High frequency data of stock price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4 Matching Index of "Data Science" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5 Matching index of "deep learning" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

6 Daily close price of Facebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

7 Slow and fast signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

8 Stock price and moving average processes. . . . . . . . . . . . . . . . . . . . . . . . . 20

9 Daily profit of signal-based strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

10 Accumulated profit or wealth process of signal-based strategy . . . . . . . . . . . . 23

11 Improved profit with parameter-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 24

12 Performance of strategy in training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 25

13 Performance of strategy in testing set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

14 Performance in train data with parameters tuned in train data. . . . . . . . . . . . . . 26

15 Performance in test data with parameters tuned in train data . . . . . . . . . . . . . . 27

16 Average daily profit with regression model . . . . . . . . . . . . . . . . . . . . . . . . 31

17 Accumulated profit with regression model. . . . . . . . . . . . . . . . . . . . . . . . . 32

18 A snapshot of jupyter notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

19 Population and sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

20 Parameter and statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

21 Reinforcing steel bar, rebar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

22 shells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

23 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

24 Demonstration of interquantile range for Return. . . . . . . . . . . . . . . . . . . . . . 55

25 Pie chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

26 Bar chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

27 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

28 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

29 Polygon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

30 Histogram with the output of frequency and bars . . . . . . . . . . . . . . . . . . . . 59
