
Introduction to Statistical Analysis

Xuhu Wan*

March 4, 2018

* Department of ISOM, HKUST, xuhu.wan@gmail.com


Contents

I What Data Scientists Do

1 Introduction
1.1 Big Data
1.2 Importance of Big Data
1.3 Big Data Changing Financial Trading
1.4 Learning Statistics to Become a Data Scientist
1.5 Why Python

2 Play With Stock Data using DataFrame
2.1 Import Data
2.2 Slicing DataFrame
2.2.1 Selection by label
2.2.2 Selection by position
2.2.3 Price difference and return
2.2.4 Moving average
2.3 Advanced topic: Download Historical Stock Price

3 Let Us Make Money
3.1 Trend-following using moving average
3.2 Making strategy better with parameters tuning
3.3 Training set and test set

4 Making Money is Not To Gamble
4.1 Can the stock market be predicted
4.2 Build our first prediction model
4.3 Evaluating your model is more important

II Variables, Samples and Statistical Inferences

5 Sample and Population
5.1 Population and Sample
5.2 Parameter and Statistics
5.3 Identify Population and Sample

6 Variables, Frequency and Distribution
6.1 Variables, Cases and DataFrame
6.1.1 Kinds of Variables
6.2 Exploratory Data Analysis
6.2.1 Numerical Descriptive Statistics
6.2.2 Data Visualization
6.2.3 Relative Frequency, Distribution and Shape of Distribution

7 Probability, Distribution and Expectation

8 Association and Variable Selection

9 Sampling Distribution

10 Estimation of Stock Return and Volatility

11 Testing of Market Change

III Prediction with Multiple Linear Regression

Part I
What Data Scientists Do


1 Introduction
1.1 Big Data

Figure 1: Maps of energy Consumption

Big data is a term that describes the large volume of data – both structured and unstructured –
that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important.
It’s what organizations do with the data that matters. Big data can be analyzed for insights that
lead to better decisions and strategic business moves.
While the term “big data” is relatively new, the act of gathering and storing large amounts of
information for eventual analysis is ages old. In the early 2000s, industry analyst Doug Laney
articulated the now-mainstream definition of big data as the three Vs:

• Volume. Organizations collect data from a variety of sources, including business transac-
tions, social media and information from sensor or machine-to-machine data. In the past,
storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the
burden.

• Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of
data in near-real time.

• Variety. Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, email, video, audio, stock ticker data and finan-
cial transactions.


Figure 2: Growth of unstructured data

Two additional dimensions help reveal the value of big data:

• Variability. In addition to the increasing velocities and varieties of data, data flows can be
highly inconsistent with periodic peaks. Is something trending in social media? Daily, sea-
sonal and event-triggered peak data loads can be challenging to manage. Even more so with
unstructured data.

• Complexity. Today’s data comes from multiple sources, which makes it difficult to link,
match, cleanse and transform data across systems. However, it’s necessary to connect and
correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral
out of control.

1.2 Importance of Big Data


The importance of big data doesn’t revolve around how much data you have, but what you do
with it. You can take data from any source and analyze it to find answers that enable:

• cost reductions,
• time reductions,
• new product development and optimized offerings,
• smart decision making.
When you combine big data with high-powered analytics, you can accomplish business-
related tasks such as:

• Determining root causes of failures, issues and defects in near-real time.

• Generating coupons at the point of sale based on the customer’s buying habits.

• Recalculating entire risk portfolios in minutes.

• Detecting fraudulent behavior before it affects your organization.

With large amounts of information streaming in from countless sources, banks are faced with
finding new and innovative ways to manage big data. While it’s important to understand cus-
tomers and boost their satisfaction, it’s equally important to minimize risk and fraud while main-
taining regulatory compliance. Big data brings big insights, but it also requires financial institu-
tions to stay one step ahead of the game with advanced analytics.
When government agencies are able to harness and apply analytics to their big data, they gain
significant ground when it comes to managing utilities, running agencies, dealing with traffic
congestion or preventing crime. But while there are many advantages to big data, governments
must also address issues of transparency and privacy.


Customer relationship building is critical to the retail industry – and the best way to manage
that is to manage big data. Retailers need to know the best way to market to customers, the most
effective way to handle transactions, and the most strategic way to bring back lapsed business.
Big data remains at the heart of all those things.

1.3 Big Data Changing Financial Trading


Big data analytics can be used in predictive models to estimate the rates of return and probable
outcomes on investments. Increasing access to big data results in more precise predictions and
thus the ability to more effectively mitigate the inherent risks associated with financial trading.
High frequency trading has been used quite successfully up until now, with machines trading
independently of human input.

Figure 3: High frequency data of stock price

Machine learning is enabling computers to make human-like decisions, executing trades at
rapid speeds and frequencies that people cannot. The business archetype incorporates the best
possible prices, traded at specific times, and reduces manual errors that arise due to behavioural
influences. Here is a video link on high frequency trading.
Real-time analytics has the potential to improve the investing power of HFT firms and
individuals alike, as the insights gleaned by algorithmic analysis have levelled the playing field,
providing all with access to powerful information.
The power of algorithmic trading lies in the almost limitless capabilities. Structured and un-
structured data can be used and thus social media, stock market information and news analysis
can be used to make intuitive judgements. This situational sentiment analysis is highly valuable
as the stock market is an easily influenced archetype.
The full potential of this technology hasn’t yet been realized and the prospects for the applica-
tion of these innovations are immeasurable. Machine learning enables computers to actually learn
and make decisions based on new information by learning from past mistakes and employing
logic. In this way, these techniques can deliver supremely accurate perceptions. Although the
technology is still developing, the possibilities are promising. This particular avenue of research
removes the human emotional response from the model and makes decisions based on informa-
tion without bias.


1.4 Learning Statistics to Become a Data Scientist


Data science is a combination of three fields: statistics, computer science and domain
knowledge.
By gaining a broader perspective, you gain the ability not only to implement the models but also
to understand how they connect and relate to the deeper logic behind them.
The statistics that are immediately useful to data science typically fall into one of
two categories:

• statistical inference
• model fitting.

With regard to inference:

• Parameter Estimation
• Hypothesis testing
• Bayesian Analysis
• Identifying the best estimator

With regard to model fitting, there is a multitude of topics:

• Linear Regression
• Non-linear Regression
• Categorical Data Analysis/Classification
• Time Series / Longitudinal Analysis
• Machine Learning

1.5 Why Python

Before we start, I’d like to tell you about why I use Python for data analytics. I will try to convince
you that Python is really the best tool for most of the tasks involved.
Ideally, I would like to learn only one language that is suited for all kinds of work: number
crunching, application building, web development, interfacing with APIs etc. This language
would be easy to learn, the code would be compact and clear, it would run on any platform.
It would enable me to work interactively, enabling the code to evolve as I write it and be at least
free as in speech. And most importantly, I care much more about my own time than the cpu time
of my pc, so number crunching performance is less important for me than my own productivity.
If we match different languages with positions on employment websites:


Figure 4: Matching Index of "Data Science"

If we match with positions of deep learning:

Figure 5: Matching index of "deep learning"

Python, like most open source software has one specific characteristic: it can be challenging for
a beginner to find his way around thousands of libraries and tools. This guide will help you get
everything you need in your quant toolbox, hopefully without any problems. Fortunately there
are several distributions, containing most of the required packages, making installation a breeze.
The best distribution in my opinion is Anaconda from Continuum Analytics.
The Anaconda distribution includes:


• Python 3: the Python interpreter on top of which everything else runs
• Jupyter notebook: interactive shell and notebook
• numpy & scipy: scientific computing tools, similar to Matlab
• pandas: data structures library
• many more scientific and utility packages


2 Play With Stock Data using DataFrame


In this section, we will learn to load data into a Jupyter notebook.

2.1 Import Data


The folder "data" includes historical stock data for Facebook. With this data we can, for example,

• Navigate the price history

• Plot different features of stocks.

• Find important features which can predict price change.

• Compute some measures of stock price, for example moving average

We first need to load the data into the notebook and store it in a special format, "DataFrame",
which is an important data type defined in the Python module "pandas".
To use pandas, we need to import it.
import pandas as pd

"pd" is a short alias, which you can choose yourself. Next we will import the data, "facebook.csv".
CSV is an abbreviation of "Comma-Separated Values", which means that the values saved in
"facebook.csv" are separated by commas.
fb = pd.read_csv('data/facebook.csv', index_col=0, parse_dates=True)

We read the data from "facebook.csv" using the pandas function "read_csv". The argument
"index_col=0" uses the first column (the dates) as the index, and "parse_dates=True" converts it
to dates. (The older method "pd.DataFrame.from_csv" did the same but has been removed from
recent versions of pandas.) All methods and functions in Python are called with "( )". The data
is stored with a special data type/class, "DataFrame", and we name this DataFrame "fb". To check
the type of "fb", we can use
print(type(fb))

<class 'pandas.core.frame.DataFrame'>

Every data type/class in Python

• has its own specialized methods, e.g., DataFrame has the methods head(), tail(), describe(),

• and its own attributes (features), e.g., DataFrame has the attributes index, columns, shape, which
are not followed by "( )".
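As a minimal sketch of the loading step and of the difference between methods and attributes, the snippet below reads a tiny CSV from memory instead of from disk. The CSV text and the names `csv_text` and `demo` are made up for illustration; they are not the course data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file such as data/facebook.csv
csv_text = """Date,Open,Close
2015-01-02,20.13,20.13
2015-01-05,20.13,19.79
2015-01-06,19.82,19.19
"""

# index_col=0 uses the first column (Date) as the index;
# parse_dates=True converts that index to timestamps
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Attributes are read without parentheses
n_rows, n_cols = demo.shape
column_names = list(demo.columns)
first_date = demo.index[0]

# Methods are called with parentheses
top_two = demo.head(2)

print(type(demo))
print(n_rows, n_cols, column_names, first_date)
print(top_two)
```

Reading from `io.StringIO` is only a convenience here; passing a file path to `pd.read_csv` works the same way.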

If we want to check the rows at the top of the DataFrame, we can use
fb.head()


Open High Low Close Adj Close Volume


Date
2014-12-31 20.400000 20.510000 19.990000 20.049999 19.459270 4157500
2015-01-02 20.129999 20.280001 19.809999 20.129999 19.536913 2842000
2015-01-05 20.129999 20.190001 19.700001 19.790001 19.206934 4948800
2015-01-06 19.820000 19.840000 19.170000 19.190001 18.624611 4944100
2015-01-07 19.330000 19.500000 19.080000 19.139999 18.576082 8045200

Using head(), we can check the columns, the index and, for time series data, the starting date. Now
we know that our Facebook data starts on December 31, 2014, with 6 columns/variables: "Open",
"High", "Low", "Close", "Adj Close", "Volume".
Python uses 0-based indexing: "Open" is column 0, "High" is column 1, and so on.
The "column" before column 0 is not a variable; it is the index of the DataFrame "fb".
We can also use tail() to check the bottom of the DataFrame.
fb.tail()

Open High Low Close Adj Close \


Date
2018-01-30 241.110001 246.419998 238.410004 242.720001 242.720001
2018-01-31 245.770004 249.270004 244.449997 245.800003 245.800003
2018-02-01 238.520004 246.899994 238.059998 240.500000 240.500000
2018-02-02 237.000000 237.970001 231.169998 233.520004 233.520004
2018-02-05 227.000000 233.229996 205.000000 213.699997 213.699997

Volume
Date
2018-01-30 14270800
2018-01-31 11964400
2018-02-01 12980600
2018-02-02 17961600
2018-02-05 28869000

Now we know that the end date is February 5, 2018. Each row holds the information of one trading
day. If we want to know the total number of trading days, we can use
fb.shape

(780, 6)

"shape" is an attribute of DataFrame. It is a tuple: the first element states the number of rows
and the second element states the number of columns/variables. We can also print the column names
fb.columns


Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

and the index of the DataFrame
fb.index

DatetimeIndex(['2014-12-31', '2015-01-02', '2015-01-05', '2015-01-06',


'2015-01-07', '2015-01-08', '2015-01-09', '2015-01-12',
'2015-01-13', '2015-01-14',
...
'2018-01-23', '2018-01-24', '2018-01-25', '2018-01-26',
'2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
'2018-02-02', '2018-02-05'],
dtype='datetime64[ns]', name='Date', length=780, freq=None)

2.2 Slicing DataFrame


Slicing data is an extremely important skill if you want to work on real and big data. It involves two
tasks: indexing and selecting data.
The axis labeling information in DataFrame serves many purposes:

• Identifies data (i.e. provides metadata) using known indicators, important for analysis, vi-
sualization, and interactive console display
• Enables automatic and explicit data alignment
• Allows intuitive getting and setting of subsets of the data set

In this section, we will focus on the final point: namely, how to slice, dice, and generally get
and set subsets of a pandas DataFrame. The primary focus will be on DataFrame and Series (a
column of a DataFrame) as they have received more development attention in this area.
There are two major approaches to selecting data: selection by label and selection by position.

2.2.1 Selection by label


The indexer .loc[] is primarily label based, and will raise a KeyError when the items are not found.
fb.loc['2014-12-31':'2015-1-21', 'Close']


Date
2014-12-31 20.049999
2015-01-02 20.129999
2015-01-05 19.790001
2015-01-06 19.190001
2015-01-07 19.139999
2015-01-08 19.860001
2015-01-09 19.940001
2015-01-12 19.690001
2015-01-13 19.660000
2015-01-14 19.740000
2015-01-15 19.600000
2015-01-16 19.959999
2015-01-20 20.020000
2015-01-21 20.299999
Name: Close, dtype: float64

price = fb['Close']
print(type(price))

<class 'pandas.core.series.Series'>

’price’ is a pandas Series.
price.loc['2015-01-21']

20.299999

.loc[] selects data including both end points of a slice.

2.2.2 Selection by position


The DataFrame indexer .iloc[] is primarily integer position based (from 0 to length-1 of the axis). Unlike .loc[], a slice excludes its end position.
fb.iloc[0:5, 0:2]

Open High
Date
2014-12-31 20.400000 20.510000
2015-01-02 20.129999 20.280001
2015-01-05 20.129999 20.190001
2015-01-06 19.820000 19.840000
2015-01-07 19.330000 19.500000
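The difference between the two kinds of slices can be sketched on a small, made-up DataFrame (the name `toy` and its values are illustrative only): a label slice with .loc[] includes both end labels, while a position slice with .iloc[] stops before the end position.

```python
import pandas as pd

# Five consecutive days of made-up prices
dates = pd.date_range("2015-01-05", periods=5, freq="D")
toy = pd.DataFrame({"Open": [10, 11, 12, 13, 14],
                    "Close": [11, 12, 13, 14, 15]}, index=dates)

# Label-based slice: both end labels are included
by_label = toy.loc["2015-01-05":"2015-01-07", "Close"]

# Position-based slice: the stop position (3) is excluded
by_position = toy.iloc[0:3, 1]

print(by_label)
print(by_position)
```

Both selections return the same three rows, even though the label slice names its last row explicitly and the position slice names the row after it.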


For example, suppose we want to plot the price of Facebook in 2015.
import matplotlib.pyplot as plt
import seaborn as sb
fb2015 = fb.loc['2015-01-01':'2015-12-31', 'Close']
fb2015.plot()
plt.show()

Figure 6: Daily close price of Facebook

If you only want to get one column or multiple columns, you can do the following
a_column = fb['Close']
multiple_columns = fb[['Open', 'Close']]

2.2.3 Price difference and return


We can add a new column for the price difference

$$\text{PriceDiff}_t = \text{Price}_{t+1} - \text{Price}_t$$

In a DataFrame, this can be done easily
fb['PriceDiff'] = fb['Close'].shift(-1) - fb['Close']


shift() is a method of DataFrame and Series. shift(-1) looks one day ahead: it gives the value of
the next day. shift(1) gives the value of the variable one day ago.
In the computation above, we can see that the difference of two columns is calculated as the
difference between each pair of numbers in the two columns. This is a very nice property of DataFrame:
element-wise operation.
fb.head()

Open High Low Close Adj Close Volume \


Date
2014-12-31 20.400000 20.510000 19.990000 20.049999 19.459270 4157500
2015-01-02 20.129999 20.280001 19.809999 20.129999 19.536913 2842000
2015-01-05 20.129999 20.190001 19.700001 19.790001 19.206934 4948800
2015-01-06 19.820000 19.840000 19.170000 19.190001 18.624611 4944100
2015-01-07 19.330000 19.500000 19.080000 19.139999 18.576082 8045200

PriceDiff
Date
2014-12-31 0.080000
2015-01-02 -0.339998
2015-01-05 -0.600000
2015-01-06 -0.050002
2015-01-07 0.720002
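To make the behaviour of shift() concrete, here is a small sketch on a made-up Series (the values and variable names are illustrative). It shows where the missing values appear and how the element-wise subtraction pairs up rows.

```python
import pandas as pd

close = pd.Series([10.0, 12.0, 11.0, 15.0])

next_close = close.shift(-1)   # value one row later; the last row becomes NaN
prev_close = close.shift(1)    # value one row earlier; the first row becomes NaN

# Element-wise subtraction pairs rows that share the same index label
price_diff = next_close - close

print(pd.DataFrame({"close": close, "next": next_close,
                    "prev": prev_close, "diff": price_diff}))
```

The last row of the difference is NaN because there is no next-day price to subtract from.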

In finance, the return (here the daily return) is computed as

$$r_t = \frac{S_{t+1} - S_t}{S_t}$$

fb['Return'] = (fb['Close'].shift(-1) - fb['Close']) / fb['Close']
fb.head()


Open High Low Close Adj Close Volume \


Date
2014-12-31 20.400000 20.510000 19.990000 20.049999 19.459270 4157500
2015-01-02 20.129999 20.280001 19.809999 20.129999 19.536913 2842000
2015-01-05 20.129999 20.190001 19.700001 19.790001 19.206934 4948800
2015-01-06 19.820000 19.840000 19.170000 19.190001 18.624611 4944100
2015-01-07 19.330000 19.500000 19.080000 19.139999 18.576082 8045200

PriceDiff Return
Date
2014-12-31 0.080000 0.003990
2015-01-02 -0.339998 -0.016890
2015-01-05 -0.600000 -0.030318
2015-01-06 -0.050002 -0.002606
2015-01-07 0.720002 0.037618
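As a cross-check of the return formula, the sketch below compares the manual computation with pandas' built-in pct_change() on made-up prices. Note that pct_change() looks backward, so it has to be shifted by one row to match the forward-looking definition used here; the variable names are illustrative.

```python
import pandas as pd

close = pd.Series([20.0, 21.0, 18.9, 18.9])

# Definition used in the text: (S_{t+1} - S_t) / S_t
ret_manual = (close.shift(-1) - close) / close

# pct_change() gives (S_t - S_{t-1}) / S_{t-1}; shifting it back one row
# aligns it with the forward-looking definition above
ret_builtin = close.pct_change().shift(-1)

print(pd.DataFrame({"manual": ret_manual, "builtin": ret_builtin}))
```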

2.2.4 Moving average


In statistics, a moving average (rolling average or running average) is a calculation to analyze data
points by creating a series of averages of different subsets of the full data set.
For example, we can calculate the average prices of the last 40 days and 200 days.
fb['MA40'] = fb['Close'].rolling(40).mean()
fb['MA200'] = fb['Close'].rolling(200).mean()
fb.head()

Open High Low Close Adj Close Volume \


Date
2014-12-31 20.400000 20.510000 19.990000 20.049999 19.459270 4157500
2015-01-02 20.129999 20.280001 19.809999 20.129999 19.536913 2842000
2015-01-05 20.129999 20.190001 19.700001 19.790001 19.206934 4948800
2015-01-06 19.820000 19.840000 19.170000 19.190001 18.624611 4944100
2015-01-07 19.330000 19.500000 19.080000 19.139999 18.576082 8045200

PriceDiff Return MA40 MA200


Date
2014-12-31 0.080000 0.003990 NaN NaN
2015-01-02 -0.339998 -0.016890 NaN NaN
2015-01-05 -0.600000 -0.030318 NaN NaN
2015-01-06 -0.050002 -0.002606 NaN NaN
2015-01-07 0.720002 0.037618 NaN NaN
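The NaN values at the top of the moving-average columns appear because a window of length n needs n observations before it can produce an average. A small sketch on made-up numbers (the names `ma3` and `ma3_partial` are illustrative) shows this, along with the optional min_periods argument that relaxes the requirement.

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3 needs 3 observations, so the first 2 results are NaN
ma3 = close.rolling(3).mean()

# min_periods=1 instead averages whatever is available so far
ma3_partial = close.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"close": close, "ma3": ma3, "ma3_partial": ma3_partial}))
```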


fb['Close'].plot()
fb['MA40'].plot()
fb['MA200'].plot()
plt.show()

Figure 7: Slow and fast signal.

Why is the green line above the red line?

2.3 Advanced topic: Download Historical Stock Price

In this section, we will demonstrate how to download stock price or stock index data through
Yahoo or Google Finance. You need the package "pandas_datareader". If you have not installed it,
you need to install it on your laptop before importing it. (Installation only needs to be done once;
"import" has to be done whenever you open a notebook.)
!pip install pandas_datareader

import pandas_datareader as web
import datetime

"datetime" is a module which helps to represent dates and times.
# datetime.datetime is a data type within the datetime module
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2018, 2, 5)


# the function name DataReader is case sensitive
df = web.DataReader("msft", 'yahoo', start, end)

Then we can save it to a data file stored locally on your laptop
df.to_csv("data/microsoft.csv")


3 Let Us Make Money


We need the following modules and settings for this section.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn

3.1 Trend-following using moving average


We import the stock history and compute moving averages with long and short windows.
ms = pd.read_csv('data/microsoft.csv', index_col=0, parse_dates=True)
ms['MA10'] = ms['Close'].rolling(10).mean()
ms['MA50'] = ms['Close'].rolling(50).mean()

If we plot them in one figure, we have
plt.figure(figsize=(10, 5))  # to define the size of graph
ms['Close'].plot(legend=True, color='yellow')
ms['MA10'].plot(legend=True, color='red')  # fast signal
ms['MA50'].plot(legend=True, color='green')  # slow signal
plt.show()

Figure 8: Stock price and moving average processes.

We will go long (buy) when "MA10" is above "MA50" and do nothing otherwise. Each time we
will hold 1 share of Microsoft. We need to define a variable (column) with the name "Shares". We
will use a list comprehension to do this job. List comprehension is an extremely useful technique for
data analysis. We define a new list


newlist = [1 if ms.loc[ei, 'MA10'] > ms.loc[ei, 'MA50'] else 0 for ei in ms.index]

The list comprehension computes the shares by iterating through every index label of ’ms’. Then we can
define a column of ’ms’ using this list. List comprehension is very useful when preprocessing
data.
For example,
alist = [1, 2, 3, 4, 5]

We want to take the square of each item in alist, so we can write
squarelist = [x**2 for x in alist]
squarelist

[1, 4, 9, 16, 25]

Beyond that, we can also transform the values in alist following a more complicated rule:

• if x > 3, ’Pass’
• else, ’Fail’

pflist = ['Pass' if x > 3 else 'Fail' for x in alist]
pflist

['Fail', 'Fail', 'Fail', 'Pass', 'Pass']

Then we can make a new column with the name ’Shares’
ms['Shares'] = newlist

or we can do it in one step
ms['Shares'] = [1 if ms.loc[ei, 'MA10'] > ms.loc[ei, 'MA50'] else 0 for ei in ms.index]
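The same signal can also be computed without an explicit loop. The sketch below is an alternative to the list comprehension in the text, not a replacement for it: it compares whole columns at once and converts the resulting True/False values to 1/0. The small DataFrame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up moving-average columns, including a NaN warm-up row
demo = pd.DataFrame({"MA10": [np.nan, 5.0, 6.0, 4.0],
                     "MA50": [np.nan, 4.0, 6.5, 3.0]})

# Row-by-row version, as in the text
shares_loop = [1 if demo.loc[i, "MA10"] > demo.loc[i, "MA50"] else 0
               for i in demo.index]

# Vectorized version: compare whole columns, then turn True/False into 1/0
shares_vec = (demo["MA10"] > demo["MA50"]).astype(int)

print(shares_loop)
print(shares_vec.tolist())
```

Comparisons involving NaN evaluate to False in both versions, so the warm-up rows get 0 shares either way. On long price histories the vectorized form is usually much faster.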

Since we use the close price to compute the signal, this is equivalent to saying that we evaluate our
signals when the market is almost closed and then decide whether to buy. If ms["Shares"] is 1, we will buy
one share of stock and the profit is the close price of tomorrow minus the close price of today. Otherwise
we hold no position and the profit is 0. We need to define a new variable "Profit".
ms['Close1'] = ms['Close'].shift(-1)

This computes the close price of tomorrow.


ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]

We can plot the daily profit
ms['Profit'].plot()
plt.axhline(y=0, color='red')
plt.show()

Figure 9: Daily profit of signal-based strategy

It is not clear whether we make money or lose money. Hence we need to compute the cumulative
wealth.
ms['Profit'].cumsum().plot()


Figure 10: Accumulated profit or wealth process of signal-based strategy

It seems that we make some money, but we may go bankrupt before we get rich. Later we will
see that this strategy is not good because the maximum drawdown risk is high.
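The text does not define maximum drawdown at this point. As a rough sketch, assuming drawdown is measured as the distance between the running peak of cumulative profit and its current value (one common convention), it could be computed as follows. The profit numbers and variable names are made up for illustration.

```python
import pandas as pd

# Made-up daily profits for illustration
profit = pd.Series([1.0, -3.0, 2.0, 4.0, -5.0, 1.0])

wealth = profit.cumsum()          # accumulated profit over time
running_peak = wealth.cummax()    # highest wealth reached so far
drawdown = running_peak - wealth  # how far we are below that peak

max_drawdown = drawdown.max()
print(wealth.tolist())
print(drawdown.tolist())
print(max_drawdown)
```

A strategy can end with a positive total profit and still show a large maximum drawdown, which is the risk the paragraph above points to.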

3.2 Making strategy better with parameters tuning


We can change the window sizes of the signals to see if we can get better results. To make this process
convenient, we copy and paste the code above into a single cell.
fast = 40
slow = 50
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]
ms['Profit'].cumsum().plot()


Figure 11: Improved profit with parameter-tuning
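Instead of copying the cell for every new pair of window sizes, the computation can be wrapped in a function. The sketch below is one possible way to do this; the function name `ma_strategy_profit` is hypothetical, and the prices are a synthetic random walk used only so the example runs without the data file.

```python
import numpy as np
import pandas as pd

def ma_strategy_profit(close, fast, slow):
    """Daily profit from holding one share while the fast MA is above the slow MA."""
    ma_fast = close.rolling(fast).mean()
    ma_slow = close.rolling(slow).mean()
    shares = (ma_fast > ma_slow).astype(int)   # 1 = long, 0 = no position
    next_day_change = close.shift(-1) - close  # tomorrow's close minus today's
    # Rows without a next-day price (the last row) contribute zero profit
    return (shares * next_day_change).fillna(0.0)

# Synthetic random-walk prices, used only to make the sketch runnable
rng = np.random.default_rng(0)
close = pd.Series(50 + rng.normal(0, 1, 300).cumsum())

for fast, slow in [(10, 50), (40, 50)]:
    total = ma_strategy_profit(close, fast, slow).sum()
    print(fast, slow, round(total, 2))
```

With real data, `close` would be the 'Close' column of the loaded DataFrame, and the returned Series could be passed to cumsum().plot() as in the text.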

3.3 Training set and test set


We get much better results by tuning two key parameters in this strategy. But there is a big trap
here. The dataset used to adjust parameters should be different from the dataset which you use to
evaluate the performance. The first dataset is called the training set and the second one is called the
test set.
fast = 40
slow = 50
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]
Train = ms['Profit'].iloc[:-200]
Test = ms['Profit'].iloc[-200:]
Train.cumsum().plot()


Figure 12: Performance of strategy in training data.

You may only determine your best parameters on the training set. After the final choice of parameters
is made, use them on the test set, which mimics the real trading process: we train our models using
historical data and apply the best model tomorrow.
Test.cumsum().plot()

Figure 13: Performance of strategy in testing set.


If you use only the training set to adjust the parameters, you may choose a different set of numbers.
fast = 140
slow = 160
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1
                else 0 for ei in ms.index]
Train = ms['Profit'].iloc[:-200]
Test = ms['Profit'].iloc[-200:]
Train.cumsum().plot()

Figure 14: Performance in train data with parameters tuned in train data.

Test.cumsum().plot()


Figure 15: Performance in test data with parameters tuned in train data

Hence the best parameters (models) on the training data may not be the best parameters for the
test data.
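The workflow described in this section can be sketched as a small search: score several candidate window pairs on the training rows only, pick the best pair, and then evaluate that single choice once on the held-out rows. The candidate values, the helper `strategy_profit` and the synthetic prices below are all assumptions made for illustration, not the settings used in the text.

```python
import numpy as np
import pandas as pd

def strategy_profit(close, fast, slow):
    # Long one share when the fast average is above the slow one, else flat
    shares = (close.rolling(fast).mean() > close.rolling(slow).mean()).astype(int)
    return (shares * (close.shift(-1) - close)).fillna(0.0)

# Synthetic prices, for illustration only
rng = np.random.default_rng(1)
close = pd.Series(50 + rng.normal(0.02, 1, 600).cumsum())

candidates = [(f, s) for f in (10, 20, 40) for s in (50, 100, 150)]

# Score every candidate on the training rows only (all but the last 200)
train_scores = {
    (f, s): strategy_profit(close, f, s).iloc[:-200].sum() for f, s in candidates
}
best_fast, best_slow = max(train_scores, key=train_scores.get)

# Evaluate the chosen pair once on the held-out last 200 rows
test_profit = strategy_profit(close, best_fast, best_slow).iloc[-200:].sum()
print(best_fast, best_slow,
      round(train_scores[(best_fast, best_slow)], 2), round(test_profit, 2))
```

The test profit is reported but never used to choose the parameters; using it for selection would turn the test set into a second training set.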


4 Making Money is Not To Gamble

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn

4.1 Can the stock market be predicted


The efficient-market hypothesis (EMH) is a theory in financial economics that states that asset
prices fully reflect all available information. A direct implication is that it is impossible to predict
the price changes of stocks.
We should not think that the market is efficient because of EMH. Instead, EMH is just a definition
of market efficiency. The market is not efficient at any time scale.
But prediction of the stock market is extremely hard because we need to build a prediction
system that is robust against many impacts from outside of the market which change stochastically
across time, in other words, external variables.
Many prediction models are based only on price paths, and these are doomed to failure. A successful
model should

• use all information available: price, volume, etc.
• use a simple model structure, robust to external variables, known or unknown
• ...

4.2 Build our first prediction model


We will build a prediction model for the price change of Microsoft.
ms = pd.read_csv('data/microsoft.csv', index_col=0, parse_dates=True)
ms.head()

Open High Low Close Adj Close Volume


Date
2014-12-31 46.730000 47.439999 46.450001 46.450001 42.848763 21552500
2015-01-02 46.660000 47.419998 46.540001 46.759998 43.134731 27913900
2015-01-05 46.369999 46.730000 46.250000 46.330002 42.738068 39673900
2015-01-06 46.380001 46.750000 45.540001 45.650002 42.110783 36447900
2015-01-07 45.980000 46.459999 45.490002 46.230000 42.645817 29114100

Our target is to predict the change of the close price

$$Y = \text{Close}_{t+1} - \text{Close}_t$$

ms['Y'] = ms['Close'].shift(-1) - ms['Close']

We will use the following predictors (independent variables):

• X1, X2: change of close price, today and yesterday
• X3, X4: difference between high and low, today and yesterday
• X5, X6: difference between high and close, today and yesterday
• X7, X8: change of volume, today and yesterday


ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)

We will use [’X1’,’X2’,’X3’,’X4’,’X5’,’X6’,’X7’,’X8’] to predict Y.
First we need to delete the rows with NaN values
ms = ms.dropna(axis=0)

predictors = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

myModel = linear_model.LinearRegression()
myModel.fit(ms[predictors], ms['Y'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# Make predictions on the same data that was used to fit the model (in-sample)
ms['Y_predict'] = myModel.predict(ms[predictors])

ms.head()


Open High Low Close Adj Close Volume \


Date
2015-01-05 46.369999 46.730000 46.250000 46.330002 42.738068 39673900
2015-01-06 46.380001 46.750000 45.540001 45.650002 42.110783 36447900
2015-01-07 45.980000 46.459999 45.490002 46.230000 42.645817 29114100
2015-01-08 46.750000 47.750000 46.720001 47.590000 43.900375 29645200
2015-01-09 47.610001 47.820000 46.900002 47.189999 43.531395 23942800

Y X1 X2 X3 X4 X5 \
Date
2015-01-05 -0.680000 -0.429996 0.309997 0.480000 0.879997 0.399998
2015-01-06 0.579998 -0.680000 -0.429996 1.209999 0.480000 1.099998
2015-01-07 1.360000 0.579998 -0.680000 0.969997 1.209999 0.229999
2015-01-08 -0.400001 1.360000 0.579998 1.029999 0.969997 0.160000
2015-01-09 -0.590001 -0.400001 1.360000 0.919998 1.029999 0.630001

X6 X7 X8 Y_predict
Date
2015-01-05 0.660000 11760000.0 6361400.0 0.183903
2015-01-06 0.399998 -3226000.0 11760000.0 0.233266
2015-01-07 1.099998 -7333800.0 -3226000.0 0.022806
2015-01-08 0.229999 531100.0 -7333800.0 -0.009937
2015-01-09 0.160000 -5702400.0 531100.0 0.032701

Our model is called a multiple linear regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_7 X_7 + \beta_8 X_8 + e$

$\beta_0$ is called the intercept, and its fitted value is

myModel.intercept_

0.13597169031644135

$\beta_1, \dots, \beta_8$ are called the coefficients of the 8 predictors, and their fitted values are

myModel.coef_

array([ 9.13881358e-02, -8.91420644e-03, -2.60086967e-01,
       -6.67068059e-02,  4.04741435e-01,  8.95326239e-02,
        3.03771492e-09,  2.64468840e-09])
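To see what the fitted equation means in practice, the following sketch plugs the intercept and coefficients printed above into the regression equation for the row dated 2015-01-05. The variable names are illustrative only; all numbers are copied from the outputs shown in this section, so the result should agree with the Y_predict column up to rounding.

```python
# Intercept and coefficients as printed by myModel.intercept_ and myModel.coef_
intercept = 0.13597169031644135
coef = [9.13881358e-02, -8.91420644e-03, -2.60086967e-01, -6.67068059e-02,
        4.04741435e-01, 8.95326239e-02, 3.03771492e-09, 2.64468840e-09]

# Predictors X1..X8 for 2015-01-05, copied from the table above
x = [-0.429996, 0.309997, 0.480000, 0.879997,
     0.399998, 0.660000, 11760000.0, 6361400.0]

# Y_predict = beta_0 + beta_1 * X1 + ... + beta_8 * X8
y_predict = intercept + sum(b * xi for b, xi in zip(coef, x))
print(round(y_predict, 4))  # close to the 0.183903 shown in the Y_predict column
```

Note that the volume coefficients are tiny only because volume changes are measured in millions of shares; their contribution to the prediction is of the same order as that of the price-based predictors.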

Can we generate a profit using our prediction model? We use the following strategy:
• if Y_predict > 0, we buy today and sell tomorrow
• otherwise, we do nothing.


# Profit per share: tomorrow's price change on days we trade, 0 otherwise
ms['Profit'] = [ms.loc[t, 'Y'] if ms.loc[t, 'Y_predict'] > 0 else 0
                for t in ms.index]

The total profit is
ms['Profit'].sum()

66.28998699999993

Bingo, we make money using our model, and we make more money than with the trend-following strategy.
plt.plot(ms['Profit'])

Figure 16: Average daily profit with regression model

If we plot the daily profit, we see that we in fact lose money on some days, but the number of days we win is larger than the number of days we lose.
plt.plot(ms['Profit'].cumsum())


Figure 17: Accumulated profit with regression model.

ms['Profit'].mean(), ms['Profit'].std()

(0.08531529858429848, 0.6727636361993468)

Hence our average daily profit is about 8.5 cents per share. If the transaction cost is less than that, will we make money for sure? Not necessarily: the model was evaluated on the same data it was fitted on, and the parameters (model) that are best on the training data may not be the best on test data.

4.3 Evaluating your model is more important


We notice that building the model is straightforward and takes only one line of code. However, we spend more time creating and selecting the predictors included in our model.
The most important step is to evaluate the performance of your model correctly. In the example above, the signal ('Y_predict') is too good to be true.
ms.shape

(777, 17)

We use 777 days to build a linear regression model and then apply this model to the same 777 days. This is not right. We should separate the data into a training set and a test set, as described in Section 3.3, build the model on the training data, and validate the model and the strategy on the test data. I will leave this as your first assignment.


If we use all historical data to build a regression model and the average daily profit in the historical data is, for example, $1 per day, we do not know whether the performance of the model will be similar on new data. In other words, we cannot evaluate the performance of the model correctly.
Instead, we separate all historical data into a training set and a test set. For example, suppose we have 100 days of history in total. We use the first 80 days as training data, which is used for model building, and keep the remaining 20 days as test data. After getting a model (which describes the relationship between the target Y and the predictors X1, X2, ...) from the training data, we use this model to make predictions on both the training and the test data. We can evaluate either the accuracy or fit of the model (we will cover the details in Part III) or the return obtained if you make decisions based on your model, on both sets.
We say that the performance of the model is consistent if the return or fit of the model is similar on the training and test data.
Otherwise, if the performance differs significantly between training and test data, the model is said to be overfitting.
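The procedure described above can be sketched on toy data. This is only an illustration under stated assumptions: the data are synthetic (one hypothetical predictor x and a target y generated with random noise), the model is a one-predictor least-squares line written out by hand rather than the eight-predictor model of this chapter, and the helper names fit_line and average_profit are made up for the example.

```python
import random

random.seed(0)  # fixed seed so that the sketch is reproducible

# Hypothetical toy data: 100 "days", one predictor x, target y = 0.5 * x + noise
n_days = 100
x = [random.gauss(0, 1) for _ in range(n_days)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# Time-ordered split: first 80 days for training, last 20 days for testing
split = int(n_days * 0.8)
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def average_profit(xs, ys, intercept, slope):
    """Average profit of 'trade only when the predicted change is positive'."""
    profits = [b if intercept + slope * a > 0 else 0.0 for a, b in zip(xs, ys)]
    return sum(profits) / len(profits)

intercept, slope = fit_line(x_train, y_train)   # fit on the training data only
train_profit = average_profit(x_train, y_train, intercept, slope)
test_profit = average_profit(x_test, y_test, intercept, slope)
print("train:", train_profit, "test:", test_profit)
```

The important design choice is that the intercept and slope are estimated from the first 80 days only; the last 20 days are used purely for evaluation, so a large gap between the two averages would point to an inconsistent model.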
That is why we need to study statistics in a more systematic way, in order to understand:
• Random variables and distributions
• Association of two variables
• Hypothesis testing and significance level
• Evaluation of linear regression models.


Assignment: Is Prediction Model Really Profitable


First load all necessary modules
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
from sklearn import linear_model
import statsmodels.api as smf
import warnings
warnings.filterwarnings('ignore')

The last two lines hide warning messages when we run the code.

Background
In this chapter, we have built a regression model for the change of the close price of Microsoft:
ms = pd.read_csv('microsoft.csv', index_col=0, parse_dates=True)
ms['Y'] = ms['Close'].shift(-1) - ms['Close']
ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)
ms = ms.dropna(axis=0)

Please note that, to read the data successfully, you need to put "microsoft.csv" and the Assignment notebook in the same folder.

Problem 1.
Divide the data into training (80%) and test (20%) sets.

Problem 2.
Build a linear regression model for Y using X1, ..., X8 with the training data. Make predictions on the training and test data.

Problem 3.
Compute the average daily profit on the training data and the test data using the following signal-based strategy.
• if Y_predict > 0, buy today and sell tomorrow


• else, do nothing.

Is the performance of your prediction model consistent? (Consistency means that it has similar performance on training and test data.)

Problem 4.
At the end of Section 4.2, we suggested that if the average daily profit is higher than the transaction cost, the strategy can be implemented. (Now we know that consistency of the model is also a necessary condition.)
However, the average daily profit also counts the days on which we do not trade. To evaluate the implementability more precisely, we need to compute the average daily profit over only those days on which we trade. Could you compute this adjusted average profit for the training and test periods?


Appendix
Launching Jupyter notebook
The Jupyter Notebook App can be launched by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu (Windows) or by typing the following command in a terminal (cmd on Windows):

jupyter notebook

Figure 18: A snapshot of jupyter notebook

This will launch a new browser window (or a new tab) showing the Notebook Dashboard, a sort of control panel that allows you (among other things) to select which notebook to open.
When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folder). If you store the notebook documents in a subfolder of your user folder, no configuration is necessary. Otherwise, you need to choose a folder which will contain all the notebooks and set this as the Jupyter Notebook App start-up folder.

Basics of Python
Python is a high-level programming language. Compared with languages such as C++, Java or Scala, it hides more of the low-level details, so you do not need to care about how the hardware is managed.
Python is considered one of the easiest programming languages to learn. Compared to R, which is another high-level language for data analytics, it is also considered to be more friendly.
For beginners, it is crucial to know the different classes (types) of objects (data), and to understand that your data can be loaded or packed in different types.
Core Python consists of all data types, with their methods and attributes, that are available without importing any modules. It has the following data types:


• integer

• float

• string

• bool

• list

• tuple

• dictionary

• set

For this book, we will use integer, float, string, bool, list and tuple.

Integer, float

a = 10
b = 10.1

Python recognizes that a is an integer and b is a float.

a.bit_length()

b.real

10.1

a and b have different data types (classes), hence they have different methods and attributes. bit_length() is a built-in method of integers and .real is a built-in attribute of float numbers.
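As a small illustration of how the type of an object determines what you can do with it, the following sketch inspects the two variables defined above. It uses only built-in Python features.

```python
a = 10
b = 10.1

print(type(a).__name__)   # int
print(type(b).__name__)   # float
print(a.bit_length())     # 4, because 10 is written 1010 in binary
print(b.is_integer())     # False, because 10.1 has a fractional part
```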

String
A string is content enclosed in double quotes or single quotes.
c = "I am handsome"
d = "123"

d is a string, not an integer.

d + 3


--------------------------------------------------------------------------
TypeError: must be str, not int

You get an error message. We can use str(3) to convert the integer 3 into the string "3". Then we have
d + str(3)

and the output is '1233'. Hence adding two strings concatenates them. Strings are very useful in natural language processing. In this book, strings are mainly used for the names of columns (variables).
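The two possible conversions can be put side by side. This short sketch reuses the string d from above and shows that converting the number to a string gives concatenation, while converting the string to a number gives arithmetic.

```python
d = "123"

as_text = d + str(3)     # string + string: concatenation
as_number = int(d) + 3   # integer + integer: arithmetic

print(as_text)     # 1233
print(as_number)   # 126
```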

Lists and tuples


A list is one of the major data types we will use. It is a collection of elements that can have any data types.
aList = [10, 12.06, "Tiger", False]

Lists use square brackets. Lists have many built-in methods, and the most often used one is "append".

aList.append("Katy")  # in place
aList.append("Tom")
aList.append("Katy")

aList

[10, 12.06, 'Tiger', False, 'Katy', 'Tom', 'Katy']

• You have a list of seven elements. A list is an ordered collection of elements enclosed in square brackets.
• Python uses 0-based indexing, so the first element of any non-empty list, e.g. aList, is always aList[0].
• The last element of this seven-element list is aList[6], because lists are always zero-based.

aList[-1]

'Katy'

• A negative index accesses elements from the end of the list, counting backwards. The last element of any non-empty list, e.g. aList, is always aList[-1].

aList[2:5]

['Tiger', False, 'Katy']


• You can get a subset of a list, called a "slice", by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with the first slice index (in this case aList[2]), up to but not including the second slice index (in this case aList[5]).
• Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don't want. The return value is everything in between.
• Lists are zero-based, so aList[0:3] returns the first three elements of the list, starting at aList[0], up to but not including aList[3].

aList [0:3]

[10, 12.06, 'Tiger']
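Slices with negative or omitted indices follow the same "first element you want, first element you don't want" rule. The sketch below applies it to the seven-element list shown above; the variable names are illustrative.

```python
aList = [10, 12.06, 'Tiger', False, 'Katy', 'Tom', 'Katy']

last_three = aList[-3:]      # from the third-last element to the end
middle = aList[1:-1]         # everything except the first and last elements
first_three = aList[:3]      # omitting the start index means "from the beginning"

print(last_three)    # ['Katy', 'Tom', 'Katy']
print(middle)        # [12.06, 'Tiger', False, 'Katy', 'Tom']
print(first_three)   # [10, 12.06, 'Tiger']
```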

A tuple is similar to a list in the sense that it is a collection of elements of all kinds of types (classes).
aTuple = (10, 12.06, "Tiger", False)

However, tuples use parentheses, whereas lists use square brackets. A tuple cannot be changed after it is defined.
aTuple[0] = 20

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-2-4253fec5e5d3> in <module>()
----> 1 aTuple[0]=20

TypeError: 'tuple' object does not support item assignment

In contrast, an element of a list can be reassigned:
aList[0] = 10000
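The difference between the two types can be checked in a single cell. This sketch catches the error raised by the tuple so that the code keeps running, and then shows that the same assignment succeeds on a list.

```python
aTuple = (10, 12.06, "Tiger", False)
aList = [10, 12.06, "Tiger", False]

aList[0] = 10000              # works: lists are mutable

try:
    aTuple[0] = 20            # fails: tuples are immutable
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("TypeError:", err)

print(aList)    # [10000, 12.06, 'Tiger', False]
print(aTuple)   # (10, 12.06, 'Tiger', False)
```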


Part II
Variables, Samples and Statistical Inferences


5 Sample and Population


We need to import the following modules for this section:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

5.1 Population and Sample


The field of inferential statistics enables you to make educated guesses about the numerical characteristics of large groups. The logic of sampling gives you a way to test conclusions about such groups using only a small portion of their members.
A population is a group of phenomena that have something in common. The term often refers
to a group of people, as in the following examples:
• All registered voters in Thailand
• All members of US Institute of Mathematical Statistics
• All Hong Kong citizens who played golf at least once in the past year
But populations can refer to things as well as people:

• All daily maximum temperatures in July for major Chinese cities
• All neurons from your brain
Often, researchers want to know things about populations but do not have data for every per-
son or thing in the population. If a company’s customer service division wanted to learn whether
its customers were satisfied, it would not be practical (or perhaps even possible) to contact every
individual who purchased a product. Instead, the company might select a sample of the popula-
tion.
A sample is a smaller group of members of a population selected to represent the population.
In order to use statistics to learn things about the population, the sample must be random.
A random sample is one in which every member of a population has an equal chance of being
selected. The most commonly used sample is a simple random sample. It requires that every
possible sample of the selected size has an equal chance of being used.

Figure 19: Population and sample


Two kinds of random sampling are often used:
• When a population element can be selected more than one time, we are sampling with replacement.
• When a population element can be selected only one time, we are sampling without replacement.
For example, we consider the collection of scores of the first assignment as a population.
data = pd.DataFrame()
data['Population'] = [47, 85, 41, 3, 15, 46, 35, 43, 92,
                      45, 59, 35, 20, 81, 30, 33, 6, 12,
                      38, 10, 11, 48, 4, 99, 62, 72, 15,
                      8, 31, 37, 21, 72, 90, 51, 97, 66,
                      5, 22, 73, 59, 57, 93, 53, 31, 20,
                      82, 20, 39, 82, 22, 28, 56, 94, 73,
                      95, 59, 53, 11, 71, 85, 20, 57, 88]

We will apply the sample method of a DataFrame column to get a sample.

a_sample_without_replacement = data['Population'].sample(10, replace=False)

a_sample_without_replacement

10 59
54 95
28 31
5 46
45 82
20 11
16 6
61 57
8 92
62 88
Name: Population, dtype: int64

The parameter "replace" determines whether the sample is obtained with or without replacement.

a_sample_with_replacement = data['Population'].sample(10, replace=True)
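The difference between the two sampling schemes can also be seen without pandas. The following sketch uses Python's standard random module on the first nine scores of the population above: random.sample never picks the same position twice, while random.choices may. The fixed seed and the variable names are assumptions made only for this illustration.

```python
import random

random.seed(1)  # fixed seed, only to make the sketch reproducible
scores = [47, 85, 41, 3, 15, 46, 35, 43, 92]   # first nine scores of the population above
positions = range(len(scores))

without_replacement = random.sample(positions, k=5)   # each position at most once
with_replacement = random.choices(positions, k=5)     # a position may be drawn again

print([scores[i] for i in without_replacement])
print([scores[i] for i in with_replacement])
```

When sampling without replacement the sample size can never exceed the population size, whereas with replacement any sample size is possible.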

5.2 Parameter and Statistics


A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Statistical
inference enables you to make an educated guess about a population parameter based on a statistic
computed from a sample randomly drawn from that population


Figure 20: Parameter and statistic

For example, say you want to know the mean income of the subscribers to a particular magazine, which is a parameter of a population. You draw a random sample of 100 subscribers and determine that their mean income is $27,500 (a statistic). You conclude that the population mean income µ is likely to be close to $27,500 as well. This example is one of statistical inference.
Different symbols are used to denote statistics and parameters, as the following table shows.

                      Sample Statistic    Population Parameter
Mean                  $\bar{x}$           $\mu$
Variance              $s^2$               $\sigma^2$
Standard deviation    $s$                 $\sigma$

Using the example above, we have the following population parameters.

print("Population mean is", data['Population'].mean())
print("Population variance", data['Population'].var(ddof=0))
print("Population standard deviation is",
      data['Population'].std(ddof=0))
print("Population size is", data['Population'].shape[0])

Population mean is 47.74603174603175


Population variance 812.5069286974048
Population standard deviation is 28.504507164611788
Population size is 63

These values do not change from sample to sample. However, the statistics computed from a sample do:
a_sample = data['Population'].sample(10, replace=True)
print("Sample mean is", a_sample.mean())
print("Sample variance", a_sample.var(ddof=1))
print("Sample standard deviation is", a_sample.std(ddof=1))
print("Sample size is", a_sample.shape[0])


Sample mean is 45.7


Sample variance 632.9
Sample standard deviation is 25.157503850740042
Sample size is 10

Notice that we use different values of the parameter "ddof" when we calculate the variance and standard deviation of a population (ddof=0) and of a sample (ddof=1). To see why, consider the following formulas for parameters and statistics, where $x_1, x_2, \dots, x_N$ are all items of the population and $x_1, x_2, \dots, x_n$ are items randomly drawn from the population, which make up a sample. $n$ is the sample size and $N$ is the population size. We have

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}, \qquad \mu = \frac{x_1 + x_2 + \dots + x_N}{N}$$

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}, \qquad \sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N}$$

These formulas are lengthy. Statisticians use $\Sigma$ to denote summation:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
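The formulas above translate directly into code. The sketch below evaluates them by hand for a small hypothetical sample and checks the results against Python's standard statistics module, whose variance divides by n − 1 and whose pvariance divides by the number of items. The sample values and variable names are chosen only for illustration.

```python
import math
import statistics

sample = [47, 85, 41, 3, 15]          # hypothetical sample of n = 5 scores
n = len(sample)

x_bar = sum(sample) / n                                   # mean
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)      # divide by n - 1 (ddof=1)
var_n = sum((x - x_bar) ** 2 for x in sample) / n         # divide by n (ddof=0)

# The standard library implements the same two formulas
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(var_n, statistics.pvariance(sample))
print(x_bar, s2, var_n)
```

The two results differ only by the factor (n − 1)/n, which is why the choice of ddof matters most for small samples.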
A question often asked by beginners is why the sample variance is divided by n − 1 instead of n. We first need to explain what estimators and unbiased estimators are.
An estimator is a statistic of a sample intended to approximate a population parameter.
An unbiased estimator is an estimator whose average over all possible samples is identical to the population parameter. Otherwise the estimator is called biased.
For example, we can take samples with replacement from the population, compute the sample variance with ddof equal to 1 (denominator n − 1) for each sample, and then take the average. In this way we can check whether the sample variance is an unbiased estimator of the population variance.
sample_variance_collection = []
# draw 200 samples of size 50 and store the variance of each
for _ in range(200):
    a_sample = data['Population'].sample(50, replace=True)
    sample_variance_collection.append(a_sample.var(ddof=1))

The loop draws 200 samples, so "sample_variance_collection" holds 200 sample variances.
print("totally, we get", len(sample_variance_collection), "sample variance")
# len gives the number of elements in a list


totally, we get 200 sample variance

sample_variance_collection1 = pd.DataFrame(sample_variance_collection)
print("Average of sample variance is",
      sample_variance_collection1[0].mean())
print("Population variance is", data['Population'].var(ddof=0))

Average of sample variance is 809.3548102040818


Population variance is 812.5069286974048

We find that the average of the sample variances is very close to the population variance. We can expect that, as we continue to draw samples, the average will converge to the true population variance. Readers can check that the average of the sample variances computed with ddof=0 will not converge to the population variance.
Hence the sample variance with denominator n − 1 is an unbiased estimator.
More intuitively, the sample mean is by construction the center of the sample data, and it may differ from the population mean. Hence the squared deviations of the sample items from the sample mean are, in total, smaller than those from the population mean. We therefore divide by n − 1, a smaller number than n, to compensate and better approximate the population variance.
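The check left to the reader can be sketched with the standard library alone. Under the assumptions of this illustration (a fixed random seed, 2000 repetitions instead of 200, samples of size 50 drawn with replacement), the average of the variances computed with denominator n − 1 should land near the population variance, while the average computed with denominator n should stay below it by roughly the factor (n − 1)/n.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is reproducible

population = [47, 85, 41, 3, 15, 46, 35, 43, 92,
              45, 59, 35, 20, 81, 30, 33, 6, 12,
              38, 10, 11, 48, 4, 99, 62, 72, 15,
              8, 31, 37, 21, 72, 90, 51, 97, 66,
              5, 22, 73, 59, 57, 93, 53, 31, 20,
              82, 20, 39, 82, 22, 28, 56, 94, 73,
              95, 59, 53, 11, 71, 85, 20, 57, 88]
pop_var = statistics.pvariance(population)   # divides by N, like var(ddof=0)

n, repeats = 50, 2000
unbiased, biased = [], []
for _ in range(repeats):
    sample = random.choices(population, k=n)      # sampling with replacement
    unbiased.append(statistics.variance(sample))  # denominator n - 1
    biased.append(statistics.pvariance(sample))   # denominator n

mean_unbiased = statistics.mean(unbiased)
mean_biased = statistics.mean(biased)
print("population variance:", pop_var)
print("average with n - 1 :", mean_unbiased)
print("average with n     :", mean_biased)
```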

5.3 Identify Population and Sample

rb = pd.read_csv("data/Rb2018.csv", index_col=0, parse_dates=True)

rb.head()

AskPrice BidPrice AskVolume BidVolume Price \


NdateTime
2018-02-06 21:00:00.500 3948.0 3945.0 199.0 4.0 3945.0
2018-02-06 21:00:01.000 3948.0 3946.0 352.0 132.0 3946.0
2018-02-06 21:00:01.500 3947.0 3946.0 16.0 6.0 3946.0
2018-02-06 21:00:02.000 3947.0 3946.0 5.0 95.0 3946.0
2018-02-06 21:00:02.500 3948.0 3946.0 191.0 37.0 3948.0

Volume Turnover
NdateTime
2018-02-06 21:00:00.500 762.0 3007738.0
2018-02-06 21:00:01.000 2450.0 9668594.0
2018-02-06 21:00:01.500 1106.0 4363998.0
2018-02-06 21:00:02.000 540.0 2130974.0
2018-02-06 21:00:02.500 424.0 1673534.0

"rb" is an abbreviation of "reinforcing steel bar, rebar", which is traded on the Shanghai Futures Exchange. The details of Rb contracts can be found here (in Chinese). It is traded from 21:00-23:00, 9:00-10:15, 10:30-11:30 and 13:30-15:00 on workdays.

Figure 21: Reinforcing steel bar, rebar

Every 500 milliseconds, the exchange releases trading information, including

• AskPrice: the lowest price to sell
• BidPrice: the highest price to buy
• AskVolume: number of shares waiting to sell at AskPrice
• BidVolume: number of shares waiting to buy at BidPrice
• Volume: number of shares traded during 500 milliseconds
• Turnover: the amount of money spent in trading during 500 milliseconds
• Price: the price of the last trade.

We compute the return of Price:

$$\mathrm{Return}_t = \frac{\mathrm{Price}_t - \mathrm{Price}_{t-1}}{\mathrm{Price}_{t-1}}$$

rb['Return'] = rb['Price'].pct_change()

rb.head()


AskPrice BidPrice AskVolume BidVolume Price \


NdateTime
2018-02-06 21:00:00.500 3948.0 3945.0 199.0 4.0 3945.0
2018-02-06 21:00:01.000 3948.0 3946.0 352.0 132.0 3946.0
2018-02-06 21:00:01.500 3947.0 3946.0 16.0 6.0 3946.0
2018-02-06 21:00:02.000 3947.0 3946.0 5.0 95.0 3946.0
2018-02-06 21:00:02.500 3948.0 3946.0 191.0 37.0 3948.0

Volume Turnover Return


NdateTime
2018-02-06 21:00:00.500 762.0 3007738.0 NaN
2018-02-06 21:00:01.000 2450.0 9668594.0 0.000253
2018-02-06 21:00:01.500 1106.0 4363998.0 0.000000
2018-02-06 21:00:02.000 540.0 2130974.0 0.000000
2018-02-06 21:00:02.500 424.0 1673534.0 0.000507
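The return formula can be verified by hand on the first five prices in the table. This sketch uses plain Python lists instead of pandas; the first return is undefined because there is no previous price, which is why pandas shows NaN in the first row.

```python
prices = [3945.0, 3946.0, 3946.0, 3946.0, 3948.0]  # first five Price values above

# Return_t = (Price_t - Price_{t-1}) / Price_{t-1}; undefined for the first row
returns = [None] + [(prices[t] - prices[t - 1]) / prices[t - 1]
                    for t in range(1, len(prices))]

print([None if r is None else round(r, 6) for r in returns])
# [None, 0.000253, 0.0, 0.0, 0.000507]
```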

In this example, the collection of all observed returns is a sample, and the sample size is equal to

rb.shape[0]

414020

The sample size is huge, but it is still a sample. What is the population in this example? It is the infinite collection of all possible returns that our sample could have come from. We do not know the mean, variance or proportions of the values in the population. But with this large sample, we can make inferences (meaningful guesses) about them. To do that, we need to explore our sample to get more knowledge about this set of data.


6 Variables, Frequency and Distribution

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

'numpy' is another important module. It can generate arrays of random numbers from a given distribution, perform vectorized (column-wise) scientific computation, etc.
6.1 Variables, Cases and DataFrame

Figure 22: shells

The figure above shows a small collection of sea shells gathered on a beach. All the shells in the collection are similar: small disk-shaped shells with a hole in the center. But the shells also differ from one another in overall size and weight, in color, in smoothness, in the size of the hole, etc.
Any data set is something like the shell collection. It consists of cases (observations): the objects in the collected sample.
Each case has one or more attributes or qualities, called variables. The word "variable" emphasizes that it is differences or variation that is often of primary interest. Usually, there are many possible variables. The researcher chooses those that are of interest, often drawing on detailed knowledge of the system that is under study.
The researcher measures or observes the value of each variable for each case. The result is a table, also known as a data frame: a sort of spreadsheet. Within the data frame, each row refers to one case, each column to one variable.

Figure 23: dataset

4.3, 12.0, 3.8 are called observed values of the variable "diameter".

6.1.1 Kinds of Variables


Most people tend to think of data as numeric, but variables can also be descriptions, as the sea shell collection illustrates. The two basic types of data are:

• Numerical: Naturally represented by a number, for instance diameter, weight, temperature, age, and so on. Numerical variables can be further classified as discrete and continuous.

• Categorical: A description that can be put simply into words or categories, for instance male versus female or red vs green vs yellow, and so on.

The distinction between numerical and categorical variables is not always clear-cut. Some categorical variables have values that have a natural order. For example, a categorical variable for temperature might have values such as "cold," "warm," "hot," "very hot." Variables like this are called ordinal. Opinion surveys often ask for a choice from an ordered set such as this: strongly disagree, disagree, no opinion, agree, strongly agree.
rb = pd.read_csv("data/Rb2018.csv", index_col=0, parse_dates=True)
rb.head()


AskPrice BidPrice AskVolume BidVolume Price \


NdateTime
2018-02-06 21:00:00 39480.0 39450.0 199.0 4.0 39450.0
2018-02-06 21:00:30 39460.0 39450.0 184.0 224.0 39460.0
2018-02-06 21:01:00 39450.0 39440.0 8.0 83.0 39440.0
2018-02-06 21:01:30 39510.0 39500.0 5.0 730.0 39510.0
2018-02-06 21:02:00 39480.0 39460.0 236.0 315.0 39480.0

Volume Turnover
NdateTime
2018-02-06 21:00:00 762.0 3007738.0
2018-02-06 21:00:30 56.0 220954.0
2018-02-06 21:01:00 290.0 1144022.0
2018-02-06 21:01:30 468.0 1849118.0
2018-02-06 21:02:00 38.0 149992.0
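Next we compute VAP, the volume adjusted price:

$$\mathrm{VAP} = \frac{\mathrm{AskPrice} \times \mathrm{BidVolume} + \mathrm{BidPrice} \times \mathrm{AskVolume}}{\mathrm{AskVolume} + \mathrm{BidVolume}}$$

# the denominator follows the formula above
rb['VAP'] = (rb['BidPrice'] * rb['AskVolume'] + rb['AskPrice'] * rb['BidVolume']) / (rb['AskVolume'] + rb['BidVolume'])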
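As a quick check of the VAP formula, the following sketch evaluates it for the quote at 21:00:30 in the table above. The helper function vap is introduced only for this illustration and is not part of pandas.

```python
def vap(ask_price, bid_price, ask_volume, bid_volume):
    """Volume adjusted price: each price is weighted by the opposite side's volume."""
    return (ask_price * bid_volume + bid_price * ask_volume) / (ask_volume + bid_volume)

# Quote at 2018-02-06 21:00:30 from the table above
value = vap(ask_price=39460.0, bid_price=39450.0, ask_volume=184.0, bid_volume=224.0)
print(round(value, 6))  # 39455.490196
```

Because more volume is waiting on the bid side (224) than on the ask side (184), the result lies closer to the ask price than to the bid price.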

We compute the return of VAP:

rb['Return'] = rb['VAP'].pct_change()
rb['Direction'] = ['Up' if x > 0 else 'Down' for x in rb['Return']]

rb = rb.dropna(axis=0)

rb.head()

AskPrice BidPrice AskVolume BidVolume Price \


NdateTime
2018-02-06 21:00:30 39460.0 39450.0 184.0 224.0 39460.0
2018-02-06 21:01:00 39450.0 39440.0 8.0 83.0 39440.0
2018-02-06 21:01:30 39510.0 39500.0 5.0 730.0 39510.0
2018-02-06 21:02:00 39480.0 39460.0 236.0 315.0 39480.0
2018-02-06 21:02:30 39490.0 39480.0 141.0 598.0 39490.0

Volume Turnover VAP Return Direction


NdateTime
2018-02-06 21:00:30 56.0 220954.0 39455.490196 0.000124 Up
2018-02-06 21:01:00 290.0 1144022.0 39449.120879 -0.000161 Down
2018-02-06 21:01:30 468.0 1849118.0 39509.931973 0.001542 Up
2018-02-06 21:02:00 38.0 149992.0 39471.433757 -0.000974 Down
2018-02-06 21:02:30 24.0 94772.0 39488.092016 0.000422 Up


Direction is a categorical variable, VAP and Return are continuous numerical variables, and the others are discrete numerical variables.

6.2 Exploratory Data Analysis

The "data" or "data set" we often mention is a data frame consisting of multiple cases of one or more variables. The first step of data analysis is so-called exploratory data analysis.

6.2.1 Numerical Descriptive Statistics

We can compute numerical descriptive measures of three aspects:
• Central tendency: mean, median, mode
• Variation: variance, standard deviation, range, interquartile range
• Distribution: tail, skewness, Q-Q plot, extreme events and heavy-tailed distributions
We will explain distribution after we learn about the histogram, which visualizes the distribution.

rb.mode()


AskPrice BidPrice AskVolume BidVolume Price Volume Turnover \


0 39230.0 39350.0 1.0 1.0 39210.0 0.0 0.0
1 39360.0 NaN NaN NaN 39230.0 NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
6899 NaN NaN NaN NaN NaN NaN NaN
6900 NaN NaN NaN NaN NaN NaN NaN
6901 NaN NaN NaN NaN NaN NaN NaN
6902 NaN NaN NaN NaN NaN NaN NaN
6903 NaN NaN NaN NaN NaN NaN NaN
6904 NaN NaN NaN NaN NaN NaN NaN
6905 NaN NaN NaN NaN NaN NaN NaN
6906 NaN NaN NaN NaN NaN NaN NaN
6907 NaN NaN NaN NaN NaN NaN NaN
6908 NaN NaN NaN NaN NaN NaN NaN
6909 NaN NaN NaN NaN NaN NaN NaN
6910 NaN NaN NaN NaN NaN NaN NaN
6911 NaN NaN NaN NaN NaN NaN NaN
6912 NaN NaN NaN NaN NaN NaN NaN
6913 NaN NaN NaN NaN NaN NaN NaN
6914 NaN NaN NaN NaN NaN NaN NaN
6915 NaN NaN NaN NaN NaN NaN NaN
6916 NaN NaN NaN NaN NaN NaN NaN
6917 NaN NaN NaN NaN NaN NaN NaN
6918 NaN NaN NaN NaN NaN NaN NaN
6919 NaN NaN NaN NaN NaN NaN NaN

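# numeric_only=True skips the text column 'Direction' (needed in recent pandas versions)
rb.mean(numeric_only=True)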

AskPrice 39369.937942
BidPrice 39359.812383
AskVolume 391.124116
BidVolume 383.348102
Price 39364.801559
Volume 65.310723
Turnover 257289.964497
VAP 39364.851415
Return 0.000003
dtype: float64

rb.median(numeric_only=True)

AskPrice 3.930000e+04
BidPrice 3.929000e+04
AskVolume 3.130000e+02
BidVolume 3.130000e+02
Price 3.929000e+04
Volume 8.000000e+00
Turnover 3.155200e+04
VAP 3.929283e+04
Return 5.855534e-07
dtype: float64
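The mean can be strongly affected by extreme values (extremely large or small values), whereas the median and the mode are much less sensitive to them. For example, the mean of Volume (about 65.3) is far above its median (8).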
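A tiny example makes the point concrete. In this sketch, with made-up numbers, replacing the largest value by an extreme one moves the mean a long way while the median stays where it was.

```python
import statistics

values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]   # same data, but the largest value is extreme

print(statistics.mean(values), statistics.median(values))              # 3 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 102 3
```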

rb.std(ddof=1, numeric_only=True)

AskPrice 4.800383e+02
BidPrice 4.800329e+02
AskVolume 3.520719e+02
BidVolume 3.450304e+02
Price 4.799575e+02
Volume 6.468859e+02
Turnover 2.568355e+06
VAP 4.800255e+02
Return 3.469254e-04
dtype: float64


The standard deviation is also affected by extreme values. To measure the variation of the data more robustly, we can use the interquartile range. First let us compute a quantile; the 0.5 quantile is the median.
rb.quantile(0.5, numeric_only=True)

AskPrice 3.930000e+04
BidPrice 3.929000e+04
AskVolume 3.130000e+02
BidVolume 3.130000e+02
Price 3.929000e+04
Volume 8.000000e+00
Turnover 3.155200e+04
VAP 3.929283e+04
Return 5.855534e-07
Name: 0.5, dtype: float64

The interquartile range (IQR) is the difference between the 75% quantile and the 25% quantile, and it is used to describe the variation of the data. Unlike the standard deviation or variance, it is hardly affected by extreme values.
IQR = rb.quantile(0.75, numeric_only=True) - rb.quantile(0.25, numeric_only=True)
print(IQR)

AskPrice 360.000000
BidPrice 360.000000
AskVolume 360.000000
BidVolume 349.000000
Price 350.000000
Volume 34.000000
Turnover 133352.000000
VAP 352.470772
Return 0.000229
dtype: float64
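The contrast with the standard deviation can be checked on a small example. The sketch below uses made-up numbers (not the rb data) and a hypothetical helper named iqr; it replaces the largest observation with an extreme value and recomputes both measures.

```python
import pandas as pd

def iqr(s):
    """Interquartile range: 75% quantile minus 25% quantile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical toy data, not taken from the rb dataset
base = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Replace the largest observation with an extreme value
extreme = base.copy()
extreme.iloc[-1] = 900.0

print("std:", base.std(ddof=1), "->", extreme.std(ddof=1))
print("IQR:", iqr(base), "->", iqr(extreme))
```

Here the standard deviation grows by a factor of about one hundred, while the interquartile range stays at 4, because the 25% and 75% quantiles do not depend on how far out the largest value lies. If many observations were extreme, the quantiles themselves would eventually move.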

We plot Return as an example to show where the 25% and 75% quantiles are located.


plt.figure(figsize=(15, 5))
plt.hist(rb['Return'], bins=100)
plt.axvline(x=rb['Return'].quantile(0.25), color='r')
plt.axvline(x=rb['Return'].quantile(0.75), color='r')
plt.show()


Figure 24: Demonstration of interquartile range for Return.

6.2.2 Data Visualization


Bar charts and pie charts are the two classic ways of plotting categorical variables. The
following are two examples.
rb['Direction'].value_counts().plot(kind='pie')

Figure 25: Pie chart.

rb['Direction'].value_counts().plot(kind='bar')


Figure 26: Bar chart.

For numerical variables, we have the histogram, the polygon (line plot), the boxplot, etc.


rb['Return'].plot(kind='hist', bins=100)
plt.show()

Figure 27: Histogram

The boxplot is an important visualization method for numerical data. In descriptive statistics, a
box plot (or boxplot) is a convenient way of graphically depicting groups of numerical data through
their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers)
indicating variability outside the upper and lower quartiles, hence they are also called
box-and-whisker plots or box-and-whisker diagrams.
rb['Return'].plot(kind='box')
plt.show()

Figure 28: Boxplot
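The quantities that a boxplot draws can also be computed by hand. The following is a minimal sketch on made-up returns (not the rb data), assuming the common convention, which is also the matplotlib default, that whiskers reach the most extreme observations within 1.5 times the interquartile range of the box; points beyond that are drawn individually as outliers.

```python
import pandas as pd

# Hypothetical toy returns, not taken from the rb dataset
returns = pd.Series([-0.9, -0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 1.5])

# The box spans the first to the third quartile; the line inside is the median
q1, med, q3 = returns.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = returns[(returns >= lower_fence) & (returns <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the fences are plotted as individual points
outliers = returns[(returns < lower_fence) | (returns > upper_fence)]

print("box:", q1, med, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers.tolist())
```

With these toy numbers the whiskers stop at -0.2 and 0.3, and the two values -0.9 and 1.5 are flagged as outliers.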

rb['Price'].plot()


Figure 29: Polygon

6.2.3 Relative Frequency, Distribution and Shape of Distribution


The values that appear in a sample (or population) occur in certain proportions. In a sample,
the counts are called frequencies, and the proportions are called relative frequencies (often given
in percent). The proportions of the different values of a variable in the population are called
its distribution.
We can use a histogram to visualize the distribution, or value_counts() (for discrete or
categorical variables) to get a frequency table for any variable.
n, bins, patch = plt.hist(rb['Return'], bins=10)
plt.show()


Figure 30: Histogram with the output of frequency and bars

bins

array([-0.00608911, -0.0045047 , -0.00292028, -0.00133587,  0.00024855,
        0.00183297,  0.00341738,  0.0050018 ,  0.00658622,  0.00817063,
        0.00975505])

bins gives the x-coordinates of the bin edges: 11 edges for the 10 bins.


n

array([ 2.00000000e+00,  1.00000000e+00,  2.40000000e+01,
        6.03700000e+03,  8.56000000e+02,  6.00000000e+00,
        2.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        1.00000000e+00])

n contains the frequency (the number of observations) for each bar.
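Frequencies can be turned into relative frequencies by dividing by the total count. The sketch below uses made-up data (not the rb data): np.histogram computes counts and bin edges without drawing a figure, and value_counts(normalize=True) gives the relative frequency table of a categorical variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the rb dataset
returns = np.array([-0.02, -0.01, -0.01, 0.0, 0.0, 0.0, 0.01, 0.01, 0.02, 0.03])

# Counts per bin and the bin edges (one more edge than bins)
n, bins = np.histogram(returns, bins=5)

# Relative frequencies: proportions that sum to 1
rel_freq = n / n.sum()
print(rel_freq)

# For a categorical variable, normalize=True returns proportions instead of counts
direction = pd.Series(["up", "down", "up", "flat", "up"])
freq_table = direction.value_counts(normalize=True)
print(freq_table)
```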


As the sample size increases, the relative frequencies in the sample converge to the relative
frequencies in the population, which are called the "distribution".
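This convergence can be illustrated with a small simulation. The sketch below is not based on the rb data: it assumes a hypothetical population in which a proportion p_up = 0.3 of price moves are "up", draws samples of increasing size, and compares the sample relative frequency with p_up.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
p_up = 0.3                       # assumed population proportion of "up" moves

errors = {}
for size in [100, 10_000, 1_000_000]:
    sample = rng.random(size) < p_up      # True with probability p_up
    rel_freq = sample.mean()              # relative frequency of True in the sample
    errors[size] = abs(rel_freq - p_up)   # distance from the population value
    print(size, rel_freq, errors[size])
```

For the largest sample the relative frequency should lie very close to 0.3; for small samples the gap is typically larger, although in any single run it is not guaranteed to shrink at every step.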


7 Probability, Distribution and Expectation


8 Association and Variable Selection


9 Sampling Distribution


10 Estimation of Stock Return and Volatility


11 Testing of Market Change


Part III
Prediction with Multiple Linear Regression


List of Figures
1 Maps of energy Consumption
2 Growth of unstructured data
3 High frequency data of stock price
4 Matching Index of "Data Science"
5 Matching index of "deep learning"
6 Daily close price of Facebook
7 Slow and fast signal.
8 Stock price and moving average processes.
9 Daily profit of signal-based strategy
10 Accumulated profit or wealth process of signal-based strategy
11 Improved profit with parameter-tuning
12 Performance of strategy in training data.
13 Performance of strategy in testing set.
14 Performance in train data with parameters tuned in train data.
15 Performance in test data with parameters tuned in train data
16 Average daily profit with regression model
17 Accumulated profit with regression model.
18 A snapshot of jupyter notebook
19 Population and sample
20 Parameter and statistic
21 Reinforcing steel bar, rebar
22 shells
23 dataset
24 Demonstration of interquartile range for Return.
25 Pie chart.
26 Bar chart.
27 Histogram
28 Boxplot
29 Polygon
30 Histogram with the output of frequency and bars
