
Arizona State University

W. P. Carey School of Business


Computer Information Systems

Oasis Project Report

Iago Franco Bacurau


Joao Pedro Santos de Moura
Pedro Silva Moreira

December 8, 2014

Contents
1 Introduction
2 Transactional Database Modeling
    2.1 Business Rules
    2.2 Entity-Relationship Diagram
    2.3 Tests
3 Dimensional Model Designs
4 Data Creation
    4.1 PHP Scripts
    4.2 Python Scripts
    4.3 Mockaroo Scripts
5 ETL Process
6 Remarks

List of Figures
1 ER Diagram
2 Test 1
3 Test 2
4 Test 3
5 Test 4
6 Policies Sales Star Schema
7 Monthly Sales Periodic Snapshot
8 Insurance Business Matrix
9 Mockaroo homepage
10 PHP script of ETL Part 1
11 PHP script of ETL Part 2
12 Monthly Sales Fact Data

Listings
1 PHP script to populate Policy Coverage table
2 Python script to populate Policy table
3 Python script to populate Vehicle table
4 SQL script for the ETL process of periodic snapshot

1 Introduction

The project is based on chapter 16 of the book The Data Warehouse Toolkit:
The Complete Guide to Dimensional Modeling. The authors discuss various types of insurance, such as homeowner and private property, but for this
project we focused on automobile insurance.
The first step was to create the business rules to back the development
of the ER diagram of the transactional database, which is the primary source
of data for the data warehouse. Next, star schemas were created for two
different purposes, and fake data was then generated via scripts
to simulate the ETL process after the implementation of both star schemas.
Finally, a report was generated as an example of the information that can be
extracted from the data model developed in this project.

2 Transactional Database Modeling

According to Kimball and Ross (2013), insurance companies share some common demands, such as measuring performance over time by coverage, covered
item, policyholder, and sales distribution channel characteristics. Some
enterprises are also engaged in many other external processes, such as the
investment of premium payments or the compensation of contract agents. Despite all these requirements, we concentrated on the core of the business,
which is related to policies and payments.

2.1 Business Rules

To support the development of the transactional database, the following fifteen business rules were created:
1. A customer can have zero or more policies
2. A vehicle must be owned by one and only one customer
3. Every vehicle must have one and only one model
4. A policy must have a customer, an agent and a vehicle
5. A customer must have an SSN, address and name
6. A customer must have a phone
7. An agent can have zero or more policies
8. A customer can have zero or more cars
9. An agent must have a name and an address
10. A policy can have one or more coverages
11. Customers cannot share the same SSN
12. A model must have a brand
13. An address must have a city
14. A city must have a state
15. A state must have a country
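
Several of these rules map directly onto declarative database constraints. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the project's database. The customer columns are hypothetical (the report does not list them); only the vehicle columns `plate` and `customer_id` appear in the report's own scripts.

```python
import sqlite3

# Hypothetical, simplified schema: the customer columns are assumed for illustration.
SCHEMA = """
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,          -- rule 5: a customer must have a name
    ssn   TEXT NOT NULL UNIQUE,   -- rules 5 and 11: SSN required and never shared
    phone TEXT NOT NULL           -- rule 6: a customer must have a phone
);
CREATE TABLE vehicle (
    id          INTEGER PRIMARY KEY,
    plate       TEXT NOT NULL,
    -- rule 2: exactly one owner, so the foreign key is mandatory
    customer_id INTEGER NOT NULL REFERENCES customer(id)
);
"""


def build_db():
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a second customer with an existing SSN, or a vehicle pointing at a missing customer, is rejected by the database itself rather than by application code.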

2.2 Entity-Relationship Diagram

The main entities are Vehicle, Customer, Agent and Policy. A bridge entity
called Policy Coverage was used because one policy can have more than one
coverage, and one type of coverage may appear in one or more policies. Based
on these fifteen business rules, the following Entity-Relationship diagram was
generated:

Figure 1: ER Diagram

2.3 Tests

To ensure that the transactional database conformed to the business rules, SQL queries
were executed against it for each rule.

Figure 2: Test 1

Figure 3: Test 2

Figure 4: Test 3

Figure 5: Test 4
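
The queries shown in the figures are not reproduced here. As an illustration of the general approach only, the sketch below phrases each check as a query that lists the rows violating a rule, so an empty result means the rule holds. The table layout, sample rows and use of Python's sqlite3 module are assumptions made for this example.

```python
import sqlite3

# Each check lists the rows that violate one business rule.
CHECKS = {
    "rule 11: customers cannot share an SSN":
        "SELECT ssn FROM customer GROUP BY ssn HAVING COUNT(*) > 1",
    "rule 2: every vehicle has an existing owner":
        "SELECT v.id FROM vehicle v "
        "LEFT JOIN customer c ON c.id = v.customer_id "
        "WHERE c.id IS NULL",
}


def run_checks(conn):
    """Return {rule description: list of violating rows}."""
    return {rule: conn.execute(sql).fetchall() for rule, sql in CHECKS.items()}


def sample_db():
    """Build a tiny unconstrained database containing one violation per rule."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER, ssn TEXT);
        CREATE TABLE vehicle (id INTEGER, customer_id INTEGER);
        INSERT INTO customer VALUES (1, '111'), (2, '222'), (3, '222');
        INSERT INTO vehicle VALUES (10, 1), (11, 7);
    """)
    return conn
```

Running `run_checks(sample_db())` reports the shared SSN and the vehicle whose owner does not exist.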

3 Dimensional Model Designs

The main purpose of our data warehouse is to support the analysis of the
number of policies sold, who sold them, and where they were sold. The grain is
one row per agent per policy, and based on that the following star schema was
created:

Figure 6: Policies Sales Star Schema


For the periodic snapshot, we identified the need to track sales at a
regular interval, and the monthly interval was the best option based on
the business process. The Agent dimension was the only conformed dimension
used, and the grain is one row per agent per month.

Figure 7: Monthly Sales Periodic Snapshot
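
A grain of one row per agent per month relies on a month dimension with one row per calendar month. The sketch below shows one way such rows could be generated. The column names `month_id`, `month_year` and `month_number_year` are taken from the ETL script in Listing 4; the sequential numbering of `month_id` and the helper names are assumptions.

```python
def build_month_dimension(first_year, last_year):
    """Return one row per calendar month as (month_id, month_year, month_number_year)."""
    rows = []
    month_id = 1  # assumed surrogate key: a simple running counter
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            rows.append((month_id, year, month))
            month_id += 1
    return rows


def to_insert_statements(rows):
    """Render the dimension rows as SQL INSERT statements."""
    return [
        "INSERT INTO month_dimension (month_id, month_year, month_number_year) "
        f"VALUES ({month_id}, {year}, {month});"
        for month_id, year, month in rows
    ]
```

For the policy start years used in this project (2010 to 2012), `build_month_dimension(2010, 2012)` yields 36 rows.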


There are several processes that can be tracked with this data model. The enterprise
business matrix below illustrates other possible alternatives, as well as the dimensions
that can be used by each process:

Figure 8: Insurance Business Matrix


4 Data Creation

Python, PHP and SQL scripts were used for data creation; the SQL scripts
were created with the support of the Mockaroo website.

4.1 PHP Scripts

PHP scripts were used to populate the Policy Coverage table.
The following script generates fake data for 1200 policies, each with between one and four coverages.
Listing 1: PHP script to populate Policy Coverage table

    <?php
    $numberPolicies = 1200;
    $numberCoverages = 4;

    for ($i = 1; $i <= $numberPolicies; $i++) {
        $coverages = [];
        // Each policy gets between 1 and $numberCoverages coverages
        $randomCoverages = rand(1, $numberCoverages);

        // Collect the coverage IDs 1..$randomCoverages without repeats
        for ($j = 1; $j <= $randomCoverages;) {
            if (!in_array($j, $coverages)) {
                array_push($coverages, $j);
                $j++;
            }
        }

        // Emit one INSERT statement per coverage, each wrapped in a <div>
        for ($k = 1; $k <= $randomCoverages; $k++) {
            ?><div><?php echo "INSERT INTO policy_coverage (policy_id, coverage_id) VALUES ($i, $k);";
            ?></div><?php
        }
    }
    ?>
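
As written, Listing 1 always assigns coverage IDs 1 through N to a policy, where N is the random count. If a random subset of distinct coverages is wanted instead, one option is to sample the IDs without replacement. The following is a sketch in Python rather than PHP, not the project's script; the function name and the `seed` parameter are hypothetical.

```python
import random


def coverage_inserts(number_policies=1200, number_coverages=4, seed=None):
    """Return INSERT statements giving each policy a random set of distinct coverages."""
    rng = random.Random(seed)
    statements = []
    for policy_id in range(1, number_policies + 1):
        how_many = rng.randint(1, number_coverages)
        # sample() draws without replacement, so a policy never repeats a coverage
        chosen = sorted(rng.sample(range(1, number_coverages + 1), how_many))
        for coverage_id in chosen:
            statements.append(
                "INSERT INTO policy_coverage (policy_id, coverage_id) "
                f"VALUES ({policy_id}, {coverage_id});"
            )
    return statements
```

Passing a fixed `seed` makes the generated data reproducible between runs, which helps when the same fake data set is needed to re-test the ETL.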


4.2 Python Scripts

Python scripts were used to populate the Vehicle and Policy tables.
The first one chooses a random start date (year, month and day), vehicle and
agent for each policy. This script creates 500 rows of fake data.
Listing 2: Python script to populate Policy table

    import random

    # Date source; days stop at 28 so every generated date is valid in any month
    year = [2010, 2011, 2012]
    month = list(range(1, 13))
    day = list(range(1, 29))

    # Policy_id
    policy_id = 0
    # List of all vehicles
    vehicle = list(range(1, 1001))
    # List of all agents
    agent = list(range(1, 11))

    with open("policy-output.sql", "w") as f:
        for i in range(500):
            policy_id = policy_id + 1

            # Picking random dates
            rand_year = random.choice(year)
            rand_month = random.choice(month)
            rand_day = random.choice(day)

            # Initial date, quoted so it is a valid SQL date literal
            rand_date_ini = "'" + str(rand_year) + "-" + str(rand_month) + "-" + str(rand_day) + "'"

            # All policies have 1 year of duration, creation of the final date
            rand_year = rand_year + 1
            rand_date_fin = "'" + str(rand_year) + "-" + str(rand_month) + "-" + str(rand_day) + "'"

            # Picking vehicle and agent randomly
            rand_vehicle = random.choice(vehicle)
            rand_agent = random.choice(agent)

            # Writing the SQL statement to the output file
            # Price will be calculated using SQL later
            sql = ("INSERT INTO policy (id, price, vehicle_id, agent_id, date_initial, date_final) VALUES ("
                   + str(policy_id) + ", " + str(0) + ", " + str(rand_vehicle) + ", " + str(rand_agent)
                   + ", " + rand_date_ini + ", " + rand_date_fin + ");\n")
            f.write(sql)
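
Listing 2 restricts days to 1 through 28 so that every generated date is valid. A possible alternative, sketched here under the assumption that any calendar day is acceptable as a start date, is to let the standard `datetime` module do the date arithmetic. The function name and the handling of 29 February are choices made for this example.

```python
import datetime
import random


def random_policy_dates(rng, years=(2010, 2011, 2012)):
    """Return (date_initial, date_final) for a policy lasting one year."""
    year = rng.choice(years)
    first = datetime.date(year, 1, 1)
    days_in_year = (datetime.date(year + 1, 1, 1) - first).days
    start = first + datetime.timedelta(days=rng.randrange(days_in_year))
    try:
        end = start.replace(year=start.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year
        end = datetime.date(start.year + 1, 2, 28)
    return start, end
```

Calling `start.isoformat()` then gives a zero-padded `YYYY-MM-DD` string that can be quoted and placed in the INSERT statement.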


The second one picks a random plate (three digits and four letters), a customer number, a vehicle model and a manufacturing year. This
script creates an output SQL file with 1000 rows.


Listing 3: Python script to populate Vehicle table

    import random

    # Plate source
    char = list("abcdefghijklmnopqrstuvwxyz")
    num = list("0123456789")

    # list of customers
    cust = list(range(1, 101))
    # list of models
    model = list(range(1, 275))

    # car's years
    year = [str(y) for y in range(2000, 2015)]

    with open("vehicle_output.sql", "w") as f:
        # IDs start at 1 to match the vehicle IDs 1 to 1000 used by the Policy script
        for vehicle_id in range(1, 1001):
            # Plate's data randomly: three digits, a dash, four letters
            plate = (random.choice(num) + random.choice(num) + random.choice(num) + "-"
                     + random.choice(char) + random.choice(char)
                     + random.choice(char) + random.choice(char))

            # The plate is quoted so it is a valid SQL string literal
            sql = ("INSERT INTO vehicle (id, plate, customer_id, model_id, year) VALUES ("
                   + str(vehicle_id) + ", '" + plate + "', " + str(random.choice(cust))
                   + ", " + str(random.choice(model)) + ", " + random.choice(year) + ");\n")
            f.write(sql)
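
Independent random choices can occasionally produce the same plate twice. If plates are meant to be unique (an assumption; the report does not state this rule), a set can be used to discard repeats. The following sketch keeps the plate format of Listing 3; the function names are hypothetical.

```python
import random
import string


def random_plate(rng):
    """Return a plate made of three digits, a dash and four lowercase letters."""
    digits = "".join(rng.choices(string.digits, k=3))
    letters = "".join(rng.choices(string.ascii_lowercase, k=4))
    return f"{digits}-{letters}"


def unique_plates(count, seed=None):
    """Return `count` distinct plates; duplicates are dropped by the set."""
    rng = random.Random(seed)
    plates = set()
    while len(plates) < count:
        plates.add(random_plate(rng))
    # Sorted only so that a fixed seed gives a reproducible list
    return sorted(plates)
```

The format allows 1000 x 26^4 different plates, so the loop needs very few retries for the 1000 vehicles generated here.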



4.3 Mockaroo Scripts

The Mockaroo website was used to generate most of the data due to its simplicity and effectiveness. It is a simple website with a straightforward interface for generating data in many different formats; we
decided to generate all scripts in SQL to maintain the consistency of
the project.

Figure 9: Mockaroo homepage


5 ETL Process

The Extract/Transform/Load (ETL) process was developed partially using the
PHP-based framework Laravel and partially using SQL scripts.
Laravel was used to develop the ETL for the first star schema (Figure 6) due
to the complexity involved in the process. For the ETL process, only the
model part of Laravel's MVC (model, view, controller) environment was
used, since there was no need to build an entire application.

Figure 10: PHP script of ETL Part 1


Figure 11: PHP script of ETL Part 2


SQL scripting was used to create the ETL process for the periodic snapshot.
Listing 4: SQL script for the ETL process of periodic snapshot

    INSERT INTO monthly_sales_fact (agent_id, policy_sales_amount, total_price, month_id) (
        (SELECT
            policy.agent_id AS agent_id,
            COUNT(policy.agent_id) AS policy_sales_amount,
            SUM(policy.price) AS total_price,
            md.month_id
        FROM
            policy
        JOIN
            month_dimension md ON md.month_year = DATE_PART(
                'year',
                policy.date_initial
            ) AND md.month_number_year = DATE_PART(
                'month',
                policy.date_initial
            )
        GROUP BY
            policy.agent_id, md.month_id)
    )
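
To make the aggregation in Listing 4 easier to follow, the sketch below reproduces the same grouping in plain Python: policies are grouped by agent and by the year and month of their initial date, then counted and summed. It is an illustration only. It keys on (agent, year, month) directly instead of looking up `month_id` in the month dimension, and the input tuple layout is an assumption.

```python
from collections import defaultdict
import datetime


def monthly_sales(policies):
    """Aggregate policies into one entry per agent per month.

    policies: iterable of (agent_id, price, date_initial) tuples.
    Returns {(agent_id, year, month): (policy_sales_amount, total_price)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for agent_id, price, date_initial in policies:
        key = (agent_id, date_initial.year, date_initial.month)
        totals[key][0] += 1       # mirrors COUNT(policy.agent_id)
        totals[key][1] += price   # mirrors SUM(policy.price)
    return {key: tuple(value) for key, value in totals.items()}
```

For example, two policies sold by agent 1 in January 2010 with prices 100.0 and 50.0 collapse into a single entry with a count of 2 and a total of 150.0.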


After the ETL process finished, the following table was generated as an
example of a report that can be extracted from the data model developed
in this project.

Figure 12: Monthly Sales Fact Data


The process of building and executing the ETL was very instructive,
giving an overview of insurance and data warehousing.


6 Remarks

In sum, a fully functional data warehouse was developed following the business rules that the team created according to the business process discussed
in chapter sixteen of Kimball and Ross (2013) [2]. The first step was to define the business rules in order to design a transactional database, which is
the backbone of the data warehouse. After that, the transactional database
was implemented and tested to ensure it conformed to the rules before the next phase, which
was the creation of the star schema of the main business process. A
star schema for a periodic snapshot was also designed to support the analysis of
monthly sales. After the implementation of both star schemas, scripts were
created to populate the database with fake data for the purpose of testing
the scripts created for the ETL process. The ETL process was developed with
Laravel and SQL scripts, and the last step was to generate an example of the reports
that the data model designed by the team can produce.
Some notable experiences during the project were:
- The need to change the initial ER model during the process to fit all requirements
- The ETL process is the most complex step
- There are many different ways to generate fake data

References
[1] The PostgreSQL Global Development Group. PostgreSQL documentation,
2014.
[2] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, Inc., 3rd edition, 2013.
[3] Taylor Otwell. Laravel 4.2 documentation, 2014.
