
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

MANIPAL INSTITUTE OF TECHNOLOGY

(A constituent Institute of MANIPAL UNIVERSITY)

MANIPAL - 576 104, KARNATAKA, INDIA



Industrial Training



On

Databases and Data Warehousing



SUBMITTED
BY


Rina Maria Pereira, Reg. No.-100905034,
rina.per92@gmail.com, Ph-8095103556




Under the Guidance of

Mr. Shariq Khan


Sr. ETL Consultant, IT Services Department



Khorafi Business Machines, Kuwait

Table of Contents


1. Certificate
2. Acknowledgement
3. Abstract
4. Details of the Organization
5. Information acquired during study period
   i. Database
   ii. Data warehouse
6. Project
   i. Dimensions
   ii. Staging
   iii. System of Record
   iv. Fact
7. Conclusion
8. References







CERTIFICATE




ACKNOWLEDGEMENT


It is a matter of great satisfaction and pleasure to present this report on database and data warehousing
concepts. I take this opportunity to thank all those involved in my training.

Firstly, I would like to thank Khorafi Business Machines (KBM) for giving me the opportunity to complete
my project in the organization. I also put on record my sincere thanks to my college, Manipal Institute of
Technology (MIT), Manipal, for giving me such an opportunity. I am extremely grateful to Mr. Shariq
Khan for his encouragement, discussions and critical assessment of the project.

It was a good experience for me to design, create, test and execute ETL jobs. I am greatly obliged to Mr.
Shariq Khan, my industry guide, and Mr. Ihab, who shared their expertise and knowledge with me,
without which the completion of the project would not have been possible.

I express my gratitude to the staff of KBM who helped me, directly or indirectly, in completing the
training.






















ABSTRACT


KEYWORDS: Database, Data warehouse, GBM, KBM, ETL

A database maintains information about various types of objects (inventory), events
(transactions), people (employees), and places (warehouses). A data warehouse is
a database used for reporting and data analysis. It is a central repository of data which is created
by integrating data from one or more disparate sources. Data warehouses store current as well
as historical data and are used for creating trending reports for senior management reporting
such as annual and quarterly comparisons.

Gulf Business Machines (GBM) is the leading IT solutions provider in the Gulf region, fulfilling
the IT requirements of local, regional and international organizations in the GCC. GBM is the
sole distributor for IBM, excluding selected IBM products and services, throughout the GCC.
GBM's mission is to be the IT partner of choice for organizations in the region, offering them
end-to-end IT solutions to help them achieve their business goals.

Khorafi Business Machines (W.L.L) is Kuwait's operation of Gulf Business Machines. It is
involved in carrying out the modernization project designed to meet today's challenging and
ever-evolving business environment and to prepare for the future. It has been involved in
projects that improve government efficiency and service delivery throughout the region.

KBM uses IBM InfoSphere DataStage, which enables the engineer to design data flows that extract
information from multiple source systems, transform it in ways that make it more valuable, and
then deliver it to one or more target databases and applications.

Through IBM InfoSphere DataStage, we create ETL jobs for various public and private
organizations. These jobs contain fact tables and dimensions and define how the data is extracted,
transformed and loaded into the data warehouse, providing better services to all clients of KBM.
Further, these ETL agents manage the flow of data between the data sources and the target
warehouses by transferring data from the source database to the target database.


DETAILS OF THE ORGANISATION

Founded in 1990, Gulf Business Machines (GBM) is the leading IT solutions provider in the gulf
region fulfilling the IT requirements of local, regional and international organizations in the GCC.
GBM is the sole distributor for IBM excluding selected IBM products and services throughout the
GCC, except for Saudi Arabia.
GBM's momentum was further enhanced in 1999, when the team secured the Cisco portfolio.
GBM now holds the highest level of recognition in the region from Cisco, Gold Partner status,
in addition to the Cisco Borderless Network Architecture Specialized Learning Partner status.
Today, GBM is one of the largest IT solutions providers in the GCC, with more than 1000
employees and over 20 solid strategic partnerships forged with internationally recognized IT
solution providers. This means that GBM can offer an extensive range of IT infrastructure, IT
solutions and services ranging from consulting, resource deployment and integration to after-
sales support.
GBM Mission
To be the information technology partner of choice for organizations in the region offering
them End-to-End IT solutions to help them achieve their business goals.
GBM Vision
Our Customers
To remain committed to excellence in providing leading IT solutions and services and to focus
on helping our customers operate at their best in a competitive and challenging environment.
Our People
To strive to provide our employees with a dynamic and creative environment to allow them to
realize their full potential.



Our Partners
To reinforce our leadership position across the markets in which we operate through our strong
alliances with worldwide technology leaders and our partnerships with renowned solutions
providers.

GBM's experience and expertise span across multiple sectors and particularly e-Government,
Banking and Finance, Telecommunications, Retail and Oil. Believing in the importance of
being where our customers are, GBM today has offices in the UAE (Abu Dhabi, Dubai and
Sharjah), Bahrain, Kuwait, Oman and Qatar, as well as in Pakistan.
Through a unique combination of local market presence, international level skills, a network of
business partners and the access to the worldwide resources of IBM and Cisco, GBM
consistently brings to customers unparalleled IT business solutions.
GBM invests in the continuous training of its employees. As a result, GBM's specialized team is
well equipped to address the ever-evolving, industry-specific IT demands in every market.
When particular challenges and requirements arise, GBM is able to adapt and quickly leverage
resources stationed in the various locations and coordinated by the Dubai central office.
GBM's business model looks forward to offering a cutting edge value proposition for clients,
customers, principals and partners aiming at consistently delivering outstanding technology and
exceptional value.
The following solutions and brands are some of the key Information Management offerings of
GBM:
IBM Cognos 8 Business Intelligence delivers the complete range of BI capabilities: Reporting,
Analysis, Dashboarding and Scorecards on a single, service-oriented architecture (SOA).
Together with Cognos Planning and Cognos Financial Consolidation, these offerings provide a
comprehensive Corporate Performance Management (CPM) solution.
GBM offers the industry-leading portfolio of IBM Systems and, together with IBM, is setting the
infrastructure agenda for the region and providing its clients with a clear path to a dynamic
infrastructure, helping them transform IT assets into valued services. GBM gives its clients a
choice from a broad range of server, storage and retail system offerings required to match
diverse workload needs. With broad expertise across servers, storage, software and technical
support, GBM is in a unique position to deliver innovative products and solutions that help meet
customer business goals in ways not previously thought possible. From applications and skills
to technology resources, GBM delivers flexibility through a choice of platforms, a broad
portfolio and integrated virtualization capabilities, all built on open industry standards to
help meet the needs of any enterprise.

Khorafi Business Machines
Khorafi Business Machines (W.L.L) is Kuwait's operation of Gulf Business Machines. It is
involved in carrying out the modernization project designed to meet today's challenging and
ever-evolving business environment and to prepare for the future. It has been involved in
projects that improve government efficiency and service delivery throughout the region.

Khorafi Business Machines Co is a private company which provides IT goods and services
including E-Business, networking, mainframes, AS/400, RS/6000, personal computers,
consultancy, project management, education and training, out-sourcing, systems management
and integration.


INFORMATION ACQUIRED DURING THE STUDY PERIOD

INTRODUCTION:
o Information is everywhere in an organization.
o Employees must be able to obtain and analyze the many different levels, formats, and
granularities of organizational information to make decisions.
o Successfully collecting, compiling, sorting, and analyzing information can provide
tremendous insight into how an organization is performing.
o High quality information can significantly improve the chances of making a good
decision.
o Good decisions can directly impact an organization's bottom line.
o Information is stored in databases.

1. Database
A database maintains information about various types of objects (inventory), events
(transactions), people (employees), and places (warehouses).
Database advantages from a business perspective include:
o Increased flexibility
o Increased scalability and performance
o Reduced information redundancy
o Increased information integrity (quality)
o Increased information security

MY ROLE:
During the first couple of weeks of my internship I was taught the following.
Entity- a person, place, thing, transaction, or event about which information is stored.
The rows in each table contain the entities.
Entity class (table)- a collection of similar entities.
Attributes (fields, columns)- characteristics or properties of an entity class.
The columns in each table contain the attributes.
Primary key- a field (or group of fields) that uniquely identifies a given entity in a
table.
Foreign key- a primary key of one table that appears as an attribute in another table and
acts to provide a logical relationship between the two tables.
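
The terms above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical and are not taken from the training material. Each table is an entity class, each row is an entity, and the customer_id column in customer_order is a foreign key that refers to the primary key of customer.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key: identifies one customer
        name        TEXT NOT NULL          -- attribute
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
        product     TEXT NOT NULL,
        quantity    INTEGER NOT NULL
    );
""")

# Each inserted row is one entity of its entity class.
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Corner Store"), (2, "City Mart")])
conn.executemany("INSERT INTO customer_order VALUES (?, ?, ?, ?)",
                 [(10, 1, "Cola 330ml", 24),
                  (11, 2, "Cola 1L", 12),
                  (12, 1, "Cola 1L", 6)])


def orders_for(customer_name):
    """Follow the foreign key from the orders back to the named customer."""
    rows = conn.execute(
        """SELECT o.order_id, o.product, o.quantity
           FROM customer_order AS o
           JOIN customer AS c ON c.customer_id = o.customer_id
           WHERE c.name = ?
           ORDER BY o.order_id""",
        (customer_name,),
    )
    return rows.fetchall()


print(orders_for("Corner Store"))
```

The join works only because customer_order carries the customer's primary key; that shared column is the logical relationship between the two tables.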


Potential relational database for Coca Cola




Entity-Relationship Diagram- An ER diagram is a graphical representation of entities
and their relationships to each other, typically used in computing to describe the
organization of data within databases or information systems. An entity is a piece of
data: an object or concept about which data is stored.

Normalization- Normalization is the process of efficiently organizing data in a
database. There are two goals of the normalization process: eliminating redundant
data and ensuring data dependencies make sense.
The database community has developed a series of guidelines for ensuring that
databases are normalized. These are referred to as normal forms and are numbered from
one through five. In practical applications, 1NF, 2NF and 3NF along with the
occasional 4NF are often used. Fifth normal form is very rarely used.
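
As a rough illustration of the first goal (eliminating redundant data), the sketch below splits a flat list of order rows, in which the customer's city is repeated on every order, into a customers table and an orders table. The data and field names are hypothetical, and the code only shows the idea of removing a repeated dependency; it is not a full normalization procedure.

```python
# Hypothetical denormalized rows: the customer's city is repeated on every order.
flat_orders = [
    {"order_id": 1, "customer": "Asha", "city": "Manipal", "item": "Pen"},
    {"order_id": 2, "customer": "Asha", "city": "Manipal", "item": "Notebook"},
    {"order_id": 3, "customer": "Omar", "city": "Kuwait City", "item": "Pen"},
]


def normalize(rows):
    """Split flat rows into a customers table and an orders table.

    Each customer's city is stored once; orders refer to the customer by id.
    """
    customers = {}   # customer name -> customer record
    orders = []
    for row in rows:
        if row["customer"] not in customers:
            customers[row["customer"]] = {
                "customer_id": len(customers) + 1,
                "name": row["customer"],
                "city": row["city"],
            }
        orders.append({
            "order_id": row["order_id"],
            "customer_id": customers[row["customer"]]["customer_id"],
            "item": row["item"],
        })
    return list(customers.values()), orders


customers, orders = normalize(flat_orders)
print(customers)
print(orders)
```

After the split, changing a customer's city means updating one row instead of every order that customer has placed.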

Referential Integrity- Referential integrity is a database concept that ensures that
relationships between tables remain consistent. When one table has a foreign key to
another table, the concept of referential integrity states that you may not add a record to
the table that contains the foreign key unless there is a corresponding record in the
linked table.
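
A small sketch of this rule, again with Python's sqlite3 module and hypothetical table names. Note that SQLite enforces foreign keys only after the foreign_keys pragma is switched on; other database systems may enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Sales')")


def try_insert_employee(emp_id, name, dept_id):
    """Return True if the row is accepted, False if it breaks referential integrity."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", (emp_id, name, dept_id))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert_employee(100, "Asha", 1)    # department 1 exists
rejected = try_insert_employee(101, "Omar", 99)   # department 99 does not exist
print(accepted, rejected)
```

The second insert is refused because the employee table holds the foreign key and there is no matching record in the department table.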

At the end of the first couple of weeks I was given a hard-copy description of a deluxe department
store and asked to create its logical and physical database design, which I completed
successfully.










The E-R diagram of the Deluxe Department Store is as follows:





2. Data warehouse:
A data warehouse is a logical collection of information gathered from many different
operational databases that supports business analysis activities and decision-making tasks.
The primary purpose of a data warehouse is to aggregate information throughout an
organization into a single repository for decision-making purposes.

MY ROLE:
I was given a brief explanation of IBM InfoSphere DataStage and how to create ETL jobs
using it.
The following are the things I learnt about IBM InfoSphere DataStage and ETL jobs:
IBM InfoSphere DataStage integrates data across multiple systems using a high
performance parallel framework, and it supports extended metadata management and
enterprise connectivity. The scalable platform provides more flexible integration of all
types of data, including big data at rest (Hadoop-based) or in motion (stream-based), on
distributed and mainframe platforms.
InfoSphere DataStage provides these features and benefits:
1. Powerful, scalable ETL platform: supports the collection, integration and
transformation of large volumes of data, with data structures ranging from simple to
complex.
2. Support for big data and Hadoop: enables you to directly access big data on a
distributed file system.
3. Near real-time data integration: as well as connectivity between data sources and
applications.
4. Workload and business rules management: helps you optimize hardware
utilization and prioritize mission-critical tasks.
5. Ease of use: helps improve speed, flexibility and effectiveness to build, deploy,
update and manage your data integration infrastructure.






DataStage comprises three client components:

1. DataStage Designer (Designer client)-
o The Designer client gives you the tools that you need to create jobs that extract,
transform, load, and check the quality of data.
o The Designer client is like a workbench or a blank canvas that you use to build
jobs. The Designer client has a palette that contains the tools that form the basic
building blocks of a job:
   - Stages connect to data sources to read or write files and to process data.
   - Links connect the stages along which your data flows.
   - Annotations provide information about the jobs that you create.
o The Designer client uses a repository where you can store the objects that you
are creating as part of the design process. These objects can be reused by other
job designers.
o Jobs and their associated objects are organized in projects.
2. DataStage Administrator (Administrator client)-
o Projects are created using the Administrator client. When you start the Designer
client, you specify the project that you will work in, and everything that you do
is stored in that project.
3. DataStage Director (Director client)-
o When your job designs are finished, they are run in the Director client. No data is
moved or transformed until you actually run the job. When you start the Director
client, you specify the project that contains the jobs to run. The Director client is
also used to monitor the jobs.
There are three types of jobs:
Server Jobs- Server jobs are compiled and run on the IBM InfoSphere
Information Server engine. Such jobs connect to a data source, extract and
transform data, and write data to a target database or file, such as a data
warehouse.
Parallel Jobs- Parallel jobs are designed to transform and cleanse data. They are
compiled and run on the IBM InfoSphere Information Server engine.
Sequence Jobs- For more complex designs, you can build sequences of jobs to
run. Job sequencing allows you to build in programming-type controls, such as
branching and looping. IBM InfoSphere DataStage provides a graphical job
sequencer which allows you to specify a sequence of parallel jobs or server jobs
to run.
An IBM InfoSphere DataStage job consists of individual stages linked together which
describe the flow of data from a data source to a data target.
A stage usually has at least one data input and/or one data output. However, some stages
can accept more than one data input, and output to more than one stage. Each stage has a
set of predefined and editable properties that tell it how to perform or process data.
Properties might include the file name for the Sequential File stage, the columns to sort,
and the transformations to perform.
These properties are viewed or edited using stage editors. Stages are added to a job and
linked together using the Designer.
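
The idea of a job as stages joined by links can be imitated in plain Python. The sketch below is only an analogy and does not use any DataStage API: each function stands in for a stage, passing one function's output to the next stands in for a link, and the sample data and function names are hypothetical.

```python
import csv
import io

# Hypothetical source data standing in for a sequential file.
SOURCE_TEXT = "port_code,port_name\nKWI, kuwait airport \nSHW, shuwaikh port \n"


def sequential_file_stage(text):
    """Source stage: read delimited text and emit one dict per row."""
    yield from csv.DictReader(io.StringIO(text))


def transformer_stage(rows):
    """Processing stage: trim whitespace and upper-case the port name."""
    for row in rows:
        yield {
            "port_code": row["port_code"].strip(),
            "port_name": row["port_name"].strip().upper(),
        }


def target_stage(rows, target):
    """Target stage: append rows to a list that stands in for a target table."""
    for row in rows:
        target.append(row)
    return len(target)


def run_job():
    """Link the stages together; no data moves until the job is run."""
    target = []
    target_stage(transformer_stage(sequential_file_stage(SOURCE_TEXT)), target)
    return target


loaded = run_job()
print(loaded)
```

Because the stages are generators, rows flow through the links one at a time, which loosely mirrors how data moves from a source stage to a target stage when a job is run.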



Stages and links can be grouped in a shared container. Instances of the shared container
can then be reused in different parallel jobs. The different types of jobs have different
stage types.
The stages that are available in the Designer depend on the type of job that is currently
open in the Designer. Parallel Job stages are organized into different groups on the
Designer palette as follows:
General- includes stages such as Container and Link.
Data Quality- includes stages such as Investigate, Standardize, Reference
Match, and Survive.
Database- includes stages such as Classic Federation, DB2 UDB, DB2
UDB/Enterprise, Oracle, Sybase, SQL Server, Teradata, Distributed
Transaction and ODBC.
File- includes stages such as Complex Flat File, Data Set, Lookup File Set,
and Sequential File.
Processing- includes stages such as Aggregator, Copy, FTP, Funnel, Join,
Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate
Key Generator, Sort, and Transformer.




















PROJECT:
IMMIGRATION SYSTEM FOR THE
MINISTRY OF INTERIORS,
KUWAIT














INTRODUCTION

Ministry of Interiors (MOI), Kuwait appointed Khorafi Business Machines to design the
Immigration System for keeping track of all the exit-entry points around the country.
It was supposed to keep track of
1. All ports (air, water and land)
2. Nationality
3. Movement (exit-entry)
4. Terminal
5. Counter (distinct number given to each port)
6. Gender
7. Passport No
8. Visa Type, etc.

MOI provided us with the following data source tables:
1. TEXIT_ENTRY_INCLUS
2. TEXIT_ENTRY_MOVE
3. TUSER
4. T_PUBLIC_ORG
5. NATIONALITY
6. TTERMINAL
KBM implemented the following flat files:
1. CONTINENT
2. COUNTRY CATEGORY (GCC, KUWAIT, FOREIGNER, OTHER)
3. PORT CATEGORY (AIR, LAND, SEA)





The following were the various target files created using the ETL jobs:
1. DIMENSIONS
   CONTINENT_DIM
   TIME_DIM
   DOC_TYPE_DIM
   RESIDENCE_TYPE_DIM
   COUNTRY_DIM
   COUNTRY_CAT_DIM
   PORT_CAT_DIM
   PORT_DIM
2. STAGING
   PORT_STG
3. SYSTEM OF RECORD (SOR)
   CUST_INFO_SOR
   PORT_SOR
   MOVE_SOR
   USER_SOR
   MACHINE_SOR
4. FACT
   PORT_FACT








DIMENSIONS

In a data warehouse, Dimensions provide structured labeling information to otherwise
unordered numeric measures. The dimension is a data set composed of individual, non-
overlapping data elements. The primary functions of dimensions are threefold: to provide
filtering, grouping and labeling.
Below are some of the dimensions designed by KBM for the Immigration system

1. CONTINENT_DIM and its transformer-






2. COUNTRY_CAT_DIM and its transformer





3. For all dimensions, any value that is not specified takes the value -999.
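
The -999 rule can be sketched as a simple lookup. The dimension contents and the function name below are hypothetical. Treating a value that is present but not found in the dimension the same way as a missing value is an assumption of this sketch, not something stated in the project.

```python
UNKNOWN_KEY = -999  # default used in the project for unspecified dimension values

# Hypothetical contents of a port category dimension: value -> dimension id.
port_cat_dim = {"AIR": 1, "LAND": 2, "SEA": 3}


def dimension_key(dim, value):
    """Return the dimension id for a value, or -999 if it is missing or unknown."""
    if value is None:
        return UNKNOWN_KEY
    # Assumption: values are matched after trimming and upper-casing.
    return dim.get(value.strip().upper(), UNKNOWN_KEY)


print(dimension_key(port_cat_dim, "air"))
print(dimension_key(port_cat_dim, None))
```

Using a fixed placeholder key means that every record can still be joined to the dimension, even when the source value is absent.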






STAGING
The Data Warehouse Staging Area is a temporary location where data from source systems is
copied. A staging area is mainly required in a data warehousing architecture for timing
reasons: all required data must be available before it can be integrated into the data
warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource
limitations and geographical factors, it is not feasible to extract all the data from all operational
databases at exactly the same time.
Below is the staging diagram for PORT and its transformers.














Loading data into the staging table
Data can be loaded in two ways:
1. Historical Loading
Data is loaded based on a from date and a to date.
2. Ongoing Loading
Data is loaded based on the timestamp of the last date read, so that only newer data is picked up.
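
The two loading modes can be sketched as two filters over timestamped source rows. The rows, field names and function names are hypothetical. Whether the date bounds are inclusive, and how the last-read timestamp is stored between runs, are assumptions of this sketch.

```python
from datetime import datetime

# Hypothetical source rows, each with a movement timestamp.
source_rows = [
    {"id": 1, "moved_at": datetime(2013, 6, 1, 9, 0)},
    {"id": 2, "moved_at": datetime(2013, 6, 15, 18, 30)},
    {"id": 3, "moved_at": datetime(2013, 7, 2, 7, 45)},
]


def historical_load(rows, from_date, to_date):
    """Historical loading: select rows between from_date and to_date (inclusive)."""
    return [r for r in rows if from_date <= r["moved_at"] <= to_date]


def ongoing_load(rows, last_loaded):
    """Ongoing loading: select rows newer than the last timestamp read.

    Returns the new rows and the timestamp to remember for the next run.
    """
    new_rows = [r for r in rows if r["moved_at"] > last_loaded]
    new_mark = max((r["moved_at"] for r in new_rows), default=last_loaded)
    return new_rows, new_mark


june = historical_load(source_rows,
                       datetime(2013, 6, 1),
                       datetime(2013, 6, 30, 23, 59, 59))
fresh, mark = ongoing_load(source_rows, last_loaded=datetime(2013, 6, 15, 18, 30))
print([r["id"] for r in june], [r["id"] for r in fresh], mark)
```

Historical loading is typically run once to fill the table for a chosen period, while ongoing loading is repeated, each run starting from the timestamp remembered by the previous one.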






SYSTEM OF RECORD
The SOR loads data from the staging table, but stores the IDs of the corresponding dimensions.
It contains all meaningful information.
Below are a few of the SOR tables:
1. CUST_INFO_SOR and its transformer




2. MOV_SOR and its transformer







FACT

A fact table is the central table in a star schema of a data warehouse. A fact table stores
quantitative information for analysis and is often denormalized.
A fact table works with dimension tables. A fact table holds the data to be analyzed, and a
dimension table stores data about the ways in which the data in the fact table can be analyzed.
Thus, the fact table consists of two types of columns. The foreign key columns allow joins
with dimension tables, and the measure columns contain the data that is being analyzed.
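
A minimal star-schema sketch using Python's sqlite3 module shows the two kinds of columns at work. The table names echo the project's naming style, but the columns, the exit_count measure and the sample figures are hypothetical and are not the actual MOI design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE port_dim (port_id INTEGER PRIMARY KEY, port_name TEXT);
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE port_fact (
        port_id    INTEGER REFERENCES port_dim(port_id),  -- foreign key column
        time_id    INTEGER REFERENCES time_dim(time_id),  -- foreign key column
        exit_count INTEGER                                -- measure column
    );
    INSERT INTO port_dim VALUES (1, 'Airport'), (2, 'Seaport');
    INSERT INTO time_dim VALUES (1, '2013-06'), (2, '2013-07');
    INSERT INTO port_fact VALUES (1, 1, 120), (1, 2, 150), (2, 1, 30), (2, 2, 45);
""")


def exits_by_port():
    """Sum the measure, grouped and labelled by the port dimension."""
    return conn.execute("""
        SELECT d.port_name, SUM(f.exit_count)
        FROM port_fact AS f
        JOIN port_dim AS d ON d.port_id = f.port_id
        GROUP BY d.port_name
        ORDER BY d.port_name
    """).fetchall()


print(exits_by_port())
```

The fact table holds only ids and numbers; the dimension supplies the label used for grouping, which is the filtering, grouping and labeling role of dimensions.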
Below is the fact for PORT for exit:
SOR







Fact for exit




















CONCLUSION



The main objective of industrial training is to give undergraduates an opportunity to
identify, observe and practise how engineering is applied in industry. It is meant not only
to provide experience of technical practices but also to let students observe management
practices and interact with fellow workers.

It is easy to work with sophisticated machines, but not with people. The industrial training
period is the only chance an undergraduate has to gain this experience, and I feel I got the
maximum out of it. I also learnt how work is done in an organization, and the importance of
punctuality, commitment and team spirit.

During the course of my training I was exposed to an office environment and learnt the
basics of how the company works and which tasks are performed by which teams.

It was a pleasure to be involved in such a highly professional environment, and every
moment I spent there was a learning experience. Along with learning the concepts of
data warehousing, I designed, created, tested and executed ETL jobs using IBM InfoSphere
DataStage for the Ministry of Interiors project. From the MOI databases we created flat files
and dimensions, and eventually ETL jobs, to meet the government's requirements.

Overall, it was a very insightful and wonderful learning experience. I will always cherish
this opportunity, and I am grateful to the college for giving us the chance to work in a
real working environment and preparing us for the future.













REFERENCES







[1] http://en.wikipedia.org/



[2] http://gbm4ibm.com



[3] http://redbook.ibm.com



[4] https://pic.dhe.ibm.com
