
1. What is a database?
Ans: A database is a collection of related data. More precisely, it is a logically coherent collection of data with some inherent meaning.

2. What is a DBMS?
Ans: A database management system is a collection of programs that enables users to create and maintain a database. A DBMS is thus a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications. Defining a database involves specifying the data types, structures, and constraints for the data to be stored. Constructing a database is the process of storing the data itself on some storage medium controlled by the DBMS. Manipulating a database includes functions such as querying the database to retrieve specific data and updating the database to reflect changes in the mini-world.

3. What is a catalog?
Ans: A catalog is a table that contains information such as the structure of each file, the type and storage format of each data item, and the various constraints on the data. The information stored in the catalog is called metadata. Whenever a request is made to access particular data, the DBMS software refers to the catalog to determine the structure of the file.

4. What are data warehousing and OLAP?
Ans: Data warehousing and OLAP (online analytical processing) systems are techniques used in many companies to extract and analyze useful information from very large databases for decision making.

5. What is real-time database technology?
Ans: These are the techniques used in controlling industrial and manufacturing processes.

6. What is program-data independence?
Ans: Unlike in a traditional file system, the structure of the data files is stored in the DBMS catalog separately from the access programs. This property is called program-data independence: we need not change the access programs if the structure of the data changes, something traditional file systems do not support.

7. What is an ORDBMS?
Ans: An object-relational DBMS is a relational DBMS in which everything is treated as an object. Users can define operations on data as part of the database definition.

8. What is program-operation independence?
Ans: An operation is specified in two parts: 1. the interface (operation name and the data types of its arguments), and 2. the implementation (the code). The implementation can be changed without affecting the interface. This is called program-operation independence.

9. What is a view?
Ans: A view may be a subset of the database, or it may contain virtual data that is derived from the database files but is not explicitly stored.

10. What is OLTP?
Ans: Online transaction processing applications involve multiple database accesses from different parts of the world. OLTP needs multi-user DBMS software to ensure that concurrent transactions operate correctly.

11. What is the job of a DBA?
Ans: A database administrator is a person or group responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources as needed.

12. Who are database designers?
Ans: Database designers are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store this data.

13. What are the different types of end users?
Ans: 1. Casual end users. 2. Naive or parametric end users. 3. Sophisticated end users. 4. Standalone users.

14. What are the advantages of using a DBMS?
Ans: 1. Controlling redundancy. 2. Restricting unauthorized access. 3. Providing persistent storage for program objects and data structures. 4. Permitting inferencing and actions using rules. 5. Providing multiple user interfaces. 6. Representing complex relationships among data. 7. Enforcing integrity constraints. 8. Providing backup and recovery.

15. What are the disadvantages of using a DBMS?
Ans: 1. High initial investment in hardware, software, and training. 2. The generality that a DBMS provides for defining and processing data. 3. Overhead for providing security, concurrency control, recovery, and integrity functions.

16. What is a data model?
Ans: A data model is a collection of concepts that can be used to describe the structure of a database; it provides the necessary means to achieve this abstraction. By structure of a database we mean the data types, relationships, and constraints that should hold for the data.

17. What are the different categories of data models?
Ans: 1. High-level or conceptual data models. 2. Representational data models. 3. Low-level or physical data models. High-level data models provide concepts that are close to the way many users perceive data. Representational data models provide concepts that may be understood by end users but that are not too far removed from the organization of data in the database. Physical data models describe the details of how data is stored in the computer.

18. What is a schema?
Ans: The description of a database is called the database schema; it is specified during database design and is not expected to change frequently. A displayed schema is called a schema diagram. Each object in the schema is called a schema construct.

19. What are the types of schemas?
Ans: 1. Internal schema. 2. Conceptual schema. 3. External schemas or user views.

20. What is data independence?
Ans: Data independence is the capacity to change the schema at one level of the architecture without having to change the schema at the next higher level. There are two types: 1. Logical data independence: the capacity to change the conceptual schema without having to change external schemas or application programs. 2. Physical data independence: the capacity to change the internal schema without having to change the conceptual (or external) schemas.

21. What are the different DBMS languages?
Ans: 1. DDL (data definition language). 2. SDL (storage definition language). 3. VDL (view definition language). 4. DML (data manipulation language).

22. What are the different types of DBMSs?
Ans: 1. DBMS 2. RDBMS (relational) 3. ORDBMS (object-relational) 4. DDBMS (distributed) 5. FDBMS (federated) 6. HDDBMS (homogeneous distributed) 7. HDBMS (hierarchical) 8. NDBMS (network).

23. What is an entity?
Ans: An entity is a thing in the real world with an independent existence.

24. What are attributes?
Ans: Attributes are the particular properties that describe an entity.

25. What are the different types of attributes?
Ans: 1. Composite vs. simple attributes. 2. Single-valued vs. multi-valued attributes. 3. Stored vs. derived attributes. 4. Null-valued attributes. 5. Complex attributes.

26. What is the difference between an entity set and an entity type?
Ans: An entity type defines a collection of entities that have the same attributes; an entity set is the collection of all entities of a particular entity type that exist in the database at a given point in time.

27. What is the domain or value set of an attribute?
Ans: It is the set of values that may be assigned to that attribute for each individual entity.

28. What is the degree of a relationship?
Ans: The degree of a relationship is the number of entity types participating in the relationship.

29. What is a recursive relationship?
Ans: A recursive relationship is one in which both participating entities belong to the same entity type.

30. What are relationship constraints?
Ans: 1. Cardinality ratio. 2. Participation constraints.

31. What is a cardinality ratio?
Ans: The cardinality ratio for a binary relationship specifies the number of relationship instances that an entity can participate in.

32. What is a participation constraint?
Ans: A participation constraint specifies whether the existence of an entity depends on its being related to another entity via the relationship type. It is of two types: 1. Total participation. 2. Partial participation.

33. What are weak entity types?
Ans: Entity types that do not have key attributes of their own are called weak entity types; the rest are called strong entity types. The entity type that gives identity to a weak entity type is called the owner entity type, and the relationship is called the identifying relationship. A weak entity type always has a total participation constraint with respect to its identifying relationship.

34. What is an ER diagram?
Ans: The entity-relationship model is based on a view of the real world that consists of basic objects called entities and of relationships among these objects; entities are described in the database by a set of attributes. An ER diagram is the graphical representation of this model.

35. What is the EER model?
Ans: The enhanced (or extended) entity-relationship model extends the ER model with the concepts of subclass and superclass, specialization and generalization, and category (union type).

36. What is specialization?
Ans: Specialization is the process of defining a set of subclasses of an entity type, where each subclass contains all the attributes and relationships of the parent entity type and may have additional attributes and relationships specific to itself.

37. What is generalization?
Ans: Generalization is the process of finding the common attributes and relationships of a number of entity types and defining a common superclass for them.

38. What are the constraints on generalization and specialization?
Ans: 1. Disjointness constraints. 2. Completeness constraints. The disjointness constraint specifies that the subclasses of the specialization must be disjoint, i.e., an entity can be a member of at most one of the subclasses of the specialization; the reverse is overlapping. The completeness constraint is a participation constraint which may be total or partial. A total specialization constraint requires each entity in the superclass to be a member of some subclass in the specialization, while a partial specialization constraint allows an entity not to belong to any of the subclasses. We thus have the following four combinations of constraints on specialization: 1. Disjoint, total. 2. Disjoint, partial. 3. Overlapping, total. 4. Overlapping, partial.

39. What is a ternary relationship?
Ans: A relationship of degree 3 is called a ternary relationship.

40. What are aggregation and association?
Ans: Aggregation is an abstraction concept for building composite objects from their component objects. The abstraction of association is used to associate objects from several independent classes.

41. What is RAID technology?
Ans: RAID stands for redundant arrays of inexpensive (or independent) disks. The main goal of RAID technology is to even out the widely different rates of performance improvement of disks against those of memory and microprocessors. RAID employs the technique of data striping to achieve higher transfer rates.

42. What is the hashing technique?
Ans: Hashing is a primary file organization technique that provides very fast access to records under certain search conditions. The search condition must be an equality condition on a single field, called the hash field of the file. Variants include: 1. Internal hashing. 2. External hashing. 3. Extendible hashing. 4. Linear hashing. 5. Partitioned hashing.

43. What are the different types of relational constraints?
Ans: 1. Domain constraints. 2. Key constraints. 3. Entity integrity constraints. 4. Referential integrity constraints. Domain constraints specify that the value of each attribute must be an atomic value from the domain of that attribute. Key constraints state that no two tuples can have the same combination of values for all their attributes. The entity integrity constraint states that no primary key value can be null. The referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation; it is specified between two relations and is used to maintain consistency among their tuples.

44. What is the difference between a super key, a key, a candidate key, and a primary key?
Ans: A super key specifies a uniqueness constraint: no two distinct tuples in a relation state can have the same value for the super key. Every relation has at least one default super key, the set of all its attributes. A key is a minimal super key, i.e., a super key from which no attribute can be removed without losing the uniqueness property. A relation schema may have more than one key; in that case each key is called a candidate key, and one of the candidate keys, typically one with a minimal number of attributes, is chosen as the primary key.

45. What is a foreign key?
Ans: A set of attributes in one relation schema is called a foreign key if it references the primary key of another relation schema to which it is related.

46. What is a transaction?
Ans: A transaction is a logical unit of database processing that includes one or more database access operations.
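To make the foreign key and referential integrity ideas (Q43-Q45) concrete, here is a minimal sketch in standard SQL; the dept and emp tables and their columns are hypothetical, invented for illustration:

-- Parent table: its primary key is the target of the reference.
CREATE TABLE dept (
    dept_no   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

-- Child table: dept_no is a foreign key referencing dept.
-- Any dept_no stored here must already exist in dept,
-- which is exactly the referential integrity constraint of Q43.
CREATE TABLE emp (
    emp_no  INT PRIMARY KEY,   -- entity integrity: cannot be null
    name    VARCHAR(50),
    dept_no INT,
    FOREIGN KEY (dept_no) REFERENCES dept (dept_no)
);

An INSERT into emp with a dept_no that is not present in dept would be rejected by the DBMS.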

47. What are the properties of a transaction?
Ans: 1. Atomicity. 2. Consistency preservation. 3. Isolation. 4. Durability (permanence).

48. What are the basic database operations?
Ans: 1. write_item(X). 2. read_item(X).

49. What are the problems of uncontrolled concurrency?
Ans: 1. The lost update problem. 2. The temporary update (dirty read) problem. 3. The incorrect summary problem.

50. What are serial and non-serial schedules?
Ans: A schedule S is serial if, for every transaction T participating in the schedule, all the operations of T are executed consecutively in the schedule; otherwise the schedule is called non-serial.

51. What is a serializable schedule?
Ans: A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions.

52. What is result equivalence?
Ans: Two schedules are called result equivalent if they produce the same final state of the database.

53. What are conflict equivalent schedules?
Ans: Two schedules are said to be conflict equivalent if the order of any two conflicting operations is the same in both schedules.

54. What is a conflict serializable schedule?
Ans: A schedule is called conflict serializable if it is conflict equivalent to some serial schedule.

55. What is view equivalence?
Ans: Two schedules S and S' are said to be view equivalent if the following three conditions hold: 1. Both S and S' contain the same set of transactions with the same operations. 2. If a read operation read(X) reads a value written by a particular write operation (or the original value of X) in one schedule, the same condition must hold for that read(X) operation in the other schedule. 3. If an operation write(Y) is the last operation to write the value of Y in schedule S, the same operation must be the last to write Y in schedule S'.

56. What is view serializability?
Ans: A schedule is said to be view serializable if it is view equivalent to some serial schedule.

57. What are the various methods of controlling concurrency?
Ans: 1. Locking: locking a data item prevents multiple transactions from accessing the item concurrently. 2. Timestamping: a timestamp is a unique identifier for each transaction, generated by the system.

58. What is a lock?
Ans: A lock is a variable associated with a data item that describes the status of the item with respect to the possible operations that can be applied to it.

59. What are the various types of locking techniques?
Ans: 1. Binary locks. 2. Shared/exclusive locks. 3. Two-phase locking.

60. What is a binary lock?
Ans: A binary lock can have two states or values: locked (1) and unlocked (0). If an item is locked it cannot be accessed by any other operation; otherwise it can be.

61. What is a shared/exclusive lock?
Ans: It implements a multiple-mode lock, allowing multiple concurrent accesses for read operations but exclusive access for a write operation.

62. Explain two-phase locking.
Ans: All the locking operations must precede the first unlock operation in the transaction. It has two phases: 1. the expanding phase (locks are acquired), and 2. the shrinking phase (locks are released).

63. What are the different types of two-phase locking (2PL)?
Ans: 1. Basic: the technique described above. 2. Conservative: requires a transaction to lock all the items it accesses before the transaction begins its execution, by pre-declaring its read set and write set. 3. Strict: guarantees that a transaction does not release any of its exclusive locks until after it commits or aborts. 4. Rigorous: guarantees that a transaction does not release any of its locks (including shared locks) until after it commits or aborts.

64. What is a deadlock?
Ans: Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T' in the set. Each transaction is then stuck in a waiting queue, waiting for one of the other transactions to release a lock.

65. What are triggers?
Ans: Triggers are PL/SQL blocks defining an action the database should take when some database-related event occurs. Triggers may be used to supplement declarative referential integrity, to enforce complex business rules, or to audit changes to data.
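As a concrete illustration of Q65, here is a minimal sketch of an audit trigger in Oracle PL/SQL style; the emp and salary_audit tables are hypothetical, invented for this example:

CREATE OR REPLACE TRIGGER trg_salary_audit
AFTER UPDATE OF salary ON emp
FOR EACH ROW
BEGIN
  -- record which employee was changed, the old and new values, and when
  INSERT INTO salary_audit (emp_no, old_salary, new_salary, changed_on)
  VALUES (:OLD.emp_no, :OLD.salary, :NEW.salary, SYSDATE);
END;
/

Every UPDATE of the salary column now leaves an audit row automatically, with no change to the applications that issue the update.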

Database Models
Database systems can be based on different data models or database models respectively. A data model is a collection of concepts and rules for the description of the structure of the database. Structure of the database means the data types, the constraints, and the relationships used for the description or storage of data. The most often used data models are:

Network Model and Hierarchical Model

The network model and the hierarchical model are the predecessors of the relational model. They build upon individual data sets and are able to express hierarchical or network-like structures of the real world.

Network Model and Hierarchical Model

Relational Model
The relational model is the best known database model and the one most often implemented in today's DBMSs. It defines a database as a collection of tables (relations) which contain all data. This module deals predominantly with the relational database model and the database systems based on it.

Relational Database Model

Object-oriented Model
Object-oriented models define a database as a collection of objects with features and methods. A detailed discussion of object-oriented databases follows in an advanced module.

Schematic Representation of an Object-oriented Database Model

Object-relational Model
Object-oriented models are very powerful but also quite complex. The relatively new object-relational database model extends the widespread and simple relational database model with some basic object-oriented concepts. These allow us to work with the widely known relational database model while also gaining some advantages of the object-oriented model without its complexity.

Schematic Representation of the Object-relational Database Model

Database Schemes and Database Instances
Independent of the database model, it is important to differentiate between the description of the database and the database itself. The description of the database is called the database scheme or also metadata. The database scheme is defined during the database design process and changes very rarely afterwards. The actual content of the database, the data, changes often over the years. A database state at a specific time, defined through the currently existing content and relationships and their attributes, is called a database instance. The following illustration shows that a database scheme can be looked at like a template or building plan for one or several database instances.

Analogy: Database Schemes and Building Plans

When designing a database, two levels of abstraction and their respective data schemes are distinguished: the conceptual and the logical data scheme.
Conceptual Data Scheme: A conceptual data scheme is a system-independent data description. That means that it is independent of the database or computer systems used. (Translated) (ZEHNDER 1998)
Logical Data Scheme:

A logical data scheme describes the data in the data definition language (DDL) of a specific database management system. (Translated) (ZEHNDER 1998) The conceptual data scheme is oriented exclusively toward the database application and therefore toward the real world. It does not consider any technical data infrastructure, such as the DBMS or computer systems that may eventually be employed. Entity relationship diagrams and relations are tools for the development of a conceptual scheme. When designing a database, the logical data scheme is derived from the conceptual data scheme (see unit Relational Database Design). This derivation results in a logical data scheme for one specific application and one specific DBMS. A DB development system then converts the logical scheme directly into instructions for the DBMS.

DBMS Architecture and Data Independence
Database management systems are complex software systems which were often developed and optimised over years. From the view of the user, however, most of them have a quite similar basic architecture. The discussion of this basic architecture shall help to understand the connection with data modelling and the 'data independence' of the database approach postulated in the introduction to this module.
Three-Schemes Architecture
Building on the conceptual and the derived logical scheme (discussed in unit Database Models, Schemes and Instances), this unit explains two additional schemes, the external scheme and the internal scheme, which help to understand the DBMS architecture.
External Scheme: An external data scheme describes the information about the user view of specific users (single users and user groups) and the specific methods and constraints connected with this information. (Translated) (ZEHNDER 1998)
Internal Scheme:

The internal data scheme describes the content of the data and the required service functionality which is used for the operation of the DBMS. (Translated) (ZEHNDER 1998) Therefore, the internal scheme describes the data from a view very close to the computer or system in general. It complements the logical scheme with technical aspects such as storage methods or auxiliary functions for greater efficiency.

Three-Schemes Architecture

The right-hand side of the representation above is also called the three-schemes architecture: internal, logical, and external scheme. While the internal scheme describes the physical grouping of the data and the use of the storage space, the logical scheme (derived from the conceptual scheme) describes the basic construction of the data structure. The external scheme of a specific application generally highlights only that part of the logical scheme which is relevant for its application. Therefore, a database has exactly one internal and one logical scheme but may have several external schemes for several applications using this database. The aim of the three-schemes architecture is the separation of the user applications from the physical database, the stored data. Physically the data exists only at the internal level, while other forms of representation are calculated or derived if needed. The DBMS has the task of realising the mapping between each of these levels.
Data Independence
With knowledge of the three-schemes architecture, the term data independence can be explained as follows: each higher level of the data architecture is immune to changes of the next lower level of the architecture.
Physical Independence:


The logical scheme may stay unchanged even though the storage space or type of some data is changed for reasons of optimisation or reorganisation.
Logical Independence:
The external scheme may stay unchanged for most changes of the logical scheme. This is especially desirable as in this case the application software does not need to be modified or recompiled.

Database Languages
DDL
For describing data and data structures a suitable description tool, a data definition language (DDL), is needed. With it, a data scheme can be defined and also changed later. Typical DDL operations (with their respective keywords in the structured query language SQL):

- Creation of tables and definition of attributes (CREATE TABLE ...)
- Change of tables by adding or deleting attributes (ALTER TABLE ...)
- Deletion of a whole table including its content (!) (DROP TABLE ...)
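A minimal sketch of these three DDL operations follows; exact syntax varies slightly between DBMSs, and the customer table and its attributes are invented for illustration:

-- create a table and define its attributes
CREATE TABLE customer (
    cust_id INT PRIMARY KEY,
    name    VARCHAR(50) NOT NULL
);

-- change the table by adding an attribute
ALTER TABLE customer ADD email VARCHAR(100);

-- delete the whole table including its content (!)
DROP TABLE customer;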

DML

DML

Additionally, a language is needed for the operations on data such as storing, searching, reading, and changing, the so-called data manipulation. Such operations can be done with a data manipulation language (DML). Within such languages keywords like insert, modify, update, delete, select, etc. are common. Typical DML operations (with their respective keywords in the structured query language SQL):

- Add data (INSERT)
- Change data (UPDATE)
- Delete data (DELETE)
- Query data (SELECT)
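The four DML operations in one short sketch, run against the hypothetical customer table from the DDL example above:

-- add data
INSERT INTO customer (cust_id, name, email)
VALUES (1, 'Smith', 'smith@example.com');

-- change data
UPDATE customer SET email = 'j.smith@example.com' WHERE cust_id = 1;

-- query data
SELECT cust_id, name FROM customer WHERE name = 'Smith';

-- delete data
DELETE FROM customer WHERE cust_id = 1;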

Often these two languages for the definition and manipulation of databases are combined in one comprehensive language. A good example is the structured query language SQL, which is discussed in detail in the lesson Structured Query Language SQL.

Three Level Database Architecture
Data and Related Structures
Data are actually stored as bits, or numbers and strings, but it is difficult to work with data at this level.

It is necessary to view data at different levels of abstraction. Schema:

Description of data at some level. Each level has its own schema.

We will be concerned with three forms of schemas:


physical, conceptual, and external.

Physical Data Level
The physical schema describes the details of how data is stored: files, indices, etc. on the random access disk system. It also typically describes the record layout of files and the type of files (hash, B-tree, flat). Early applications worked at this level and explicitly dealt with these details, e.g., minimizing physical distances between related data and organizing the data structures within the file (blocked records, linked lists of blocks, etc.).
Problem:

- Routines are hardcoded to deal with the physical representation.
- Changes to data structures are difficult to make.
- Application code becomes complex since it must deal with details.
- Rapid implementation of new features is very difficult.


Conceptual Data Level
Also referred to as the logical level, it hides the details of the physical level.

In the relational model, the conceptual schema presents data as a set of tables.

The DBMS maps data access between the conceptual and physical schemas automatically.

The physical schema can be changed without changing the application; the DBMS must only change the mapping from conceptual to physical. This is referred to as physical data independence.

External Data Level
In the relational model, the external schema also presents data as a set of relations. An external schema specifies a view of the data in terms of the conceptual level, tailored to the needs of a particular category of users. Hiding portions of the stored data from certain users begins to implement a level of security and simplifies the view for those users. Examples:

Students should not see faculty salaries. Faculty should not see billing or payment data.

Information that can be derived from stored data might be viewed as if it were stored.

GPA not stored, calculated when needed.

Applications are written in terms of an external schema. The external view is computed when accessed; it is not stored. Different external schemas can be provided to different categories of users. Translation from the external level to the conceptual level is done automatically by the DBMS at run time. The conceptual schema can be changed without changing the application: only the mapping from external to conceptual must be changed. This is referred to as conceptual (logical) data independence.
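As a sketch of how an external schema is defined on top of the conceptual schema, the derived GPA mentioned above could be exposed as a view; the student and enrollment tables and their columns are hypothetical, invented for this example:

-- external schema: GPA is derived at access time, never stored
CREATE VIEW student_gpa AS
SELECT s.student_id,
       s.name,
       -- assumes grade_points and credits are numeric (decimal) columns
       SUM(e.grade_points * e.credits) / SUM(e.credits) AS gpa
FROM student s
JOIN enrollment e ON e.student_id = s.student_id
GROUP BY s.student_id, s.name;

Applications written against student_gpa never see how grades are stored; the view is computed when accessed, which is exactly the external-level behavior described above.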


Data Model
Schema: description of data at some level (e.g., tables, attributes, constraints, domains).
Model: tools and languages for describing:

- Conceptual/logical and external schemas, described by the data definition language (DDL)
- Integrity constraints and domains, described by the DDL
- Operations on data, described by the data manipulation language (DML)
- Directives that influence the physical schema (affecting performance, not semantics), described by the storage definition language (SDL)

Data Independence
Logical data independence:

- Immunity of external models to changes in the logical model
- Occurs at the user interface level

Physical data independence:

- Immunity of the logical model to changes in the internal model
- Occurs at the logical interface level


OLTP vs. OLAP


We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed and current data, and the schema used to store transactional data is the entity model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).

The following points summarize the major differences between OLTP and OLAP system design:

- Source of data. OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidated data; OLAP data comes from the various OLTP databases.
- Purpose of data. OLTP: to control and run fundamental business tasks. OLAP: to help with planning, problem solving, and decision support.
- What the data reveals. OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.
- Inserts and updates. OLTP: short and fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.
- Queries. OLTP: relatively standardized and simple queries returning relatively few records. OLAP: often complex queries involving aggregations.
- Processing speed. OLTP: typically very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
- Space requirements. OLTP: can be relatively small if historical data is archived. OLAP: larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.
- Database design. OLTP: highly normalized, with many tables. OLAP: typically de-normalized, with fewer tables; uses star and/or snowflake schemas.
- Backup and recovery. OLTP: back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP: instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

OLTP and OLAP Data Integration: A Review of Feasible Implementation Methods and Architectures for Real Time Data Analysis
Introduction and Foundation
Database technology is at the center of many types of information systems, among them decision support systems and executive information systems. Online Transaction Processing (OLTP) environments use database technology to transact and query against data, and support the daily operational needs of the business enterprise. Online Analytical Processing (OLAP) environments use database technology to support analysis and mining of long data horizons, and to provide decision makers with a platform from which to generate decision-making information. Decision support systems and business intelligence tools, as well as data mining analytics, generally utilize a different data structure to do their work. One of the main disadvantages of the OLAP environment is the currency of the data. Because the process by which data is extracted, transformed, and loaded into the OLAP environment can be relatively slow by transactional data standards, the ability to achieve real-time data analysis is lost. Kannan [21] contends that the real value of having real-time intelligence is that business processes can be optimized. The importance of generating real-time business intelligence is that it is a building block for achieving better business process management and true business process optimization. OLTP database structures are characterized by storage of atomic values, meaning individual fields cannot store multiple values. They are also transaction oriented, as the name implies. Alternatively, OLAP database structures are generally aggregated and summarized. There is an analytical orientation to the nature of the data, and the values represent a historical view of the entity. According to Terr, real-time data warehousing combines real-time activity with OLAP concepts. He discusses the problem of latency and differentiates between real-time and near real-time systems. The business goal of using real-time data analysis is further supported by the compelling reasons for building real-time data analysis systems.

Significance

The significance of this research is that organizations need real-time data analysis in order to improve decision support systems, business intelligence, and business activity monitoring. The migration of the data into the data warehouse using conventional ETL tools and methods, performed on large sets of data, creates an information latency problem because of the time required to perform the ETL function and migrate the data to a separate OLAP platform. Reimers [30] points out that the need to speed up decision-making in business by using real-time analytic tools and platforms is strongly motivated by the fear of falling behind the competition. Database management systems (DBMSs) are generally implemented against data in one of two main structured data environments: the Online Transaction Processing (OLTP) environment and the Online Analytical Processing (OLAP) environment. Conventionally, a database separate from the OLTP environment is used to implement the OLAP environment. The standard design for OLTP schemas originated in Codd's seminal work. Codd [8] published the initial work on designing relational databases in his paper titled A Relational Model of Data for Large Shared Data Banks. Relational databases are exceptionally well suited for transactional purposes, and have proliferated in the business enterprise as the principal technology to host and maintain operational data. The OLAP environment serves a different business purpose and is quite different from the OLTP environment. Data from the transactional OLTP environment is extracted, transformed, and loaded into the OLAP environment, where advanced DBMS products can utilize analytics native to the Structured Query Language (SQL) release 3 (1999) standard to provide business intelligence. The OLAP environment was designed to provide for a successful implementation of decision support system applications. The extraction, transformation, and loading (ETL) process can be time consuming, and is generally a large determining factor in the frequency of updates to the OLAP environment. To this extent, the data in the data warehouse can be less than useful in decision support, since it is not real-time data such as is found in the OLTP environment. The research question is: is it possible to integrate the OLTP and OLAP environments without disruption to transactional performance, so that real-time, or near real-time, data analysis may occur? The OLTP and OLAP environments are separate and distinct because they are designed to accommodate different business purposes. The problem of integrating these two environments is multi-faceted. It includes consideration of the logical database models for the two environments, the integration of the physical data, and the database management system (DBMS) engine that runs against the data. Integration failure with any of these criteria could result in the inability to integrate the OLTP and OLAP environments. Broken down into three subordinate questions, the problem becomes tractable and can be articulated as:
a. Is it possible to use one logical database model for OLAP and OLTP?
b. Is it possible to integrate the physical data residing in the OLAP and OLTP repositories?
c. Is it possible to use the same DBMS engine to query the OLAP and OLTP data?

What is Business Intelligence?


Business Intelligence (BI) is the technology infrastructure for gaining maximum information from available data for the purpose of improving business processes. Typical BI infrastructure components are software solutions for gathering, cleansing, integrating, analyzing, and sharing data. Business Intelligence produces analysis and provides credible information to help make effective, high-quality business decisions. The most common kinds of Business Intelligence systems are:

- EIS - Executive Information Systems
- DSS - Decision Support Systems
- MIS - Management Information Systems
- GIS - Geographic Information Systems
- OLAP - Online Analytical Processing and multidimensional analysis
- CRM - Customer Relationship Management
Business Intelligence systems are commonly based on data warehouse technology. A data warehouse (DW) gathers information from a wide range of a company's operational systems, and Business Intelligence systems are built on top of it. Data loaded into a DW is usually well integrated and cleaned, which allows the production of credible information that reflects the so-called 'one version of the truth'.

Business Intelligence tools


The most popular BI tools on the market are:
- Siebel Business Analytics Applications
- Business Intelligence
- BusinessObjects XI
- Cognos 8 BI
- Hyperion System 9 BI+
- Analysis Services
- Dynamic Enterprise Dashboards
- Open BI Suite
- WebFOCUS Business Intelligence
- QlikView (QlikTech)
- Enterprise Analytics
- InfoMaker
- IOLAP
- ShowCase

ETL tools List of the most popular ETL tools:


- Power Center
- Websphere DataStage (formerly known as Ascential DataStage)
- BusinessObjects Data Integrator
- Cognos Data Manager (formerly known as Cognos DecisionStream)
- SQL Server Integration Services
- Data Integrator (formerly known as Sunopsis Data Conductor)
- Data Integration Studio
- Warehouse Builder
- Data Migrator
- Pentaho Data Integration
- DT/Studio
- ETL4ALL
- DB2 Warehouse Edition
- Data Integrator
- Transformation Manager

- DataFlow
- Data Integrated Suite ETL
- Talend Open Studio
- Expressor Semantic Data Integration System
- Elixir Repertoire
- CloverETL

ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:
- Extracting the data from source systems (SAP, ERP, and other operational systems); data from different source systems is converted into one consolidated data warehouse format ready for transformation processing.
- Transforming the data, which may involve tasks such as: applying business rules (so-called derivations, e.g., calculating new measures and dimensions), cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F"), filtering (e.g., selecting only certain columns to load), splitting a column into multiple columns and vice versa, joining together data from multiple sources (e.g., lookup, merge), transposing rows and columns, and applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty, reject the row from processing).
- Loading the data into a data warehouse, data repository, or other reporting applications.
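A minimal sketch of the cleaning and loading steps expressed in SQL; the staging.customer (raw extract) and dw.customer tables are hypothetical, invented for illustration:

INSERT INTO dw.customer (cust_id, gender, revenue)
SELECT cust_id,
       -- cleaning: map source values to the warehouse coding
       CASE gender WHEN 'Male' THEN 'M' WHEN 'Female' THEN 'F' ELSE 'U' END,
       -- cleaning: map NULL to 0
       COALESCE(revenue, 0)
FROM staging.customer
WHERE cust_id IS NOT NULL;   -- validation: reject rows without a key

In a real ETL tool these rules would typically be configured graphically or in the tool's own language; the SQL here only shows the kind of transformation involved.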


Data warehouse
A data warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM, or ERP systems). It may gather manual inputs from users determining criteria and parameters for grouping or classifying records. A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed, and finally aggregated, and it becomes ready to be loaded into the data warehouse. That database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems; this is ensured by a strategy implemented in an ETL process. A data warehouse is a dedicated database which contains detailed, stable, non-volatile, and consistent data which can be analyzed in a time-variant manner. Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized. Business Intelligence - Data Warehouse - ETL:

Keeping the data warehouse filled with very detailed and inefficiently selected data may cause the database to grow to a huge size, which may be difficult to manage and unusable. To significantly reduce the number of rows in the data warehouse, the data is aggregated, which leads to easier data maintenance and better efficiency.


Key data warehouse systems and the most widely used database engines for storing and serving data for enterprise business intelligence and performance management include:

- Business Information Warehouse (SAP Netweaver BI)

DataWarehouse Architecture
The main difference between the database architecture in a standard, on-line transaction processing oriented system (usually an ERP or CRM system) and a data warehouse is that the system's relational model is usually de-normalized into dimension and fact tables, which are typical of a data warehouse database design. The differences in the database architectures are caused by the different purposes of their existence. In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining the usefulness of the application. That kind of database needs to handle inserting thousands of new records every hour. To achieve this, the database is usually optimized for the speed of inserts, updates, and deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE, and DELETE. In contrast to OLTP systems, a data warehouse is a system that should give a response to almost any question regarding company performance measures. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions, so the information should be accessible quickly and easily, but it does not need to be the most recent possible or at the lowest level of detail. Usually data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month (data is available for the end users around the 5th working day of a new month); very often the two approaches are combined. The main challenge of a data warehouse architecture is to enable business to access historical, summarized data, with read-only access for the end users. Again, from a technical standpoint, most SQL queries would start with a SELECT statement.
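To illustrate the de-normalized dimension-and-fact design mentioned above, here is a minimal star schema sketch; all table and column names are invented for this example:

-- dimension tables: descriptive, relatively small
CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,
    month     INT,
    year      INT
);
CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(30)
);

-- fact table: large, keyed by the dimensions, holding additive measures
CREATE TABLE fact_sales (
    date_key    INT REFERENCES dim_date (date_key),
    product_key INT REFERENCES dim_product (product_key),
    quantity    INT,
    amount      DECIMAL(12,2)
);

-- a typical warehouse query: a SELECT aggregating across the star
SELECT d.year, p.category, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, p.category;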

Data mart
Data marts are designed to fulfill the role of strategic decision support for managers responsible for a specific business area. A data warehouse operates on an enterprise level and contains all data used for reporting and analysis, while a data mart is used by a specific business department and is focused on a specific subject (business area). A scheduled ETL process populates data marts with subject-specific information from the data warehouse. The typical approach for maintaining a data warehouse environment with data marts is to have one enterprise data warehouse, comprising divisional and regional data warehouse instances, together with a set of dependent data marts which derive their information directly from the data warehouse. It is crucial to keep data marts consistent with the enterprise-wide data warehouse system, as this ensures that they are properly defined, constituted, and managed. Otherwise the DW environment's mission of being "the single version of the truth" becomes a myth. However, in data warehouse systems there are cases where developing an independent data mart is the only way to get the required figures out of the DW environment. Developing independent data marts, which are not 100% reconciled with the data warehouse environment and in most cases include a supplementary source of data, must be clearly understood, and all the associated risks must be identified. Data marts are usually maintained and made available in the same environment as the data warehouse (systems like Oracle, Teradata, MS SQL Server, SAS) and are smaller in size than the enterprise data warehouse. There are also many cases when data marts are created and refreshed on a server, then distributed to the end users using shared drives or email and stored locally. This approach generates high maintenance costs, but makes it possible to keep data marts available offline. There are two approaches to organizing data in data marts:
- a one-dimensional, non-aggregated data set; in most cases the data is processed and summarized many times by the reporting application.
- aggregated data organized in a multidimensional structure; the data is aggregated only once and is ready for business analysis right away.
In the next stage, the data from data marts is usually gathered by a reporting or analytic processing (OLAP) tool, such as Cognos, Business Objects, Hyperion, Pentaho BI, or Microsoft Excel, and made available for business analysis. Usually a company maintains multiple data marts serving the needs of finance, marketing, sales, operations, IT, and other departments as needed. Sample uses of data marts in an organization: CRM reporting, customer migration analysis, production planning, monitoring of marketing campaigns, performance indicators, internal ratings and scoring, risk management, integration with other systems (systems which use the processed DW data), and more uses specific to the individual business.


ETL is short for extract, transform, load: three database functions that are combined into one tool to pull data out of one database and place it into another database.

Extract is the process of reading data from a database. Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database; transformation occurs by using rules or lookup tables or by combining the data with other data. Load is the process of writing the data into the target database. The ETL process is also very often referred to as the data integration process, and an ETL tool as a data integration platform. The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization, and data consolidation. The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.
ETL Tools

At present the most popular and widely used ETL tools and applications on the market are:
- DataStage
- Informatica PowerCenter
- Oracle ETL
- Ab Initio
- Data Integration - Kettle Project (open source ETL)

Data mining
Data mining is the use of intelligent information management tools to discover knowledge and extract information that helps support the decision-making process in an organization. Data mining is an approach to discovering data behavior in large data sets by exploring the data, fitting different models, and investigating different relationships in vast repositories. The information extracted with a data mining tool can be used in such areas as decision support, prediction, sales forecasting, financial and risk analysis, estimation, and optimization. Sample real-world business uses of data mining applications include:
- customer classification and retention campaigns

- guest behavior prediction and relevant content delivery
- detecting occurrences of fraud such as money laundering and tax evasion, matching crime and terrorist patterns, etc.
- analysis of vast data stores
The most widely known and encountered data mining techniques include statistical modeling, which uses mathematical equations to do the analysis. The most popular statistical models are generalized linear models, discriminant analysis, linear regression, and logistic regression.

Neural networks

Data Management
Data management is a general term that covers a broad range of data applications. It may refer to basic data management concepts or to specific technologies. Some notable applications include 1) data design, 2) data storage, and 3) data security.
1. Data design, or data architecture, refers to the way data is structured. For example, when creating a file format for an application, the developer must decide how to organize the data in the file. For some applications, it may make sense to store data in a text format, while other programs may benefit from a binary file format. Regardless of what format the developer uses, the data must be organized within the file in a structure that can be recognized by the associated program.
2. Data storage refers to the many different ways of storing data. This includes hard drives, flash memory, optical media, and temporary RAM storage. When selecting an appropriate data storage medium, concepts such as data access and data integrity are important to consider. For example, data that is accessed and modified on a regular basis should be stored on a hard drive or flash media, because these types of media provide quick access and allow the data to be moved or changed. Archived data, on the other hand, may be stored on optical media, such as CDs and DVDs, since the data does not need to be changed. Optical discs also maintain data integrity longer than hard drives, which makes them a good choice for archival purposes.
3. Data security involves protecting computer data. Many individuals and businesses store valuable data on computer systems. If you've ever felt like your life is stored on your computer, you understand how important data can be. Therefore, it is wise to take steps to protect the privacy and integrity of your data. Some steps include installing a firewall to prevent unauthorized access to your computer and encrypting personal data that is submitted online or shared with other users. It is also important to back up your data so that you will be able to recover your files in case your primary storage device fails.

DATABASE SECURITY
Security is a major concern for modern systems, network, and database administrators. It is natural for an administrator to worry about hackers and external attacks while implementing security, but there is more to it. It is essential to first implement security within the organization, to make sure the right people have access to the right data. Without these security measures in place, you might find someone destroying your valuable data, selling your company's secrets to your competitors, or invading the privacy of others. Primarily, a security plan must identify which users in the organization can see which data and perform which activities in the database.

Authentication vs. Authorization
Authentication is any process by which you verify that someone is who they claim they are. This usually involves a username and a password, but can include any other method of demonstrating identity, such as a smart card, retina scan, voice recognition, or fingerprints. Authentication is equivalent to showing your driver's license at the ticket counter at the airport. Authentication systems provide answers to the questions:

Who is the user? Is the user really who he/she represents himself to be?

Authorization, by contrast, is the mechanism by which a system determines what level of access a particular authenticated user should have to secure resources controlled by the system. For example, a database management system might be designed so as to provide certain specified individuals with the ability to retrieve information from a database but not the ability to change data stored in the database, while giving other individuals the ability to change data. Authorization systems provide answers to the questions:

Is user X authorized to access resource R? Is user X authorized to perform operation P? Is user X authorized to perform operation P on resource R?

Authentication and authorization are somewhat tightly coupled mechanisms: authorization systems depend on secure authentication systems to ensure that users are who they claim to be, and thus prevent unauthorized users from gaining access to secured resources.
Authentication
SQL Server supports two authentication modes:


Windows Authentication Mode: With Windows authentication, you do not have to specify a login name and password to connect to SQL Server. Instead, your access to SQL Server is controlled by the Windows NT/2000 account (or the group to which your account belongs) that you used to log in to the Windows operating system on the client computer or workstation. A DBA must specify to SQL Server all the Microsoft Windows NT/2000 accounts or groups that can connect to SQL Server. This authentication mode is the default because of its inherently better security. When it is used, Windows NT is responsible for managing users' connections to SQL Server through the user's account name or group membership.

SQL Server Authentication: When a user connects with a specified login name and password, SQL Server performs the authentication itself by checking whether a SQL Server login account has been set up and whether the specified password matches the one previously recorded. If SQL Server does not have such a login account set up, authentication fails and the user receives an error message.

Windows authentication is the recommended security mode, as it is more secure and you don't have to send login names and passwords over the network. You should avoid mixed mode unless you have a non-Windows NT/2000 environment, your SQL Server is installed on Windows 95/98, or you need backward compatibility with existing applications.
Authorization
Role-based Security
Role-based security is a form of user-level security where the server does not focus on the individual user's identity but rather on the logical role he or she is in. A role is simply a group to which individual logins and users can be added, so that permissions can be applied to the group instead of to each individual login and user. There are three types of roles in SQL Server 7.0 and 2000:

- Fixed server roles
- Fixed database roles
- Application roles

Fixed Server Roles
Fixed server roles are server-wide roles. Logins can be added to these roles to gain the associated administrative permissions of the role. Fixed server roles cannot be altered and new server roles cannot be created. Here are the fixed server roles and their associated permissions in SQL Server 2000:

- sysadmin: Can perform any activity in SQL Server
- serveradmin: Can set server-wide configuration options and shut down the server
- setupadmin: Can manage linked servers and startup procedures
- securityadmin: Can manage logins and CREATE DATABASE permissions; can also read error logs and change passwords
- processadmin: Can manage processes running in SQL Server
- dbcreator: Can create, alter, and drop databases
- diskadmin: Can manage disk files
- bulkadmin: Can execute BULK INSERT statements

Fixed Database Roles
Each database has a set of fixed database roles, to which database users can be added. These fixed database roles are unique within the database. While the permissions of fixed database roles cannot be altered, new database roles can be created. Here are the fixed database roles and their associated permissions in SQL Server 2000:



- db_owner: Has all permissions in the database
- db_accessadmin: Can add or remove user IDs
- db_securityadmin: Can manage all permissions, object ownerships, roles, and role memberships
- db_ddladmin: Can issue all DDL statements, but cannot issue GRANT, REVOKE, or DENY statements
- db_backupoperator: Can issue DBCC, CHECKPOINT, and BACKUP statements
- db_datareader: Can select all data from any user table in the database
- db_datawriter: Can modify any data in any user table in the database
- db_denydatareader: Cannot select any data from any user table in the database
- db_denydatawriter: Cannot modify any data in any user table in the database
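As a sketch of how these roles are used in SQL Server 2000, the system stored procedures sp_addlogin, sp_grantdbaccess, and sp_addrolemember create a login, map it to a database user, and add that user to a role; the login, user, and password names here are hypothetical:

-- create a SQL Server login and grant it access to the current database
EXEC sp_addlogin 'report_login', 'StrongPassword1';
EXEC sp_grantdbaccess 'report_login', 'report_user';

-- read-only access: add the user to the db_datareader fixed database role
EXEC sp_addrolemember 'db_datareader', 'report_user';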

Principals
These are objects (for example, a user login, a role, or an application) that may be granted permission to access particular database objects. SQL Server divides principals into three classes:

Windows principals: These represent Windows user accounts or groups, authenticated using Windows security.

SQL Server principals: These are server-level logins or groups that are authenticated using SQL Server security.

Database principals: These include database users, groups, and roles, as well as application roles.

Securables
Securables are objects (a table or view, for example) to which access can be controlled. These are the resources to which the DBMS authorization system regulates access. Some securables can be contained within others, creating nested hierarchies called "scopes" that can themselves be secured. The securable scopes are server, database, and schema.

Here are a few examples:
- Server-level securables: logins, databases
- Database-level securables: users, roles, schemas
- Schema-level securables: tables, views, constraints, types, procedures

Permissions
These are individual rights, granted (or denied) to a principal, to access a securable object. The following T-SQL commands are used to manage permissions at the user and role level.

- GRANT: Grants the specified permission (SELECT, DELETE, etc.) to the specified user or role in the current database
- REVOKE: Removes a previously granted or denied permission from a user or role in the current database
- DENY: Denies a specific permission to the specified user or role in the current database

Using the above commands, permissions can be granted, denied, or revoked to users and roles on all database objects.
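A short sketch of the three commands in T-SQL; user1 and the orders table are hypothetical names invented for illustration:

-- allow user1 to read and insert
GRANT SELECT, INSERT ON orders TO user1;

-- explicitly forbid deletes, overriding anything granted through roles
DENY DELETE ON orders TO user1;

-- withdraw the earlier INSERT permission (back to neither granted nor denied)
REVOKE INSERT ON orders FROM user1;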

There is no way to manage permissions at the row level. That is, in a given table, you can't grant SELECT permission on a specific row to User1 and deny SELECT permission on another row to User2. This kind of security can be implemented by creating user-specific views and granting SELECT permission on those views to the users.
Using Views as Security Mechanisms
Views can serve as security mechanisms by restricting the data available to users. Some data can be accessible to users for query and modification, while the rest of the table or database is invisible and inaccessible. Permission to access the subset of data in a view must be granted, denied, or revoked, regardless of the set of permissions in force on the underlying table(s). For example, suppose the salary column in a table contains confidential employee information, but the rest of the columns contain information that should be available to all users. You can define a view that includes all of the columns in the table with the exception of the sensitive salary column. By defining different views and granting permissions selectively on them, users, groups, or roles can be restricted to different subsets of data. For example:

- Access can be restricted to a subset of the rows of a base table. For example, define a view that contains only rows for business and psychology books, keeping information about other types of books hidden from users.
- Access can be restricted to a subset of the columns of a base table. For example, define a view that contains all the rows of the titles table but omits the royalty and advance columns, because this information is sensitive.
- Access can be restricted to a row-and-column subset of a base table.

- Access can be restricted to the rows that qualify for a join of more than one base table. For example, define a view that joins the titles, authors, and titleauthor tables to display the names of authors and the books they have written. This view hides personal data about the authors and financial information about the books.
- Access can be restricted to a statistical summary of data in a base table. For example, define a view that contains only the average price of each type of book.
- Access can be restricted to a subset of another view or of some combination of views and base tables.
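A minimal sketch of the salary example above; the employee table, its columns, and the staff_role role are hypothetical names invented for illustration:

-- hide the sensitive salary column behind a view
CREATE VIEW emp_public AS
SELECT emp_no, name, department   -- salary deliberately omitted
FROM employee;

-- users query the view, never the base table
GRANT SELECT ON emp_public TO staff_role;
DENY  SELECT ON employee   TO staff_role;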

SQL Injection
SQL injection is a technique whereby an intruder enters data that causes your application to execute SQL statements you did not intend it to. SQL injection is possible as soon as there is dynamic SQL that is handled carelessly, whether SQL statements sent from the client or dynamic SQL generated in T-SQL stored procedures. SQL injection may be possible if input is not filtered for escape characters and is then passed into a SQL statement. This results in the potential manipulation of the statements performed on the database by the end user of the application. The following line of code illustrates this vulnerability:
statement := "SELECT * FROM users WHERE name = '" + userName + "';"
This SQL code is designed to pull up the records of a specified username from its table of users. However, if the "userName" variable is crafted in a specific way by a malicious user, the SQL statement may do more than the code author intended. For example, setting the "userName" variable as
a' or 't'='t
renders this SQL statement:
SELECT * FROM users WHERE name = 'a' OR 't'='t';
If this code were used in an authentication procedure, this example could be used to force the selection of a valid username, because the evaluation of 't'='t' is always true. On some SQL servers, such as MS SQL Server, any valid SQL command may be injected via this method, including the execution of multiple statements. The following value of "userName" in the statement below would cause the deletion of the "users" table as well as the selection of all data from the "data" table (in essence revealing the information of every user):
a';DROP TABLE users; SELECT * FROM data WHERE name LIKE '%
This input renders the final SQL statement as follows:

SELECT * FROM users WHERE name = 'a';DROP TABLE users; SELECT * FROM DATA WHERE name LIKE '%'; Other SQL implementations won't execute multiple commands in the same SQL query as a security measure. This prevents crackers from injecting entirely separate queries, but doesn't stop them from modifying queries.
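The standard defense is to keep user input out of the SQL text entirely by using parameterized statements. Here is a minimal sketch in T-SQL using the system procedure sp_executesql; the variable names and the users table are hypothetical:

DECLARE @userName nvarchar(50);
SET @userName = N'a'' OR ''t''=''t';   -- hostile input is just a value here

EXEC sp_executesql
     N'SELECT * FROM users WHERE name = @name',  -- the SQL text is fixed
     N'@name nvarchar(50)',
     @name = @userName;

Because the statement text never changes, the injected quotes are treated as data to compare against, not as SQL code to execute.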

