Vous êtes sur la page 1sur 95

Kanpur Institute of Technology Notes MCA-III Semester DBMS (MCA 313) Unit 1

DBMS-A Database Management System (DBMS) is a set of computer programs that controls the creation, maintenance, and the use of a database. It allows organizations to place control of database development in the hands of database administrators (DBAs) and other specialists. A DBMS is a system software package that helps the use of integrated collection of data records and files known as databases. It allows different user application programs to easily access the same database. DBMSs may use any of a variety of database models, such as the network model or relational model. In large systems, a DBMS allows users and other software to store and retrieve data in a structured way. Instead of having to write computer programs to extract information, user can ask simple questions in a query language. Thus, many DBMS packages provide Fourth-generation programming language (4GLs) and other application development features. It helps to specify the logical organization for a database and access and use the information within a database. It provides facilities for controlling data access, enforcing data integrity, managing concurrency, and restoring the database from backups. A DBMS also provides the ability to logically present database information to users. A DBMS is a set of Windows has become an increasingly popular platform for the deployment of database applications. The challenges to doing this successfully can be considerable. Database applications must simultaneously provide high performance, reliability, and scalability, all at a low total cost of ownership. Oracle Database 10g on Windows provides these capabilities. The grid is revolutionizing the way companies conduct business with customers, partners, and employees. Oracle10g for Windows enables customers to thrive in these new business environments. With solutions based on Oracle10g technology, any organization, large or small, can seize new opportunities while simultaneously reducing technology costs. The Oracle Database supports Windows operating systems on both Itanium and AMD64/EM64T hardware. You can download Oracle Database 10g from the software download page. To learn more about the Windows operating systems Oracle is certified on, log into Oracle Metalink and click on the Certify tab. software programs that controls the organization, storage, management, and retrieval of data in a database. DBMSs are categorized according to their data structures or types. The DBMS accepts requests for data from an application program and instructs the operating system to transfer the appropriate data. The queries and responses must be submitted and received according to a format that conforms to one or more applicable protocols. When a DBMS is used, information systems can be changed much more easily as the organization's information requirements change. New categories of data can be added to the database without disruption to the existing system. Database servers are computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory

and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.

Components of DBMS

DBMS Engine accepts logical request from the various other DBMS subsystems, converts them into physical equivalents, and actually accesses the database and data dictionary as they exist on a storage device. Data Definition Subsystem helps user to create and maintain the data dictionary and define the structure of the files in a database. Data Manipulation Subsystem helps user to add, change, and delete information in a database and query it for valuable information. Software tools within the data manipulation subsystem are most often the primary interface between user and the information contained in a database. It allows user to specify its logical information requirements. Application Generation Subsystem contains facilities to help users to develop transaction-intensive applications. It usually requires that user perform a detailed series of tasks to process a transaction. It facilitates easy-to-use data entry screens, programming languages, and interfaces. Data Administration Subsystem helps users to manage the overall database environment by providing facilities for backup and recovery, security management, query optimization, concurrency control, and change management.

External, Logical and Internal view

Traditional View of Data A database management system provides the ability for many different users to share data and process resources. But as there can be many different users, there are many different database

needs. The question now is: How can a single, unified database meet the differing requirement of so many users? A DBMS minimizes these problems by providing two views of the database data: an external view(or User view), logical view(or conceptual view)and physical(or internal) view. The users view, of a database program represents data in a format that is meaningful to a user and to the software programs that process those data. One strength of a DBMS is that while there is typically only one conceptual (or logical) and physical (or Internal) view of the data, there can be an endless number of different External views. This feature allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. Thus the logical view refers to the way user views data, and the physical view to the way the data are physically stored and processed.

DBMS features and capabilities

Alternatively, and especially in connection with the relational model of database management, the relation between attributes drawn from a specified set of domains can be seen as being primary. For instance, the database might indicate that a car that was originally "red" might fade to "pink" in time, provided it was of some particular "make" with an inferior paint job. Such higher arity relationships provide information on all of the underlying domains at the same time, with none of them being privileged above the others.harsha lodhi

DBMS simple definition

Data base management system is the system in which related data is stored in an "efficient" and "compact" manner. Efficient means that the data which is stored in the DBMS is accessed in very quick time and compact means that the data which is stored in DBMS covers very less space in computer's memory. In above definition the phrase "related data" is used which means that the data which is stored in DBMS is about some particular topic. Throughout recent history specialized databases have existed for scientific, geospatial, imaging, document storage and like uses. Functionality drawn from such applications has lately begun appearing in mainstream DBMSs as well. However, the main focus there, at least when aimed at the commercial data processing market, is still on descriptive attributes on repetitive record structures. Thus, the DBMSs of today roll together frequently needed services or features of attribute management. By externalizing such functionality to the DBMS, applications effectively share code with each other and are relieved of much internal complexity. Features commonly offered by database management systems include: Query ability Querying is the process of requesting attribute information from various perspectives and combinations of factors. Example: "How many 2-door cars in Texas are green?" A

database query language and report writer allow users to interactively interrogate the database, analyze its data and update it according to the users privileges on data. Backup and replication Copies of attributes need to be made regularly in case primary disks or other equipment fails. A periodic copy of attributes may also be created for a distant organization that cannot readily access the original. DBMS usually provide utilities to facilitate the process of extracting and disseminating attribute sets. When data is replicated between database servers, so that the information remains consistent throughout the database system and users cannot tell or even know which server in the DBMS they are using, the system is said to exhibit replication transparency. Rule enforcement Often one wants to apply rules to attributes so that the attributes are clean and reliable. For example, we may have a rule that says each car can have only one engine associated with it (identified by Engine Number). If somebody tries to associate a second engine with a given car, we want the DBMS to deny such a request and display an error message. However, with changes in the model specification such as, in this example, hybrid gas-electric cars, rules may need to change. Ideally such rules should be able to be added and removed as needed without significant data layout redesign. Security Often it is desirable to limit who can see or change which attributes or groups of attributes. This may be managed directly by individual, or by the assignment of individuals and privileges to groups, or (in the most elaborate models) through the assignment of individuals and groups to roles which are then granted entitlements. Computation There are common computations requested on attributes such as counting, summing, averaging, sorting, grouping, cross-referencing, etc. Rather than have each computer application implement these from scratch, they can rely on the DBMS to supply such calculations. Change and access logging Often one wants to know who accessed what attributes, what was changed, and when it was changed. Logging services allow this by keeping a record of access occurrences and changes. Automated optimization If there are frequently occurring usage patterns or requests, some DBMS can adjust themselves to improve the speed of those interactions. In some cases the DBMS will merely provide tools to monitor performance, allowing a human expert to make the necessary adjustments after reviewing the statistics collected.

Meta-data repository
Metadata is data describing data. For example, a listing that describes what attributes are allowed to be in data sets is called "meta-information". The meta-data is also known as data about data.

Current trends

In 1998, database management was in need of new style databases to solve current database management problems. Researchers realized that the old trends of database management were becoming too complex and there was a need for automated configuration and management. Surajit Chaudhuri, Gerhard Weikum and Michael Stonebraker, were the pioneers that dramatically affected the thought of database management systems. They believed that database management needed a more modular approach and that there are so many specifications needs for various users. Since this new development process of database management we currently have endless possibilities. Database management is no longer limited to monolithic entities. Many solutions have developed to satisfy individual needs of users. Development of numerous database options has created flexible solutions in database management. Today there are several ways database management has affected the technology world as we know it. Organizations demand for directory services has become an extreme necessity as organizations grow. Businesses are now able to use directory services that provided prompt searches for their company information. Mobile devices are not only able to store contact information of users but have grown to bigger capabilities. Mobile technology is able to cache large information that is used for computers and is able to display it on smaller devices. Web searches have even been affected with database management. Search engine queries are able to locate data within the World Wide Web. Retailers have also benefited from the developments with data warehousing. These companies are able to record customer transactions made within their business. Online transactions have become tremendously popular with the e-business world. Consumers and businesses are able to make payments securely on company websites. None of these current developments would have been possible without the evolution of database management. Even with all the progress and current trends of database management, there will always be a need for new development as specifications and needs grow. As the speeds of consumer internet connectivity increase, and as data availability and computing become more ubiquitous, database are seeing migration to web services. Web based languages such as XML and PHP are being used to process databases over web based services. These languages allow databases to live in "the cloud." As with many other products, such as Google's GMail, Microsoft's Office 2010, and Carbonite's online backup services, many services are beginning to move to web based services due to increasing internet reliability, data storage efficiency, and the lack of a need for dedicated IT staff to manage the hardware. Faculty at Rochester Institute of Technology published a paper regarding the use of databases in the cloud and state that their school plans to add cloud based database computing to their curriculum to "keep [their] information technology (IT) curriculum at the forefront of technology The advantages of DBMS are as follows: -Controlling redundancy -Providing storage structure for efficient query processing. -Restricting unauthorized users. -Providing concurrency. -Providing backup and recovery. -Enforcing integrity constraints. The disadvantages are as follows: -Centralization: That is use of the same program at a time by many user sometimes lead to loss

of some data. -High cost of software. Disadvantages of DBMS: Problem Associate with centralized Cost of software, hardware and migration Complexity of backup and recover

Database Administrator- A database administrator (DBA) is a person responsible for the

design, implementation, maintenance and repair of an organization's database. They are also known by the titles Database Coordinator or Database Programmer, and is closely related to the Database Analyst, Database Modeler, Programmer Analyst, and Systems Manager. The role includes the development and design of database strategies, monitoring and improving database performance and capacity, and planning for future expansion requirements. They may also plan, co-ordinate and implement security measures to safeguard the database. Employing organizations may require that a database administrator have a certification or degree for database systems (for example, the Microsoft Certified Database Administrator). A database administrator may administer databases remotely, using a remote database administration client program, such as Microsoft SQL Management Console, that enables them to connect to the server system to both monitor and manage the database server software. DBA Responsibilities

Installation, configuration and upgrading of Microsoft SQL Server/MySQL/Oracle server software and related products. Evaluate MSSQL/MySQL/Oracle features and MSSQL/MySQL/Oracle related products. Establish and maintain sound backup and recovery policies and procedures. Take care of the Database design and implementation. Implement and maintain database security (create and maintain users and roles, assign privileges). Database tuning and performance monitoring. Application tuning and performance monitoring. Setup and maintain documentation and standards. Plan growth and changes (capacity planning). Work as part of a team and provide 724 supports when required. Do general technical trouble shooting and give consultation to development teams. Interface with MSSQL/MySQL/Oracle for technical support. ITIL Skill set requirement (Problem Management/Incident Management/Chain Management etc)

Data Models- A data model in software engineering is an abstract model that describes how
data are represented and accessed. Data models formally define data elements and relationships among data elements for a domain of interest. According to Hoberman (2009), "A data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to

precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment." A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language. Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules on a data model can be interpreted only one way and are not ambiguous. A data model can be sometimes referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

The role of data models

How data models deliver benefit. The main aim of data models is to support the development of information systems by providing the definition and format of data. According to West and Fowler (1999) "if this is done consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data. The results of this are indicated above. However, systems and interfaces often cost more than they should, to build, operate, and maintain. They may also constrain the business rather than support it. A major cause is that the quality of the data models implemented in systems and interfaces is poor".

"Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces".

"Entity types are often not identified, or incorrectly identified. This can lead to replication of data, data structure, and functionality, together with the attendant costs of that duplication in development and maintenance". "Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25-70% of the cost of current systems". "Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardised. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper".

The reason for these problems is a lack of standards that will ensure that data models will both meet business needs and be consistent.

Three perspectives

The ANSI/SPARC three level architecture. This shows that a data model can be an external model (or view), a conceptual model, or a physical model. This is not the only way to look at data models, but it is a useful way, particularly when comparing models.[ A data model instance may be one of three kinds according to ANSI in 1975:

Conceptual schema : describes the semantics of a domain, being the scope of the model. For example, it may be a model of the interest area of an organization or industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationships assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial 'language' with a scope that is limited by the scope of the model. The use of conceptual schema has evolved to become a powerful communication tool with business users. Often called a subject area model (SAM) or high-level data model (HDM), this model is used to communicate core data concepts, rules, and definitions to a business user as part of an overall application development or enterprise initiative. The number of objects should be very small and focused on key concepts. Try to limit this model to one page, although for extremely large organizations or complex projects, the model might span two or more pages.

Logical schema : describes the semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object oriented classes, and XML tags, among other things. Physical schema : describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like.

The significance of this approach, according to ANSI, is that it allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual model. The table/column structure can change without (necessarily) affecting the conceptual model. In each case, of course, the structures must remain consistent with the other model. The table/column structure may be different from a direct translation of the entity classes and attributes, but it must ultimately carry out the objectives of the conceptual entity class structure. Early phases of many software development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into physical data model. However, it is also possible to implement a conceptual model directly.

Types of data models

Database model
A database model is a theory or specification describing how a database is structured and used. Several such models have been suggested. Common models include:

Flat model

Hierarchical modelNetwork model

Relational model

Flat model: This may not strictly qualify as a data model. The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another. Hierarchical model: In this model data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list. Network model: This model organizes data using two fundamental constructs, called records and sets. Records contain fields, and sets define one-to-many relationships between records: one owner, many members. Relational model: is a database model based on first-order predicate logic. Its core idea is to describe a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.

Concept-oriented modelStar schema

Object-relational model: Similar to a relational database model, but objects, classes and inheritance are directly supported in database schemas and in the query language. Star schema is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.

Data Structure Diagram

Example of a Data Structure Diagram. A data structure diagram (DSD) is a diagram and data model used to describe conceptual data models by providing graphical notations which document entities and their relationships, and the constraints that binds them. The basic graphic elements of DSDs are boxes, representing entities, and arrows, representing relationships. Data structure diagrams are most useful for documenting complex data entities.

Data structure diagrams are an extension of the entity-relationship model (ER model). In DSDs, attributes are specified inside the entity boxes rather than outside of them, while relationships are drawn as boxes composed of attributes which specify the constraints that bind entities together. The E-R model, while robust, doesn't provide a way to specify the constraints between relationships, and becomes visually cumbersome when representing entities with several attributes. DSDs differ from the ER model in that the ER model focuses on the relationships between different entities, whereas DSDs focus on the relationships of the elements within an entity and enable users to fully see the links and relationships between each entity. There are several styles for representing data structure diagrams, with the notable difference in the manner of defining cardinality. The choices are between arrow heads, inverted arrow heads (crow's feet), or numerical representation of the cardinality.

Entity-relationship model
An entity-relationship model is an abstract conceptual data model (or semantic data model) used in software engineering to represent structured data. Entity relationship models (ERMs) produce a conceptual data model of a system, and its requirements in a top-down fashion. There are several notations for data modeling. The actual model is frequently called "Entity relationship model", because it depicts data in terms of the entities and relationships described in the data.

Geographic data model

A data model in Geographic information systems is a mathematical construct for representing geographic objects or surfaces as data. For example,

the vector data model represents geography as collections of points, lines, and polygons; the raster data model represent geography as cell matrixes that store numeric values; and the Triangulated irregular network (TIN) data model represents geography as sets of contiguous, nonoverlapping triangles.

Groups relate to process of NGMDB data model making a map applications

NGMDB databases linked together

Representing 3D map information

Generic data model

Generic data models are generalizations of conventional data models. They define standardised general relation types, together with the kinds of things that may be related by such a relation

type. Generic data models are developed as an approach to solve some shortcomings of conventional data models. For example, different modelers usually produce different conventional data models of the same domain. This can lead to difficulty in bringing the models of different people together and is an obstacle for data exchange and data integration. Invariably, however, this difference is attributable to different levels of abstraction in the models and differences in the kinds of facts that can be instantiated (the semantic expression capabilities of the models). The modelers need to communicate and agree on certain elements which are to be rendered more concretely, in order to make the differences less significant.

emantic data model

Semantic data models A semantic data model in software engineering is a technique to define the meaning of data within the context of its interrelationships with other data. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. A semantic data model is sometimes called a conceptual data model. The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS. Therefore, the need to define data from a conceptual view has led to the development of semantic data modeling techniques. That is, techniques to define the meaning of data within the context of its interrelationships with other data. As illustrated in the figure. The real world, in terms of resources, ideas, events, etc., are symbolically defined within physical data stores. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.

Data model topics

Data architecture
Data architecture is the design of data for use in defining the target state and the subsequent planning needed to hit the target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.

A data architecture describes the data structures used by a business and/or its applications. There are descriptions of data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations etc. Essential to realizing the target state, Data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.

Data modeling

The data modeling process. Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database. The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the start point for interface or database design.

Data properties
Some important properties of data for which requirements need to be met are:

definition-related properties o relevance: the usefulness of the data in the context of your business. o clarity: the availability of a clear and shared definition for the data. o consistency: the compatibility of the same type of data from different sources.

Some important properties of data.

content-related properties o timeliness: the availability of data at the time required and how up to date that data is. o accuracy: how close to the truth the data is. properties related to both definition and content o completeness: how much of the required data is available. o accessibility: where, how, and to whom the data is available or not available (e.g. security). o cost: the cost incurred in obtaining the data, and making it available for use.

Data organization
Another kind of data model describes how to organize data using a database management system or other data management technology. It describes, for example, relational tables and columns or object-oriented classes and attributes. Such a data model is sometimes referred to as the physical data model, but in the original ANSI three schema architecture, it is called "logical". In that architecture, the physical model describes the storage media (cylinders, tracks, and tablespaces). Ideally, this model is derived from the more conceptual data model described above. It may differ, however, to account for constraints like processing capacity and usage patterns. While data analysis is a common term for data modeling, the activity actually has more in common with the ideas and methods of synthesis (inferring general concepts from particular instances) than it does with analysis (identifying component concepts from more general ones). {Presumably we call ourselves systems analysts because no one can say systems synthesists.} Data modeling strives to bring the data structures of interest together into a cohesive, inseparable, whole by eliminating unnecessary data redundancies and by relating data structures with relationships. A different approach is through the use of adaptive systems such as artificial neural networks that can autonomously create implicit models of data.

Data structure

A binary tree, a simple type of branching linked data structure. A data structure is a way of storing data in a computer so that it can be used efficiently. It is an organization of mathematical and logical concepts of data. Often a carefully chosen data structure will allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type. A data model describes the structure of the data within a given domain and, by implication, the underlying structure of that domain itself. This means that a data model in fact specifies a dedicated grammar for a dedicated artificial language for that domain. A data model represents classes of entities (kinds of things) about which a company wishes to hold information, the attributes of that information, and relationships among those entities and (often implicit) relationships among those attributes. The model describes the organization of the data to some extent irrespective of how data might be represented in a computer system. The entities represented by a data model can be the tangible entities, but models that include such concrete entity classes tend to change over time. Robust data models often identify abstractions of such entities. For example, a data model might include an entity class called "Person", representing all the people who interact with an organization. Such an abstract entity class is typically more appropriate than ones called "Vendor" or "Employee", which identify specific roles played by those people.


Hash table

Linked list Stack (data structure)

Data model theory

The term data model can have two meanings:

1. A data model theory, i.e. a formal description of how data may be structured and accessed. 2. A data model instance, i.e. applying a data model theory to create a practical data model instance for some particular application. A data model theory has three main components:

The structural part: a collection of data structures which are used to create databases representing the entities or objects modeled by the database. The integrity part: a collection of rules governing the constraints placed on these data structures to ensure structural integrity. The manipulation part: a collection of operators which can be applied to the data structures, to update and query the data contained in the database.

For example, in the relational model, the structural part is based on a modified concept of the mathematical relation; the integrity part is expressed in first-order logic and the manipulation part is expressed using the relational algebra, tuple calculus and domain calculus. A data model instance is created by applying a data model theory. This is typically done to solve some business enterprise requirement. Business requirements are normally captured by a semantic logical data model. This is transformed into a physical data model instance from which is generated a physical database. For example, a data modeler may use a data modeling tool to create an entity-relationship model of the corporate data repository of some business enterprise. This model is transformed into a relational model, which in turn generates a relational database.

Patterns are common data modeling structures that occur in many data models.

Related models

Data flow diagram

Data Flow Diagram example. A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. It differs from the flowchart as it shows the data flow instead of the control flow of the program. A data flow diagram can also be used for the visualization of data

processing (structured design). Data flow diagrams were invented by Larry Constantine, the original developer of structured design, based on Martin and Estrin's "data flow graph" model of computation. It is common practice to draw a context-level Data flow diagram first which shows the interaction between the system and outside entities. The DFD is designed to show how a system is divided into smaller portions and to highlight the flow of data between those parts. This context-level Data flow diagram is then "exploded" to show more detail of the system being modeled.

Information model

Example of an EXPRESS G Information model. An Information model is not a type of data model, but more or less an alternative model. Within the field of software engineering both a data model and an information model can be abstract, formal representations of entity types that includes their properties, relationships and the operations that can be performed on them. The entity types in the model may be kinds of realworld objects, such as devices in a network, or they may themselves be abstract, such as for the entities used in a billing system. Typically, they are used to model a constrained domain that can be described by a closed set of entity types, properties, relationships and operations. According to Lee (1999) an information model is a representation of concepts, relationships, constraints, rules, and operations to specify data semantics for a chosen domain of discourse. It can provide sharable, stable, and organized structure of information requirements for the domain context.[20] More in general the term information model is used for models of individual things, such as facilities, buildings, process plants, etc. In those cases the concept is specialised to Facility Information Model, Building Information Model, Plant Information Model, etc. Such an information model is an integration of a model of the facility with the data and documents about the facility.

An information model provides formalism to the description of a problem domain without constraining how that description is mapped to an actual implementation in software. There may be many mappings of the information model. Such mappings are called data models, irrespective of whether they are object models (e.g. using UML), entity relationship models or XML schemas.

Dcument Object Model, a standard object model for representing HTML or XML.

Object model
An object model in computer science is a collection of objects or classes through which a program can examine and manipulate some specific parts of its world. In other words, the objectoriented interface to some service or system. Such an interface is said to be the object model of the represented service or system. For example, the Document Object Model (DOM) is a collection of objects that represent a page in a web browser, used by script programs to examine and dynamically change the page. There is a Microsoft Excel object model for controlling Microsoft Excel from another program, and the ASCOM Telescope Driver is an object model for controlling an astronomical telescope. In computing the term object model has a distinct second meaning of the general properties of objects in a specific computer programming language, technology, notation or methodology that uses them. For example, the Java object model, the COM object model, or the object model of OMT. Such object models are usually defined using concepts such as class, message, inheritance, polymorphism, and encapsulation. There is an extensive literature on formalized object models as a subset of the formal semantics of programming languages.

Object Role Model

Object Role Modeling (ORM) is a method for conceptual modeling, and can be used as a tool for information and rules analysis.

Object Role Modeling is a fact-oriented method for performing systems analysis at the conceptual level. The quality of a database application depends critically on its design. To help ensure correctness, clarity, adaptability and productivity, information systems are best specified first at the conceptual level, using concepts and language that people can readily understand. The conceptual design may include data, process and behavioral perspectives, and the actual DBMS used to implement the design might be based on one of many logical data models (relational, hierarchic, network, object-oriented etc.).

Unified Modeling Language models

The Unified Modeling Language (UML) is a standardized general-purpose modeling language in the field of software engineering. It is a graphical language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. The Unified Modeling Language offers a standard way to write a system's blueprints, including:[26]

Conceptual things such as business processes and system functions Concrete things such as programming language statements, database schemas, and Reusable software components.

UML offers a mix of functional models, data models, and database models. Primary Key- A table can have at most one primary key, but more than one unique key. A primary key is a combination of columns which uniquely specify a row. It is a special case of unique keys. One difference is that for unique keys the implicit NOT NULL constraint is not automatically enforced, while for primary keys it is enforced. Thus, the values in unique key columns may or may not be NULL. Another difference is that primary keys must be defined using another syntax. Thus Primary Key column allows no row having NULL while Unique Key column allows only one row having null value. The relational model, as expressed through relational calculus and relational algebra, does not distinguish between primary keys and other kinds of keys. Primary keys were added to the SQL standard mainly as a convenience to the application programmer. Super & Candidate Key- A superkey is defined in the relational model of database organization as a set of attributes of a relation variable (relvar) for which it holds that in all relations assigned to that variable, there are no two distinct tuples (rows) that have the same values for the attributes in this set. Equivalently a superkey can also be defined as a set of attributes of a relvar upon which all attributes of the relvar are functionally dependent. Note that if attribute set K is a superkey of relvar R, then at all times it is the case that the projection of R over K has the same cardinality asR itself.

Informally, a superkey is a set of columns within a table whose values can be used to uniquely identify a row. A candidate key is a minimal set of columns necessary to identify a row, this is also called a minimal superkey. For example, given an employee table, consisting of the columns employeeID, name, job, and departmentID, we could use the employeeID in combination with any or all other columns of this table to uniquely identify a row in the table. Examples of superkeys in this table would be {employeeID, Name}, {employeeID, Name, job}, and {employeeID, Name, job, departmentID}. In a real database we do not need values for all of those columns to identify a row. We only need, per our example, the set {employeeID}. This is a minimal superkey that is, a minimal set of columns that can be used to identify a single row. So, employeeID is a candidate key. Genralization- A generalization of a concept is an extension of the concept to less-specific criteria. It is a foundational element of logic and human reasoning.[citation needed] Generalizations posit the existence of a domain or set of elements, as well as one or more common characteristics shared by those elements. As such, it is the essential basis of all valid deductive inferences. The process of verification is necessary to determine whether a generalization holds true for any given situation. The concept of generalization has broad application in many related disciplines, sometimes having a specialized context-specific meaning. Of any two related concepts, such as A and B, A is considered a "generalization" of concept B if and only if:

every instance of concept B is also an instance of concept A; and there are instances of concept A which are not instances of concept B.

Aggregation- Aggregation may refer to uses in:

Business and economics:

Aggregation problem (economics)

Purchasing aggregation, the joining of multiple purchasers in a group purchasing organization to increase their buying power

Computer science and telecommunication:

Aggregate function, a function in data processing Aggregation, a form of object composition in object-oriented programming

Aggregation (object-oriented programming), a form of object composition in object-oriented programming

Link aggregation, using multiple Ethernet network cables/ports in parallel to increase link speed

Packet aggregation, joining multiple data packets for transmission as a single unit to increase network efficiency

Route aggregation, the process of forming a supernet in computer networking

Natural sciences and statistics:

Aggregation of soil granules to form soil structure

Particle aggregation, direct mutual attraction between particles (atoms or molecules) via van der Waals forces or chemical bonding* The accumulation of platelets to the site of a wound to form a platelet plug or a thrombus

Flocculation, a process where a solute comes out of solution in the form of floc or

flakes Statistical aggregation, where the variance of a distribution is higher than expected.

UNIT-2 Relational Database Management System- A relational database matches data by using common characteristics found within the data set. The resulting groups of data are organized and are much easier for many people to understand. For example, a data set containing all the real-estate transactions in a town can be grouped by the year the transaction occurred; or it can be grouped by the sale price of the transaction; or it can be grouped by the buyer's last name; and so on. Such a grouping uses the relational model (a technical term for this is schema). Hence, such a database is called a "relational database." The software used to do this grouping is called a relational database management system (RDBMS). The term "relational database" often refers to this type of software. Relational databases are currently the predominant choice in storing financial records, manufacturing and logistical information, personnel data and much more. Terminology The term relational database was originally defined and coined by Edgar Codd at IBM Almaden Research Center in 1970.

Relational database terminology. Relational database theory uses a set of mathematical terms, which are roughly equivalent to SQL database terminology. The table below summarizes some of the most important relational database terms and their SQL database equivalents. Relational term relation, base relvar derived relvar tuple attribute SQL equivalent table view, query result, result set row column

Relations or Tables A relation is defined as a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints. The relational model specifies that the tuples of a relation have no specific order and that the tuples, in turn, impose no order on the attributes. Applications access data by specifying queries, which use operations such as select to identify tuples, project to identify attributes, and join to combine relations. Relations can be modified using the insert, delete, and update operators. New tuples can supply explicit values or be derived from a query. Similarly, queries identify tuples for updating or deleting. It is necessary for each tuple of a relation to be uniquely identifiable by some combination (one or more) of its attribute values. This combination is referred to as the primary key. Base and derived relations In a relational database, all data are stored and accessed via relations. Relations that store data are called "base relations", and in implementations are called "tables". Other relations do not store data, but are computed by applying relational operations to other relations. These relations are sometimes called "derived relations". In implementations these are called "views" or "queries". Derived relations are convenient in that though they may grab information from several relations, they act as a single relation. Also, derived relations can be used as an abstraction layer. Domain A domain describes the set of possible values for a given attribute, and can be considered a constraint on the value of the attribute. Mathematically, attaching a domain to an attribute means that any value for the attribute must be an element of the specified set. The character data value 'ABC', for instance, is not in the integer domain. The integer value 123, satisfies the domain constraint. Constraints Constraints allow you to further restrict the domain of an attribute. For instance, a constraint can restrict a given integer attribute to values between 1 and 10. Constraints provide one method of implementing business rules in the database. SQL implements constraint functionality in the form of check constraints. Constraints restrict the data that can be stored in relations. These are usually defined using expressions that result in a boolean value, indicating whether or not the data satisfies the constraint. Constraints can apply to single attributes, to a tuple (restricting combinations of attributes) or to an entire relation.

Since every attribute has an associated domain, there are constraints (domain constraints). The two principal rules for the relational model are known as entity integrity and referential integrity. Primary keys A primary key uniquely defines a relationship within a database. In order for an attribute to be a good primary key it must not repeat. While natural attributes are sometimes good primary keys, Surrogate keys are often used instead. A surrogate key is an artificial attribute assigned to an object which uniquely identifies it (for instance, in a table of information about students at a school they might all be assigned a Student ID in order to differentiate them). The surrogate key has no intrinsic meaning, but rather is useful through its ability to uniquely identify a tuple. Another common occurrence, especially in regards to N:M cardinality is the composite key. A composite key is a key made up of two or more attributes within a table that (together) uniquely identify a record. (For example, in a database relating students, teachers, and classes. Classes could be uniquely identified by a composite key of their room number and time slot, since no other class could have that exact same combination of attributes. In fact, use of a composite key such as this can be a form of data verification, albeit a weak one.) Foreign keys A foreign key is a reference to a key in another relation, meaning that the referencing table has, as one of its attributes, the values of a key in the referenced table. Foreign keys need not have unique values in the referencing relation. Foreign keys effectively use the values of attributes in the referenced relation to restrict the domain of one or more attributes in the referencing relation. A foreign key could be described formally as: "For all tables in the referencing relation projected over the referencing attributes, there must exist a table in the referenced relation projected over those same attributes such that the values in each of the referencing attributes match the corresponding values in the referenced attributes." Stored procedures A stored procedure is executable code that is associated with, and generally stored in, the database. Stored procedures usually collect and customize common operations, like inserting a tuple into a relation, gathering statistical information about usage patterns, or encapsulating complex business logic and calculations. Frequently they are used as an application programming interface (API) for security or simplicity. Implementations of stored procedures on SQL DBMSs often allow developers to take advantage of procedural extensions (often vendorspecific) to the standard declarative SQL syntax. Stored procedures are not part of the relational database model, but all commercial implementations include them. Indices

An index is one way of providing quicker access to data. Indices can be created on any combination of attributes on a relation. Queries that filter using those attributes can find matching tuples randomly using the index, without having to check each tuple in turn. This is analogous to using the index of a book to go directly to the page on which the information you are looking for is found i.e. you do not have to read the entire book to find what you are looking for. Relational databases typically supply multiple indexing techniques, each of which is optimal for some combination of data distribution, relation size, and typical access pattern. B+ trees, Rtrees, and bitmaps. Indices are usually not considered part of the database, as they are considered an implementation detail, though indices are usually maintained by the same group that maintains the other parts of the database. Relational operations Queries made against the relational database, and the derived relvars in the database are expressed in a relational calculus or a relational algebra. In his original relational algebra, Codd introduced eight relational operators in two groups of four operators each. The first four operators were based on the traditional mathematical set operations:

The union operator combines the tuples of two relations and removes all duplicate tuples from the result. The relational union operator is equivalent to the SQL UNION operator. The intersection operator produces the set of tuples that two relations share in common. Intersection is implemented in SQL in the form of the INTERSECT operator. The difference operator acts on two relations and produces the set of tuples from the first relation that do not exist in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator. The cartesian product of two relations is a join that is not restricted by any criteria, resulting in every tuple of the first relation being matched with every tuple of the second relation. The cartesian product is implemented in SQL as the CROSS JOIN join operator.

The remaining operators proposed by Codd involve special operations specific to relational databases:

The selection, or restriction, operation retrieves tuples from a relation, limiting the results to only those that meet a specific criteria, i.e. a subset in terms of set theory. The SQL equivalent of selection is the SELECT query statement with a WHERE clause. The projection operation retrieves tuples containing only the specified attributes. The join operation defined for relational databases is often referred to as a natural join. In this type of join, two relations are connected by their common attributes. SQL's approximation of a natural join is the INNER JOIN join operator. The relational division operation is a slightly more complex operation, which involves essentially using the tuples of one relation (the dividend) to partition a second relation (the divisor). The relational division operator is effectively the opposite of the cartesian product operator (hence the name).

Other operators have been introduced or proposed since Codd's introduction of the original eight including relational comparison operators and extensions that offer support for nesting and hierarchical data, among others. Relational database management systems Relational databases, as implemented in relational database management systems, have become a predominant choice for the storage of information in new databases used for financial records, manufacturing and logistical information, personnel data and much more. Relational databases have often replaced legacy hierarchical databases and network databases because they are easier to understand and use, even though they are much less efficient. As computer power has increased, the inefficiencies of relational databases, which made them impractical in earlier times, have been outweighed by their ease of use. However, relational databases have been challenged by Object Databases, which were introduced in an attempt to address the objectrelational impedance mismatch in relational database, and XML databases. The three leading commercial relational database vendors are Oracle, Microsoft, and IBM.[2]. The three leading open source implementations are MySQL, PostgreSQL, and SQLite. Integrity Constraints-Integrity constraints are used to ensure accuracy and consistency of data in a relational database. Data integrity is handled in a relational database through the concept of referential integrity. There are many types of integrity constraints that play a role in referential integrity. Codd, initially defined two sets of constraints, but in his second version of the relational model, he came up with five integrity constraints:

Entity integrity

In the relational data model, entity integrity is one of the three inherent integrity rules. Entity integrity is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null [1]. A direct consequence of this integrity rule is that duplicate rows are forbidden in a table. If each value of a primary key must be unique no duplicate rows can logically appear in a table. The NOT NULL characteristic of a primary key ensures that a value can be used to identify all rows in a table. Within relational databases using SQL, entity integrity is enforced by adding a primary key clause to a schema definition. The system enforces Entity Integrity by not allowing operations (INSERT, UPDATE) to produce an invalid primary key. Any operation that is likely to create a duplicate primary key or one containing nulls is rejected. The Entity Integrity ensures that the data that you store remains in the proper format as well as comprehendable.

Referential Integrity

1. Dangling tuples. * Consider a pair of relations r(R) and s(S), and the natural join tex2html_wrap_inline926 .

* There may be a tuple tex2html_wrap_inline928 in r that does not join with any tuple in s. * That is, there is no tuple tex2html_wrap_inline934 in s such that tex2html_wrap_inline938 . * We call this a dangling tuple. * Dangling tuples may or may not be acceptable. 2. Suppose there is a tuple tex2html_wrap_inline940 in the account relation with the value tex2html_wrap_inline942 ``Lunartown, but no matching tuple in the branch relation for the Lunartown branch. This is undesirable, as tex2html_wrap_inline940 should refer to a branch that exists. Now suppose there is a tuple tex2html_wrap_inline946 in the branch relation with tex2html_wrap_inline948 ``Mokan, but no matching tuple in the account relation for the Mokan branch.

This means that a branch exists for which no accounts exist. This is possible, for example, when a branch is being opened. We want to allow this situation. 3. Note the distinction between these two situations: bname is the primary key of branch, while it is not for account. In account, bname is a foreign key, being the primary key of another relation. * Let tex2html_wrap_inline950 and tex2html_wrap_inline952 be two relations with primary keys tex2html_wrap_inline954 and tex2html_wrap_inline956 respectively.

* We say that a subset tex2html_wrap_inline958 of tex2html_wrap_inline960 is a foreign key referencing tex2html_wrap_inline954 in relation tex2html_wrap_inline964 if it is required that for every tuple tex2html_wrap_inline946 in tex2html_wrap_inline968 there must be a tuple tex2html_wrap_inline940 in tex2html_wrap_inline964 such that tex2html_wrap_inline974 * We call these requirements referential integrity constraints. * Also known as subset dependencies, as we require

Domain Integrity

A domain defines the possible values of an attribute. Domain Integrity rules govern these values. In a database system, the domain integrity is defined by:

The datatype and the length The NULL value acceptance The allowable values, through techniques like constraints or rules The default value

1. A domain of possible values should be associated with every attribute. These domain constraints are the most basic form of integrity constraint. They are easy to test for when data is entered.

1. Domain types 1. Attributes may have the same domain, e.g. cname and employee-name. 2. It is not as clear whether bname and cname domains ought to be distinct. 3. At the implementation level, they are both character strings. 4. At the conceptual level, we do not expect customers to have the same names as branches, in general. 5. Strong typing of domains allows us to test for values inserted, and whether queries make sense. Newer systems, particularly object-oriented database systems, offer a rich set of domain types that can be extended easily. 1. The check clause in SQL-92 permits domains to be restricted in powerful ways that most programming language type systems do not permit. 1. The check clause permits schema designer to specify a predicate that must be satisfied by any value assigned to a variable whose type is the domain. 2. Examples: create domain hourly-wage numeric(5,2) constraint wage-value-test check(value >= 4.00) Note that ``constraint wage-value-test is optional (to give a name to the test to signal which constraint is violated). create domain account-number char(10)

For example, if you define the attribute of Age of an Employee entity, as an integer, the value of every instance of that attribute must always be numeric and an integer. If you also define that this attribute must always be positive, then a negative value is forbidden. The value of this attribute being mandatory indicates that the attribute can not be NULL. All of these characteristics form the domain integrity of this attribute. This type of data integrity warrants the following: the identity and purpose of a field is clear and all of the tables in which it appears are properly identified; field definitions are consistent throughout the database; the values of a field are consistent and valid; the types of modifications, comparisons and operators that can be applied to the values in the field are clearly identified. Column Integrity User Defined Integrity SQL- SQL (officially pronounced "S-Q-L" but is often incorrectly pronounced / like "Sequel") often referred to as Structured Query Language,[is a database computer language designed for managing data in relational database management systems (RDBMS), and originally based upon relational algebra. Its scope includes data insert, query, update and delete, schema creation and modification, and data access control. SQL was one of the first languages for Edgar F. Codd's relational model in his influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks" and became the most widely used language for relational databases.

Data types

Each column in an SQL table declares the type(s) that column may contain. ANSI SQL includes the following datatypes. Character strings CHARACTER(n) or CHAR(n) fixed-width n-character string, padded with spaces as needed CHARACTER VARYING(n) or VARCHAR(n) variable-width string with a maximum size of n characters NATIONAL CHARACTER(n) or NCHAR(n) fixed width string supporting an international character set NATIONAL CHARACTER VARYING(n) or NVARCHAR(n) variable-width NCHAR string Bit strings

BIT(n) an array of n bits BIT VARYING(n) an array of up to n bits


INTEGER and SMALLINT FLOAT, REAL and DOUBLE PRECISION NUMERIC(precision, scale) or DECIMAL(precision, scale)

SQL provides a function to round numerics or dates, called TRUNC (in Informix, DB2, PostgreSQL, Oracle and MySQL) or ROUND (in Informix, Sybase, Oracle, PostgreSQL and Microsoft SQL Server) Date and time

DATE for date values (e.g., 2010-05-30) TIME for time values (e.g., 14:55:37). The granularity of the time value is usually a tick (100 nanoseconds). TIME WITH TIME ZONE or TIMESTAMP the same as TIME, but including details about the time zone in question. TIMESTAMP This is a DATE and a TIME put together in one variable (e.g., 201005-30 14:55:37). TIMESTAMP WITH TIME ZONE or TIMESTAMPTZ the same as TIMESTAMP, but including details about the time zone in question.

SQL provides several functions for generating a date / time variable out of a date / time string (TO_DATE, TO_TIME, TO_TIMESTAMP), as well as for extracting the respective members (seconds, for instance) of such variables. The current system date / time of the database server can be called by using functions like NOW. Aggregate Functions- SQL Aggregate functions return a single value, using values in a table column. In this chapter we are going to introduce a new table called Sales, which will have the following columns and data:

OrderID 1 2 3 4 5 6 7

OrderDate 12/22/2005 08/10/2005 07/13/2005 07/15/2005 12/22/2005 10/2/2005 11/03/2005

OrderPrice OrderQuantity CustomerName 160 190 500 420 1000 820 2000 2 2 5 2 4 4 2 Smith Johnson Baldwin Smith Wood Smith Baldwin

The SQL COUNT function returns the number of rows in a table satisfying the criteria specified in the WHERE clause. If we want to count how many orders has made a customer with CustomerName of Smith, we will use the following SQL COUNT expression: SELECT COUNT (*) FROM Sales WHERE CustomerName = 'Smith' Lets examine the SQL statement above. The COUNT keyword is followed by brackets surrounding the * character. You can replace the * with any of the tables columns, and your statement will return the same result as long as the WHERE condition is the same. The result of the above SQL statement will be the number 3, because the customer Smith has made 3 orders in total. If you dont specify a WHERE clause when using COUNT, your statement will simply return the total number of rows in the table, which in our case is 7: SELECT COUNT(*) FROM Sales How can we get the number of unique customers that have ordered from our store? We need to use the DISTINCT keyword along with the COUNT function to accomplish that: SELECT COUNT (DISTINCT CustomerName) FROM Sales The SQL SUM function is used to select the sum of values from numeric column. Using the Sales table, we can get the sum of all orders with the following SQL SUM statement: SELECT SUM(OrderPrice) FROM Sales As with the COUNT function we put the table column that we want to sum, within brackets after

the SUM keyword. The result of the above SQL statement is the number 4990. If we want to know how many items have we sold in total (the sum of OrderQuantity), we need to use this SQL statement: SELECT SUM(OrderQuantity) FROM Sales

The SQL AVG function retrieves the average value for a numeric column. If we need the average number of items per order, we can retrieve it like this: SELECT AVG(OrderQuantity) FROM Sales Of course you can use AVG function with the WHERE clause, thus restricting the data you operate on: SELECT AVG(OrderQuantity) FROM Sales WHERE OrderPrice > 200 The above SQL expression will return the average OrderQuantity for all orders with OrderPrice greater than 200, which is 17/5. The SQL MIN function selects the smallest number from a numeric column. In order to find out what was the minimum price paid for any of the orders in the Sales table, we use the following SQL expression: SELECT MIN(OrderPrice) FROM Sales

The SQL MAX function retrieves the maximum numeric value from a numeric column. The MAX SQL statement below returns the highest OrderPrice from the Sales table: SELECT MAX(OrderPrice) FROM Sales Commands- i) The CREATE TABLE statement is used to create a new database table. Here is how a simple CREATE TABLE statement looks like: CREATE TABLE TableName ( Column1 DataType, Column2 DataType, Column3 DataType, .

) The DataType specified after each column name is a placeholder fro the real data type of the column. The following CREATE TABLE statement creates the Users table we used in one of the first chapters: CREATE TABLE Users ( FirstName CHAR(100), LastName CHAR(100), DateOfBirth DATE ) The CREATE TABLE statement above creates a table with 3 columns FirstName of type CHAR with length of 100 characters, LastName of type CHAR with length of 100 characters and DateOfBirth of type DATE. ii) SQL DROP Statement: The SQL DROP command is used to remove an object from the database. If you drop a table, all the rows in the table is deleted and the table structure is removed from the database. Once a table is dropped we cannot get it back, so be careful while using RENAME command. When a table is dropped all the references to the table will not be valid. Syntax to drop a sql table structure: DROP TABLE table_name; For Example: To drop the table employee, the query would be like DROP TABLE employee; iii) The SQL INSERT INTO clause is used to insert data into a SQL table. The SQL INSERT INTO is frequently used and has the following generic syntax: INSERT INTO Table1 (Column1, Column2, Column3) VALUES (ColumnValue1, ColumnValue2, ColumnValue3 ) The SQL INSERT INTO clause has actually two parts the first specifying the table we are inserting into and giving the list of columns we are inserting values for, and the second specifying the values inserted in the column list from the first part. If we want to enter a new data row into the Users table, we can do it with the following SQL INSERT INTO statement: INSERT INTO Users (FirstName, LastName, DateOfBirth, Email, City) VALUES ('Frank', 'Drummer', '10/08/1955', 'frank.drummer@frankdrummermail.com', 'Seattle')

iv) Using our Users table we will illustrate the SQL DELETE usage. One of the users in the Users table (Stephen Grant) has just left the company, and your boss has asked you to delete his record. How do you do that? Consider the SQL statement below: DELETE FROM Users WHERE LastName = 'Grant' The first line in the SQL DELETE statement above specifies the table that we are deleting the record(s) from. The second line (the WHERE clause) specifies which rows exactly do we delete (in our case all rows which has LastName of Grant). As you can see the DELETE SQL queries have very simple syntax and in fact are very close to the natural language. But wait, there is something wrong with the statement above! The problem is that we have more than one user having last name of Grant, and all users with this last name will be deleted. Because we dont want to do that, we need to find a table field or combination of fields that uniquely identifies the user Stephen Grant. Looking at the Users table an obvious candidate for such a unique field is the Email column (its not likely that different users use one and the same email). Our improved SQL query which deletes only the record of Stephen Grants record will look like this: DELETE FROM Users WHERE Email = 'sgrant@sgrantemail.com' What happens if you dont specify a WHERE clause in your DELETE query? DELETE FROM Users The answer is that all records in the Users table will be deleted. The SQL TRUNCATE statement below will have the exact same effect as the last DELETE statement: TRUNCATE TABLE Users The TRUNCATE statement will delete all rows in the Users table, without deleting the table itself. Be very careful when using DELETE and TRUNCATE, because you cannot undo these statements, and once row(s) are deleted fro your table they are gone forever if you dont have a backup. Applying Column Constraints- Column Constraints If you perform a slash h (\h) on create table, you may have noticed that there are additional notes on how to create a table or column constraint. The portion that applies to a column constraint is as follows: where column_constraint can be:

[CONSTRAINT constraint_name] { NOT NULL | NULL | UNIQUE | PRIMARY KEY | DEFAULT value | CHECK ( condition ) | REFERENCES table [( column )] [ MATCH FULL | MATCH PARTIAL ] [ ON DELETE action ] [ ON UPDATE action ] [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] } There are five column constraints available:

1. 2. 3. 4. 5.

Not Null Unique Primary Key Check References

The definitions for each clause are: NULL | NOT NULL NULL specifies that this column is allowed to have NULL values. This is the default so you don't need to implicitly specify this. This is only allowed as a column constraint, not a table constraint. NOT NULL specifies that this column is not allowed to contain NULL values. Using the constraint CHECK ( column NOT NULL) is the equivalent to using NOT NULL. UNIQUE This column can contain only unique non-repeating values. UNIQUE does not necessarily mean NOT NULL. UNIQUE allows repeating NULL values to be in a column. PRIMARY KEY This column may contain only unique and non-null values. A table or column primary key is restricted to having only one primary key. CHECK condition This constraint defines tests that the column must satisfy for an insert or update operation to succeed on that row. The condition is an expression that must return a boolean result. For column constraint definitions, only one column can be referenced by the CHECK clause. The following clauses apply to reference constraints: REFERENCES reftable ( refcolumn )

The values in this column are checked against the values of another column that this references. reftable - this table contains the data that are compared with. refcolumn - this column is located in the reftable to compare data against. If refcolumn is left empty, then the PRIMARY KEY of the reftable is used.

MATCH FULL | MATCH PARTIAL MATCH FULL rules out foreign key columns that contain NULL values, unless all foreign key columns are NULL. MATCH PARTIAL is not supported, but a default type is. The default allows NULL columns to satisfy the constraint. ON DELETE action When a DELETE is performed on a referenced row in the referenced table, one of these possible actions should likewise be executed: NO ACTION - Produces an error if the foreign key is violated. This is the default if an action is not specified. RESTRICT - Same as NO ACTION. CASCADE - Removes all rows which references the deleted row. SET NULL - Assigns a NULL to all referenced column values. SET DEFAULT - Sets all referenced columns to their default values.

ON UPDATE action When an UPDATE is performed on a referenced row in the referenced table, an action occurs. If a row is updated, but the referenced column is not affected, then the action will not occur. The possible actions that can occur when an UPDATE is applied to a referenced column are the same as with ON DELETE. The only exception is the CASCADE action. CASCADE updates all of the rows which references the updated row. DEFERRABLE | NOT DEFERRABLE DEFERRABLE specifies the constraint to be postponed to the end of the transaction.

NOT DEFERRABLE means that the constraint is not postponed to the end of the transaction. This is the default when DEFERRABLE is not specified. INITIALLY checktime The constraint must be DEFERRABLE for you to specify a check time. The possible check times for a constraint to be deferred are: DEFERRED - postpone constraint checking until the end of the transaction is reached. IMMEDIATE - perform constraint checking after each statement. This is the default when a checktime is not specified.

To create a primary key column constraint on an employees table, use: Example. Creating a Primary Key Constraint CREATE TABLE employees ( emp_id INTEGER PRIMARY KEY, name TEXT );

This is the equivalent to the above operation: CREATE TABLE employees ( emp_id INTEGER, name TEXT, PRIMARY KEY (emp_id) );

This creates a new table with a check column constraint that applies a rule which makes sure that employee identification numbers are greater than 100 and are non-NULL values. It also makes sure that an employee name exists for each employee id:

CREATE TABLE employees ( emp_id INTEGER NOT NULL CHECK (emp_id > 100), name TEXT NOT NULL CHECK (name <> '') ); Note When using the Check option to restrict the columns from containing empty values, there are different ways to express this depending on the data type of the column. You can specify a 0 for integer data type, or a pair of empty single quotes for text. If you use a pair of empty quotes for integer type, then it will automatically convert that into a 0. Keep in mind that when you perform a check condition, the condition must return the same data type as the column value that you are checking. Views- In database theory, a view consists of a stored query accessible as a virtual table composed of the result set of a query. Unlike ordinary tables (base tables) in a relational database, a view does not form part of the physical schema: it is a dynamic, virtual table computed or collated from data in the database. Changing the data in a table alters the data shown in subsequent invocations of the view. Views can provide advantages over tables:

Views can represent a subset of the data contained in a table Views can join and simplify multiple tables into a single virtual table Views can act as aggregated tables, where the database engine aggregates data (sum, average etc) and presents the calculated results as part of the data Views can hide the complexity of data; for example a view could appear as Sales2000 or Sales2001, transparently partitioning the actual underlying table Views take very little space to store; the database contains only the definition of a view, not a copy of all the data it presents Depending on the SQL engine used, views can provide extra security Views can limit the degree of exposure of a table or tables to the outer world

Just as functions (in programming) can provide abstraction, so database users can create abstraction by using views. In another parallel with functions, database users can manipulate nested views, thus one view can aggregate data from other views. Without the use of views the normalization of databases above second normal form would become much more difficult. Views can make it easier to create lossless join decomposition. Just as rows in a base table lack any defined ordering, rows available through a view do not appear with any default sorting. A view is a relational table, and the relational model defines a

table as a set of rows. Since sets are not ordered - by definition - the rows in a view are not ordered, either. Therefore, an ORDER BY clause in the view definition is meaningless. The SQL standard (SQL:2003) does not allow an ORDER BY clause in a subselect in a CREATE VIEW statement, just as it is not allowed in a CREATE TABLE statement. However, sorted data can be obtained from a view, in the same way as any other table - as part of a query statement. Nevertheless, some DBMS (such as Oracle and SQL Server) allow a view to be created with an ORDER BY clause in a subquery, affecting how data is displayed. Indexes and Sequences Unique and Nonunique Indexes Unique indexes guarantee that no two rows of a table have duplicate values in the key column (or columns). For performance reasons, Oracle recommends that unique indexes be created explicitly, and not through enabling a unique constraint on a table. (Unique integrity constraints are enforced by automatically defining an index.) You can create many indexes for a table as long as the combination of columns differs for each index. CREATE INDEX emp_idx1 ON emp (ename, job); CREATE INDEX emp_idx2 ON emp (job, ename); The absence or presence of an index does not require a change in the wording of any SQL statement. An index is merely a fast access path to the data. The query optimizer can use an existing index to build another index. This results in a much faster index build. Index multiple columns A composite index is an index that you create on multiple columns in a table. This can speed retrieval of data if the SQL WHERE clause references all (or the leading portion) of the columns in the index. Therefore, the order of the columns used in the definition is important - the most commonly accessed or most selective columns go first. Rebuilding indexes Although indexes can be modified with ALTER INDEX abc REBUILD it is a commonly held myth about rebuilding indexes that performance will automatically improve. By contrast redesigning an index to suit the SQL queries being run will give measurable results. Function-Based Indexes You can create indexes on functions and expressions that involve columns in the table being indexed. A function-based index precomputes the value of the function or expression and stores it in an index (B-tree or bitmap).

Function-based indexes defined on UPPER(column_name) or LOWER(column_name) can facilitate case-insensitive searches. For example, the following index: CREATE INDEX uppercase_idx ON emp (UPPER(empname)); can facilitate processing queries such as this: SELECT * FROM emp WHERE UPPER(empname) = 'RICHARD'; To use function-based indexes you must gather optimizer statistics. (Not compatible with Rule-based optimization.) If the function is a PL/SQL function or package function, any changes to the function specification will cause the index to be automatically disabled. How Indexes Are Searched Index unique scan used when all columns of a unique (B-tree) index are specified with equality conditions. e.g. name = 'ALEX' Index range scan is used when you specify a wildcard or interval (bounded by a start key and/or end key.) e.g. name LIKE 'AL%' order_id BETWEEN 100 AND 120 order_book_date > SYSDATE - 30 Key Compression Like any form of compression, Key compression can lead to a huge saving in space, letting you store more keys in each index block, which can lead to less I/O and better performance. Although key compression reduces the storage requirements of an index, it can increase the CPU time required to reconstruct the key column values during an index scan. It also incurs some additional storage overhead. Reverse Key Indexes Creating a reverse key index, compared to a standard index, reverses the bytes of each column indexed (except the rowid) while keeping the column order. By reversing the keys of the index, the insertions become distributed across all leaf keys in the index. CREATE INDEX i ON t (my_id) REVERSE; The values 4771, 4772, 4773 in the index are reversed to 1774, 2774, 3774 The more even distribution of "hits" on the various leaf blocks is the RKI's best feature. In a heavy, concurrent insert environment, rather than having everyone wanting access to *the* block, you spread the blocks being hit and hence reduce the potentially expensive buffer busy waits.

The main disadvantage is the inability to perform index range scans as such values are now distributed all over the place, only fetch-by-key or full-index (table) scans can be performed. You can specify the keyword NOREVERSE to REBUILD a reverse-key index into one that is not reverse keyed: Indexes. Rebuilding a reverse-key index without the NOREVERSE keyword produces a rebuilt, reverse-key index. You cannot rebuild a normal index as a reverse key index. You must use the CREATE statement instead. Bitmap Indexes In a bitmap index, a bitmap for each key value is used instead of a list of rowids. Each bit in the bitmap corresponds to a possible rowid. If the bit is set, then it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so the bitmap index provides the same functionality as a regular index even though it uses a different representation internally. If the number of different key values is small, then bitmap indexes are very space efficient. Bitmap indexing is of great benefit to data warehousing applications. Bitmap indexes are good for: Low cardinality columns have a small number of distinct values (compared to the number of rows) e.g. Gender or Marital Status High cardinality columns have large numbers of distinct values (over 100). Bitmap indexes include rows that have NULL values, and can dramatically improve the performance of ad hoc queries. Bitmap indexing efficiently merges indexes that correspond to several conditions in a WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. Unlike traditional a B-tree indexes, Bitmap indexes are typically only a fraction of the size of the indexed data in the table. Bitmap indexes are also not suitable for columns that are primarily queried with less than or greater than comparisons. For example, a salary column that usually appears in WHERE clauses in a comparison to a certain value is better served with a B-tree index. Bitmap indexes are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data. These indexes are primarily intended for decision support in data warehousing applications where users typically query the data rather than update it.

The advantages of bitmap indexes are greatest for low cardinality columns: that is, columns in which the number of distinct values is small compared to the number of rows in the table. (See the Oracle concepts manual for an example of this) Unlike most other types of index, Bitmap indexes include rows that have NULL values. This can be useful for queries such as SELECT COUNT(*) FROM EMP; You can create bitmap indexes local to a partitioned table (not a global index). Bitmap Join Indexes A join index is an index on one table that involves columns of one or more different tables through a join. Similar to the materialized join view, a bitmap join index precomputes the join and stores it as a database object. The difference is that a materialized join view materializes the join into a table while a bitmap join index materializes the join into a bitmap index. See the Oracle concepts manual for a full example. Dimensions Define hierarchical relationships between pairs of columns or column sets. (Typically data warehouse parent-child relationships.) The columns in a dimension can come either from the same table (denormalized) or from multiple tables (fully or partially normalized). To define a dimension over columns from multiple tables, connect the tables using the JOIN clause of CREATE DIMENSION HIERARCHY. Sequences The sequence generator provides a sequential series of numbers. The sequence generator is especially useful for generating unique sequential ID numbers. Individual sequence numbers can be skipped if they were generated and used in a transaction that was ultimately rolled back. A sequence generates a serial list of unique numbers for numeric columns of a database's tables. Sequences simplify application programming by automatically generating unique numerical values for the rows of a single table or multiple tables. For example, assume two users are simultaneously inserting new employee rows into the EMP table. By using a sequence to generate unique employee numbers for the EMPNO column, neither user has to wait for the other to enter the next available employee number. The sequence automatically generates the correct values for each user. Sequence numbers are independent of tables, so the same sequence can be used for one or more tables. After creation, a sequence can be accessed by various users to generate actual sequence numbers. Structure of PL/SQL Statements

All PL/SQL functions and procedures, including packaged procedures and anonymous blocks follow the following basic layout: Structure: PROCEDURE procedure ( parameter1 datatype [DEFAULT default_value1 ], parameter2 datatype [DEFAULT default_value2 ] [,...]) IS DECLARE /* declarations */ BEGIN /* executable code */ EXCEPTION /* error handling */ END; / or for a function: FUNCTION function RETURN datatype IS DECLARE /* declarations */ BEGIN /* executable code */ [RETURN value] EXCEPTION /* error handling */ END; / To create a procedure: CREATE OR REPLACE PROCEDURE procedure IS ... END procedure; / Or a flat file SQL script can contain simply: BEGIN /* executable code */ EXCEPTION /* error handling */

END; Cursor- In database packages, a cursor comprises a control structure for the successive traversal (and potential processing) of records in a result set. Cursors provide a mechanism by which a database client iterates over the records in a database. Using cursors, the client can get, put, and delete database records. Database programmers use cursors for processing individual rows returned by the database system for a query. Cursors address the problem of impedance mismatch, an issue that occurs in many programming languages[citation needed]. Most procedural programming languages do not offer any mechanism for manipulating whole result-sets at once. In this scenario, the application must process rows in a result-set sequentially. Thus one can think of a database cursor as an iterator over the collection of rows in the result set. Several SQL statements do not require the use of cursors. That includes the INSERT statement, for example, as well as most forms of the DELETE and UPDATE statements. Even a SELECT statement may not involve a cursor if it is used in the variation of SELECT INTO. A SELECT INTO retrieves at most a single row directly into the application. Working with cursors This section introduces the ways the SQL:2003 standard defines how to use cursors in applications in embedded SQL. Not all application bindings for relational database systems adhere to that standard, and some (such as CLI or JDBC) use a different interface. A programmer makes a cursor known to the DBMS by using a DECLARE ... CURSOR statement and assigning the cursor a (compulsory) name: DECLARE cursor_name CURSOR FOR SELECT ... FROM ... Before code can access the data, it must open the cursor with the OPEN statement. Directly following a successful opening, the cursor is positioned before the first row in the result set. OPEN cursor_name Programs position cursors on a specific row in the result set with the FETCH statement. A fetch operation transfers the data of the row into the application. FETCH cursor_name INTO ... Once an application has processed all available rows or the fetch operation is to be positioned on a non-existing row (compare scrollable cursors below), the DBMS returns a SQLSTATE '02000' (usually accompanied by an SQLCODE +100) to indicate the end of the result set. The final step involves closing the cursor using the CLOSE statement:

CLOSE cursor_name After closing a cursor, a program can open it again, which implies that the DBMS re-evaluates the same query or a different query and builds a new result-set. Scrollable cursors Programmers may declare cursors as scrollable or not scrollable. The scrollability indicates the direction in which a cursor can move. With a non-scrollable cursor, also known as forward-only, one can FETCH each row at most once, and the cursor automatically moves to the immediately following row. A fetch operation after the last row has been retrieved positions the cursor after the last row and returns SQLSTATE 02000 (SQLCODE +100). A program may position a scrollable cursor anywhere in the result set using the FETCH SQL statement. The keyword SCROLL must be specified when declaring the cursor. The default is NO SCROLL, although different language bindings like JDBC may apply different default. DECLARE cursor_name sensitivity SCROLL CURSOR FOR SELECT ... FROM ... The target position for a scrollable cursor can be specified relative to the current cursor position or absolute from the beginning of the result set. FETCH [ NEXT | PRIOR | FIRST | LAST ] FROM cursor_name FETCH ABSOLUTE n FROM cursor_name FETCH RELATIVE n FROM cursor_name Scrollable cursors can potentially access the same row in the result set multiple times. Thus, data modifications (insert, update, delete operations) from other transactions could have an impact on the result set. A cursor can be SENSITIVE or INSENSITIVE to such data modifications. A sensitive cursor picks up data modifications impacting the result set of the cursor, and an insensitive cursor does not. Additionally, a cursor may be ASENSITIVE, in which case the DBMS tries to apply sensitivity as much as possible. "WITH HOLD" Cursors are usually closed automatically at the end of a transaction, i.e when a COMMIT or ROLLBACK (or an implicit termination of the transaction) occurs. That behavior can be changed if the cursor is declared using the WITH HOLD clause. (The default is WITHOUT HOLD.) A holdable cursor is kept open over COMMIT and closed upon ROLLBACK. (Some DBMS deviate from this standard behavior and also keep holdable cursors open over ROLLBACK.) DECLARE cursor_name CURSOR WITH HOLD FOR SELECT ... FROM ...

When a COMMIT occurs, a holdable cursor is positioned before the next row. Thus, a positioned UPDATE or positioned DELETE statement will only succeed after a FETCH operation occurred first in the transaction. Note that JDBC defines cursors as holdable per default. This is done because JDBC also activates auto-commit per default. Due to the usual overhead associated with auto-commit and holdable cursors, both features should be explicitly deactivated at the connection level. Positioned update/delete statements Cursors can not only be used to fetch data from the DBMS into an application but also to identify a row in a table to be updated or deleted. The SQL:2003 standard defines positioned update and positioned delete SQL statements for that purpose. Such statements do not use a regular WHERE clause with predicates. Instead, a cursor identifies the row. The cursor must be opened and positioned on a row already using the FETCH statement. UPDATE table_name SET ... WHERE CURRENT OF cursor_name DELETE FROM table_name WHERE CURRENT OF cursor_name The cursor must operate on an updatable result set in order to successfully execute a positioned update or delete statement. Otherwise, the DBMS would not know how to apply the data changes to the underlying tables referred to in the cursor. Cursors in distributed transactions Using cursors in distributed transactions (X/Open XA Environments), which are controlled using a transaction monitor, is no different than cursors in non-distributed transactions. One has to pay attention when using holdable cursors, however. Connections can be used by different applications. Thus, once a transaction has been ended and committed, a subsequent transaction (running in a different application) could inherit existing holdable cursors. Therefore, an application developer has to be aware of that situation. Cursors in XQuery The XQuery language allows cursors to be created using the subsequence() function. The format is: let $displayed-sequence := subsequence($result, $start, $item-count) Where $result is the result of the initial XQuery, $start is the item number to start and $itemcount is the number of items to return. Equivalently this can also be done using a predicate: let $displayed-sequence := $result[$start to $end]

Where $end is the end sequence. Disadvantages of cursors The following information may vary from database system to database system. Fetching a row from the cursor may result in a network round trip each time. This uses much more network bandwidth than would ordinarily be needed for the execution of a single SQL statement like DELETE. Repeated network round trips can severely impact the speed of the operation using the cursor. Some DBMSs try to reduce this impact by using block fetch. Block fetch implies that multiple rows are sent together from the server to the client. The client stores a whole block of rows in a local buffer and retrieves the rows from there until that buffer is exhausted. Cursors allocate resources on the server, for instance locks, packages, processes, temporary storage, etc. For example, Microsoft SQL Server implements cursors by creating a temporary table and populating it with the query's result-set. If a cursor is not properly closed (deallocated), the resources will not be freed until the SQL session (connection) itself is closed. This wasting of resources on the server can not only lead to performance degradations but also to failures Trigger- A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for keeping the integrity of the information on the database. For example, when a new record (representing a new worker) is added to the employees table, new records should be created also in the tables of the taxes, vacations, and salaries. Triggers are commonly used to:

prevent changes (e.g. prevent an invoice from being changed after it's been mailed out) log changes (e.g. keep a copy of the old data) audit changes (e.g. keep a log of the users and roles involved in changes) enhance changes (e.g. ensure that every change to a record is time-stamped by the server's clock, not the client's) enforce business rules (e.g. require that every invoice have at least one line item) execute business rules (e.g. notify a manager every time an employee's bank account number changes) replicate data (e.g. store a record of every change, to be shipped to another database later) enhance performance (e.g. update the account balance after every detail transaction, for faster queries)

The examples above are called Data Manipulation Language (DML) triggers because the triggers are defined as part of the Data Manipulation Language and are executed at the time the data are manipulated. Some systems also support non-data triggers, which fire in response to Data Definition Language (DDL) events such as creating tables, or runtime events such as logon, commit, and rollback. Such DDL triggers can be used for auditing purposes. The following are major features of database triggers and their effects:

triggers do not accept parameters or arguments (but may store affected-data in temporary tables) triggers cannot perform commit or rollback operations because they are part of the triggering SQL statement (only through autonomous transactions) triggers can cancel a requested operation triggers can cause mutating table errors

In addition to triggers that fire when data is modified, Oracle 9i supports triggers that fire when schema objects (that is, tables) are modified and when user logon or logoff events occur. These trigger types are referred to as "Schema-level triggers". Schema-level triggers After Creation Before Alter After Alter Before Drop After Drop Before Logoff After Logon The two main types of triggers are: Row Level Trigger Statement Level Trigger Based on the 2 types of classifications, we could have 12 types of triggers. MySQL 5.0.2 introduced support for triggers. Some of the triggers MYSQL supports are Insert Trigger Update Trigger Delete Trigger The SQL:2003 standard mandates that triggers give programmers access to record variables by means of a syntax such as REFERENCING NEW AS n. For example, if a trigger is monitoring for changes to a salary column one could write a trigger like the following: CREATE TRIGGER salary_trigger BEFORE UPDATE ON employee_table REFERENCING NEW ROW AS n, OLD ROW AS o FOR EACH ROW IF n.salary <> o.salary THEN END IF; Procedures- A stored procedure is a subroutine available to applications accessing a relational database system. Stored procedures (sometimes called a proc, sproc, StoPro, StoredProc, or SP) are actually stored in the database data dictionary. Typical uses for stored procedures include data validation (integrated into the database) or access control mechanisms. Furthermore, stored procedures are used to consolidate and centralize logic that was originally implemented in applications. Extensive or complex processing that requires the execution of several SQL statements is moved into stored procedures, and all applications

call the procedures. One can use nested stored procedures, by executing one stored procedure from within another. The maximum level of nesting is 32.[1] Stored procedures are similar to user-defined functions (UDFs). The major difference is that UDFs can be used like any other expression within SQL statements, whereas stored procedures must be invoked using the CALL statement.[2] CALL procedure() or EXECUTE procedure() Stored procedures may return result sets, i.e. the results of a SELECT statement. Such result sets can be processed using cursors by other stored procedures by associating a result set locator, or by applications. Stored procedures may also contain declared variables for processing data and cursors that allow it to loop through multiple rows in a table. The standard Structured Query Language provides IF, WHILE, LOOP, REPEAT, CASE statements, and more. Stored procedures can receive variables, return results or modify variables and return them, depending on how and where the variable is declared. Comparison with dynamic SQL Overhead: Because stored procedure statements are stored directly in the database, they may remove all or part of the compilation overhead that is typically required in situations where software applications send inline (dynamic) SQL queries to a database. (However, most database systems implement "statement caches" and other mechanisms to avoid repetitive compilation of dynamic SQL statements.) In addition, while they avoid some overhead, pre-compiled SQL statements add to the complexity of creating an optimal execution plan because not all arguments of the SQL statement are supplied at compile time. Depending on the specific database implementation and configuration, mixed performance results will be seen from stored procedures versus generic queries or user defined functions. Avoidance of network traffic: A major advantage with stored procedures is that they can run directly within the database engine. In a production system, this typically means that the procedures run entirely on a specialized database server, which has direct access to the data being accessed. The benefit here is that network communication costs can be avoided completely. This becomes particularly important for complex series of SQL statements. Encapsulation of business logic: Stored procedures allow for business logic to be embedded as an API in the database, which can simplify data management and reduce the need to encode the logic elsewhere in client programs. This may result in a lesser likelihood of data becoming corrupted through the use of faulty client programs. The database system can ensure data integrity and consistency with the help of stored procedures.

Delegation of access-rights: In many systems, stored-procedures can be granted access rights to the database which the users who will execute those procedures do not directly have. Some protection from SQL injection attacks: Stored procedures can be used to protect against injection attacks. Stored procedure parameters will be treated as data even if an attacker inserts SQL commands. Also, some DBMSs will check the parameter's type. Comparison with functions

A function is a subprogram written to perform certain computations and return a single value. Functions must return a value (using the RETURN keyword), but for stored procedures this is not compulsory. Stored procedures can use RETURN keyword but without any value being passed. Functions could be used in SELECT statements, provided they dont do any data manipulation. However, procedures cannot be included in SELECT statements. A function can have only IN parameters, while stored procedures may have OUT or INOUT parameters. A stored procedure can return multiple values using the OUT parameter or return no value at all.

Disadvantages Stored procedures are "defined once, used many times." If any changes are necessary, the (one and only one) definition of the stored procedure must be replaced. Dynamic SQL, of course, allows any SQL query to be issued at any time. Any change to a stored procedure instantly impacts every other piece of software, report, etc. (inside or outside of the DBMS) which directly or indirectly refers to it. It is not always possible to determine with certainty exactly what those impacts will be, nor what changes can safely be made without adversely impacting something else. For various reasons, many organizations strictly limit who is allowed to define and issue a query against the database. Programmers and other users may therefore find themselves having no choice but to implement inefficient solutions to their problems using what stored procedures are available to them, whether or not the procedures are appropriate for this particular ancillary task. Though not directly related to stored procedures, the movement of business logic to the DBMS is problematic since it is the layer with the more complex scalability issues. Furthermore, some modern DBMS systems (notably from Microsoft SQL Server 2000 onwards) don't offer any performance benefits of using stored procedures against precompiled queries: they are compiled and cached in the same manner as dynamic SQL. Package- SQL packages are permanent objects that are used to store information related to prepared SQL statements. They are used by open database connectivity (ODBC) support when the Extended Dynamic box is checked on a data source. They are also used by applications that use an API.

Advantages- Because SQL packages are a shared resource, when a statement is prepared, the information is available to all the users of the package. This saves processing time, especially in an environment when many users are using the same or similar statements. Because SQL packages are permanent, this information is also saved across job initiation and end, and is also saved across system restarts. In fact, SQL packages can be saved and restored on other systems. By comparison, dynamic SQL requires that each user go through the preparatory processing for a particular statement, and this must be done every time the user starts the application. SQL packages also allow the system to accumulate statistical information about the SQL statements that result in better decisions about how long to keep cursors open internally and how to best process the data needed for the query. This information is shared across users and retained for future use. In the case of dynamic SQL, this information must be done by every job and every user. Ex. Use the database monitor to log information about SQL processing on the system. It includes the name of the package in the SQL summary records. The following statement shows the package, the SQL operation, and the statement text: SELECT qqc103, qqc21, qq1000 from <db monitor file> For ODBC, you can also look in the job log for the message Extended Dynamic has been disabled to determine if ODBC was unable to use an SQL package.

UNIT 3 Normalization- In the field of relational database design, normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristicsinsertion, update, and deletion anomaliesthat could lead to a loss of data integrity. Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974. Higher normal forms were defined by

other theorists in subsequent years, the most recent being the Sixth normal form (6NF) introduced by Chris Date, Hugh Darwen, and Nikos Lorentzos in 2002. Informally, a relational database table (the computerized representation of a relation) is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF). A standard piece of database design guidance is that the designer should create a fully normalized design; selective denormalization can subsequently be performed for performance reasons. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF. Free the database of modification anomalies

An update anomaly. Employee 519 is shown as having different addresses on different records.

An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his details cannot be recorded.

A deletion anomaly. All information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses. When an attempt is made to modify (update, insert into, or delete from) a table, undesired sideeffects may follow. Not all tables can suffer from these side-effects; rather, the side-effects can

only arise in tables that have not been sufficiently normalized. An insufficiently normalized table might have one or more of the following characteristics:

The same information can be expressed on multiple rows; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfullyif, that is, the employee's address is updated on some records but not othersthen the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly. There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Codethus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly-hired faculty member who has not yet been assigned to teach any courses. This phenomenon is known as an insertion anomaly. There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member. This phenomenon is known as a deletion anomaly.

Minimize redesign when extending the database structure When a fully normalized database structure is extended to allow it to accommodate new types of data, the pre-existing aspects of the database structure can remain largely or entirely unchanged. As a result, applications interacting with the database are minimally affected. Make the data model more informative to users Normalized tables, and the relationship between one normalized table and another, mirror realworld concepts and their interrelationships. Avoid bias towards any particular pattern of querying Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others. For example, consider an online bookseller whose customers maintain wishlists of books they'd like to have. For the obvious, anticipated query -- what books does this customer want? -- it's enough to store the customer's wishlist in the table as, say, a homogeneous string of authors and titles.

With this design, though, the database can answer only that one single query. It cannot by itself answer interesting but unanticipated queries: What is the most-wished-for book? Which customers are interested in WWII espionage? How does Lord Byron stack up against his contemporary poets? Answers to these questions must come from special adaptive tools completely separate from the database. One tool might be software written especially to handle such queries. This special adaptive software has just one single purpose: in effect to normalize the non-normalized field. Unforeseen queries can be answered trivially, and entirely within the database framework, with a no Normal forms The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF. The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n. Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms. The main normal forms are summarized below. Normal form Defined by First normal form Two versions: E.F. Codd (1970), (1NF) C.J. Date (2003) Brief definition Table faithfully represents a relation and has no repeating groups No non-prime attribute in the table is Second normal E.F. Codd (1971) functionally dependent on a part (proper form (2NF) subset) of a candidate key E.F. Codd (1971) Carlo Zaniolo's Every non-prime attribute is nonThird normal form equivalent but differentlytransitively dependent on every key of the (3NF) expressed definition (1982) table Boyce-Codd Raymond F. Boyce and E.F. Codd Every non-trivial functional dependency normal form (1974) in the table is a dependency on a superkey (BCNF) Fourth normal Every non-trivial multivalued dependency Ronald Fagin (1977) form (4NF) in the table is a dependency on a superkey Fifth normal form Ronald Fagin (1979) Every non-trivial join dependency in the

(5NF) Domain/key normal form (DKNF) Ronald Fagin (1981)

Sixth normal form C.J. Date, Hugh Darwen, and (6NF) Nikos Lorentzos (2002)

table is implied by the superkeys of the table Every constraint on the table is a logical consequence of the table's domain constraints and key constraints Table features no non-trivial join dependencies at all (with reference to generalized join operator)

Functional Dependencies- A functional dependency (FD) is a constraint between two sets of attributes in a relation from a database. Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X Y) if and only if each X value is associated with precisely one Y value. Customarily we call X the determinant set and Y the dependent attribute. Thus, given a tuple and the values of the attributes in X, one can determine the corresponding value of the Y attribute. For the purposes of simplicity, given that X and Y are sets of attributes in R, X Y denotes that X functionally determines each of the members of Yin this case Y is known as the dependent set. Thus, a candidate key is a minimal set of attributes that functionally determine all of the attributes in a relation. (Note: the "function" being discussed in "functional dependency" is the function of identification.) A functional dependency FD: X Y is called trivial if Y is a subset of X. The determination of functional dependencies is an important part of designing databases in the relational model, and in database normalization and denormalization. The functional dependencies, along with the attribute domains, are selected so as to generate constraints that would exclude as much data inappropriate to the user domain from the system as possible. For example, suppose one is designing a system to track vehicles and the capacity of their engines. Each vehicle has a unique vehicle identification number (VIN). One would write VIN EngineCapacity because it would be inappropriate for a vehicle's engine to have more than one capacity. (Assuming, in this case, that vehicles only have one engine.) However, EngineCapacity VIN, is incorrect because there could be many vehicles with the same engine capacity.

This functional dependency may suggest that the attribute EngineCapacity be placed in a relation with candidate key VIN. However, that may not always be appropriate. For example, if that functional dependency occurs as a result of the transitive functional dependencies VIN VehicleModel and VehicleModel EngineCapacity then that would not result in a normalized relation. Properties of functional dependencies Given that X, Y, and Z are sets of attributes in a relation R, one can derive several properties of functional dependencies. Among the most important are Armstrong's axioms, which are used in database normalization:

Subset Property (Axiom of Reflexivity): If Y is a subset of X, then X Y Augmentation (Axiom of Augmentation): If X Y, then XZ YZ Transitivity (Axiom of Transitivity): If X Y and Y Z, then X Z

From these rules, we can derive these secondary rules:

Union: If X Y and X Z, then X YZ Decomposition: If X YZ, then X Y and X Z Pseudotransitivity: If X Y and WY Z, then WX Z

Equivalent sets of functional dependencies are called covers of each other. Every set of functional dependencies has a canonical cover. Example This example illustrates the concept of functional dependency. The situation modeled is that of college students visiting one or more lectures in each of which they are assigned a teaching assistant (TA). Let's further assume that every student is in some semester and is identified by an unique integer ID. StudentI Semester D 1234 1201 6 4 Lecture TA

Numerical Methods John Numerical Methods Peter

1234 1201 1201

6 4 4

Visual Computing


Numerical Methods Peter Physics II Simone

We notice that whenever two rows in this table feature the same StudentID, they also necessarily have the same Semester values. This basic fact can be expressed by a functional dependency:

StudentID Semester.

Other nontrivial functional dependencies can be identified, for example:

{StudentID, Lecture} TA {StudentID, Lecture} {TA, Semester}

The latter expresses the fact that the set {StudentID, Lecture} is a superkey of the relation. multivalued dependency is a full constraint between two sets of attributes in a relation. In contrast to the functional dependency, the multivalued dependency requires that certain tuples be present in a relation. Therefore, a multivalued dependency is a special case of tuple-generating dependency. The multivalued dependency plays a role in the 4NF database normalization. A join dependency is a constraint on the set of legal relations over a database scheme. A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T. If one of the tables in the join has all the attributes of the table T, the join dependency is called trivial. The join dependency plays an important role in the Fifth normal form, also known as projectjoin normal form, because it can be proven that if you decompose a scheme R in tables R1 to Rn, the decomposition will be a lossless-join decomposition if you restrict the legal relations onR to a join dependency on R called * (R1,R2,...Rn).

Another way to describe a join dependency is to say that the set of relationships in the join dependency is independent of each other.

UNIT 4 Database transaction comprises a unit of work performed within a database management system (or similar system) against a database, and treated in a coherent and reliable way independent of other transactions. Transactions in a database environment have two main purposes:

1. To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status. 2. To provide isolation between programs accessing a database concurrently. Without isolation the program's outcomes are possibly erroneous. A database transaction, by definition, must be atomic, consistent, isolated and durable. Database practitioners often refer to these properties of database transactions using the acronym ACID. Transactions provide an "all-or-nothing" proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage. Purpose Databases and other data stores which treat the integrity of data as paramount often include the ability to handle transactions to maintain the integrity of data. A single transaction consists of one or more independent units of work, each reading and/or writing information to a database or other data store. When this happens it is often important to ensure that all such processing leaves the database or data store in a consistent state. Examples from double-entry accounting systems often illustrate the concept of transactions. In double-entry accounting every debit requires the recording of an associated credit. If one writes a check for 100 to buy groceries, a transactional double-entry accounting system must record the following two entries to cover the single transaction: 1. 2. Debit 100 to Groceries Expense Account Credit 100 to Checking Account

A transactional system would make both entries or both entries would fail. By treating the recording of multiple entries as an atomic transactional unit of work the system maintains the integrity of the data recorded. In other words, nobody ends up with a situation in which a debit is recorded but no associated credit is recorded, or vice versa.

Transactional databases A 'transactional database is a DBMS where write transactions on the database are able to be rolled back if they are not completed properly (e.g. due to power or connectivity loss). Most modern relational database management systems fall into the category of databases that support transactions. In a database system a transaction might consist of one or more data-manipulation statements and queries, each reading and/or writing information in the database. Users of database systems consider consistency and integrity of data as highly important. A simple transaction is usually issued to the database system in a language like SQL wrapped in a transaction, using a pattern similar to the following: 1. 2. 3. 4. Begin the transaction Execute several data manipulations and queries If no errors occur then commit the transaction and end it If errors occur then rollback the transaction and end it

If no errors occurred during the execution of the transaction then the system commits the transaction. A transaction commit operation applies all data manipulations within the scope of the transaction and persists the results to the database. If an error occurs during the transaction, or if the user specifies a rollback operation, the data manipulations within the transaction are not persisted to the database. In no case can a partial transaction be committed to the database since that would leave the database in an inconsistent state. Internally, multi-user databases store and process transactions, often by using a transaction ID or XID. In SQL SQL is inherently transactional, and a transaction is automatically started when another ends. Some databases extend SQL and implement a START TRANSACTION statement, but while seemingly signifying the start of the transaction it merely deactivates autocommit. The result of any work done after this point will remain invisible to other database-users until the system processes a COMMIT statement. AROLLBACK statement can also occur, which will undo any work performed since the last transaction. Both COMMIT and ROLLBACK will end the transaction, and start a new. If autocommit was disabled using START TRANSACTION, autocommit will often also be reenabled.

Some database systems allow the synonyms BEGIN, BEGIN TRANSACTION, and may have other options available. Distributed transactions


Database systems implement distributed transactions as transactions against multiple applications or hosts. A distributed transaction enforces the ACID properties over multiple systems or data stores, and might include systems such as databases, file systems, messaging systems, and other applications. In a distributed transaction a coordinating service ensures that all parts of the transaction are applied to all relevant systems. As with database and other transactions, if any part of the transaction fails, the entire transaction is rolled back across all affected systems. Transactional filesystems The Namesys Reiser4 filesystem for Linux supports transactions, and as of Microsoft Windows Vista, the Microsoft NTFS filesystem supports distributed transactions across networks.

In concurrency control of databases, transaction processing (transaction management), and various transactional applications, both centralized and distributed, a transaction schedule (history) is serializable, has the Serializability property, if its outcome (the resulting database state, the values of the database's data) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time. Transactions are normally executed concurrently (they overlap), since this is the most efficient way. Serializability is the major correctness criterion for concurrent transactions' executions. It is considered the highest level of isolation between transactions, and plays an essential role in concurrency control. As such it is supported in all general purpose database systems. Strong strict two-phase locking (SS2PL) is a popular serializability mechanism utilized in most of the database systems (in various variants) since their early days in the 1970s. Distributed serializability is the serializability of a schedule of a transactional distributed system (e.g., a distributed database system). With the proliferation of the Internet, Cloud computing, Grid computing, and small, portable, powerful computing devices (e.g., smartphones) the need for effective distributed serializability techniques to ensure correctness in, and among distributed applications seems to increase.Commitment ordering (or Commit ordering; CO; introduced publicly in 1991) is a general serializability technique that allows to effectively achieve distributed serializability (and Global serializability) across

different (any) concurrency control mechanisms, also in a mixed heterogeneous environment with different mechanisms. CO does not interfere with the mechanisms' operations, and also guarantees automatic distributed deadlock resolution. Unlike other distributed serializability mechanisms, CO does not require the (costly) distribution of local concurrency control information (e.g., local precedence relations, locks, timestamps, or tickets), a fact which provides scalability, and typically saves considerable overhead and delays. Thus, CO (including its variants, e.g., SS2PL) is the only known effective general method for distributed serializability (and it is probably the only existing one). The popular SS2PL, which is a special case of CO and inherits many of CO's qualities, is the de-facto standard for distributed serializability (and Global serializability) across multiple (SS2PL based) database systems since the 1980s[3]. CO has been utilized extensively since 1997 as a solution for distributed serializability in works on Transactional processes , and more recently an optimistic version of CO has been proposed as a solution for Grid computing and Cloud computing. Serializability theory provides the formal framework to reason about and analyze serializability and its techniques. Its fundamentals are informally introduced below. Comment: Unless explicitly referenced or linked, most of the material in the following sections is covered in the text books (Bernstein et al. 1987)[1] and (Weikum and Vossen 2001). However, the presentation of Commitment ordering in (Weikum and Vossen 2001, pages 102, 700) is partial and misses CO's essence (see Background in The History of Commitment Ordering).

Database transaction For this discussion a database transaction is a specific intended run (with specific parameters, e.g., with transaction identification, at least) of a computer program (or programs) that accesses a database (or databases). Such a program is written with the assumption that it is running in isolation from other executing programs, i.e., when running, its accessed data (after the access) are not changed by other running programs. Without this assumption the transaction's results are unpredictable and can be wrong. The same transaction can be executed in different situations, e.g., in different times and locations, in parallel with different programs. A live transaction (i.e., exists in a computing environment with already allocated computing resources; to distinguish from a transaction request, waiting to get execution resources) can be in one of three states, or phases: 1. Running - Its program(s) is (are) executing.

2. Ready - Its program's execution has ended, and it is waiting to be Ended (Completed). 3. Ended (or Completed) - It is either Committed or Aborted (Rolled-back), depending whether the execution is considered a success or not, respectively . When committed, all its recoverable (i.e., with states that can be controlled for this purpose), durable resources (typically database data) are put in their final states, states after running. When aborted, all its recoverable resources are put back in their initial states, as before running. Comments: 1. A failure in transaction's computing environment before ending typically results in its abort. However, a transaction may be aborted also for other reasons as well (e.g., see below). 2. Upon being ended (completed), transaction's allocated computing resources are released and the transaction disappears from the computing environment. However, the effects of a committed transaction remain in the database, while the effects of an aborted (rolled-back) transaction disappear from the database. The concept of atomic transaction ("all or nothing" semantics) was designed to exactly achieve this behavior, in order to control correctness in complex faulty systems. Serializability is a property of a transaction schedule (history). It relates to the isolation property of a database transaction. Serializability of a schedule means equivalence (in the outcome, the database state, data values) to a serial schedule (i.e., sequential with no transaction overlap in time) with the same transactions. It is the major criterion for the correctness of concurrent transactions' schedule, and thus supported in all general purpose database systems. The rationale behind serializability is the following: If each transaction is correct by itself, i.e., meets certain integrity conditions, then a schedule that comprises any serial execution of these transactions is correct (its transactions still meet their conditions): "Serial" means that transactions do not overlap in time and cannot interfere with each other, i.e, complete isolation between each other exists. Any order of the transactions is legitimate, if no dependencies among them exist, which is assumed (see comment below). As a result, a schedule that comprises any execution (not necessarily serial) that is equivalent (in its outcome) to any serial execution of these transactions, is correct.

Schedules that are not serializable are likely to generate erroneous outcomes. Well known examples are with transactions that debit and credit accounts with money: If the related schedules are not serializable, then the total sum of money may not be preserved. Money could disappear, or be generated from nowhere. This and violations of possibly needed other invariant preservations are caused by one transaction writing, and "stepping on" and erasing what has been written by another transaction before it has become permanent in the database. It does not happen if serializability is maintained. Comment: If any specific order between some transactions is requested by an application, then it is enforced independently of the underlying serializability mechanisms. These mechanisms are typically indifferent to any specific order, and generate some unpredictable partial orderthat is typically compatible with multiple serial orders of these transactions. This partial order results from the scheduling orders of concurrent transactions' data access operations, which depend on many factors. Correctness - recoverability A major characteristic of a database transaction is atomicity, which means that it either commits, i.e., all its operations' results take effect in the database, or aborts (rolledback), all its operations' results do not have any effect on the database ("all or nothing" semantics of a transaction). In all real systems transactions can abort for many reasons, and serializability by itself is not sufficient for correctness. Schedules also need to possess the recoverability property. Recoverability means that committed transactions have not read data written by aborted transactions (whose effects do not exist in the resulting database states). While serializability is currently compromised on purpose in many applications for better performance (only in cases when application's correctness is not harmed), compromising recoverability would quickly violate the database's integrity, as well as that of transactions' results external to the database. A schedule with the recoverability property (a recoverable schedule) "recovers" from aborts by itself, i.e., aborts do not harm the integrity of its committed transactions and resulting database. This is untrue without recoverability where the likely integrity violations (resulting incorrect database data) need special, typically manual, corrective actions in the database. Implementing recoverability in its general form may result in cascading aborts: Aborting one transaction may result in a need to abort a second transaction, and then a third, and so on. This results in a waste of already partially executed transactions, and may result also in a performance penalty. Avoiding cascading aborts (ACA, or Cascadelessness) is a special case of recoverability that exactly prevents such phenomenon. Often in practice a

special case of ACA is utilized: Strictness. Strictness allows an efficient database recovery from failure. Comment: Note that the recoverability property is needed even if no database failure occurs and no database recovery from failure is needed. It is rather needed to correctly automatically handle aborts, which may be unrelated to database failure and recovery from failure. Relaxing serializability In many applications, unlike with finances, absolute correctness is not needed. For example, when retrieving a list of products according to specification, in most cases it does not matter much if a product, whose data was updated a short time ago, does not appear in the list, even if it meets the specification. It will typically appear in such a list when tried again a short time later. Commercial databases provide concurrency control with a whole range of isolation levels which are in fact (controlled) serializability violations in order to achieve higher performance. Higher performance means better transaction execution rate and shorter average transaction response time (transaction duration). Snapshot isolation is an example of a popular, widely utilized efficient relaxed serializability method with many characteristics of full serializability, but still short of some, and unfit in many situations. Classes of schedules defined by relaxed serializability properties either contain the serializability class, or are incomparable with it. View and conflict serializability Mechanisms that enforce serializability need to execute in real time, or almost in real time, while transactions are running at high rates. In order to meet this requirement special cases of serializability, sufficient conditions for serializability which can be enforced effectively, are utilized. Two major types of serializability exist: view-serializability, and conflict-serializability. View-serializability matches the general definition of serializability given above. Conflict-serializability is a broad special case, i.e., any schedule that is conflictserializable is also view-serializable, but not necessarily the opposite. Conflictserializability is widely utilized because it is easier to determine and covers a substantial portion of the view-serializable schedules. Determining view-serializability of a schedule is an NP-complete problem (a class of problems with only difficult-to-compute, excessively time-consuming known solutions).

View-serializability of a schedule is defined by equivalence to a serial schedule (no overlapping transactions) with the same transactions, such that respective transactions in the two schedules read and write the same data values ("view" the same data values). Conflict-serializability is defined by equivalence to a serial schedule (no overlapping transactions) with the same transactions, such that both schedules have the same sets of respective chronologically-ordered pairs of conflicting operations (same precedence relations of respective conflicting operations). Operations upon data are read or write (a write: either insert or modify or delete). Two operations are conflicting, if they are of different transactions, upon the same datum (data item), and at least one of them is write. Each such pair of conflicting operations has a conflict type: It is either a read-write, or write-read, or a write-write conflict. The transaction of the second operation in the pair is said to be in conflict with the transaction of the first operation. A more general definition of conflicting operations (also for complex operations, which may consist each of several "simple" read/write operations) requires that they are noncommutative (changing their order also changes their combined result). Each such operation needs to be atomic by itself (by proper system support) in order to be considered an operation for a commutativity check. For example, the operations increment and decrement of a counter are both write operations (both modify the counter), but do not need to be considered conflicting (write-write conflict type) since they are commutative (e.g., already supported in the old IBM's IMS "fast path"). Only precedence (time order) in pairs of conflicting (non-commutative) operations is important when checking equivalence to a serial schedule, since schedules consisting of the same transactions can be transformed from one to another by changing orders between different transactions' operations (different transactions' interleaving), and since changing orders of commutative operations (non-conflicting) does not change an overall operation sequence result, i.e., a schedule outcome (the outcome is preserved through order change between non-conflicting operations, but typically not when conflicting operations change order). This means that if a schedule can be transformed to any serial schedule without changing orders of conflicting operations (but changing orders of non-conflicting, while preserving operation order inside each transaction), then the outcome of both schedules is the same, and the schedule is conflict-serializable by definition. Comment: A transaction can issue/request a conflicting operation and be in conflict with another transaction while its conflicting operation is delayed and not executed (e.g., blocked by a lock). Only executed (materialized) conflicting operations are relevant to conflict serializability(see more below).

Testing conflict serializability Schedule compliance with conflict serializability can be tested with the precedence graph (serializability graph, serialization graph, conflict graph) for committed transactions of the schedule. It is the directed graph representing precedence of transactions in the schedule, as reflected by precedence of conflicting operations in the transactions. In the precedence graph transactions are nodes and precedence relations are directed edges. There exists an edge from a first transaction to a second transaction, if the second transaction is in conflict with the first (see Conflict serializability above), and the conflict is materialized (i.e., if the requested conflicting operation is actually executed: in many cases a requested/issued conflicting operation by a transaction is delayed and even never executed, typically by a lock on the operation's object, held by another transaction; as long as a requested/issued conflicting operation is not executed, the conflict is nonmaterialized; non-materialized conflicts are not represented by an edge in the precedence graph). Comment: In many text books only committed transactions are included in the precedence graph. Here all transactions are included for convenience in later discussions. The following observation is a key characterization of conflict serializability: A schedule is conflict-serializable if and only if its precedence graph of committed transactions (when only committed transactions are considered) is acyclic. This means that a cycle consisting of committed transactions only is generated in the (general) precedence graph, if and only if conflict-serializability is violated. Cycles of committed transactions can be prevented by aborting an undecided (neither committed, nor aborted) transaction on each cycle in the precedence graph of all the transactions, which can otherwise turn into a cycle of committed transactions (and a committed transaction cannot be aborted). One transaction aborted per cycle is both required and sufficient number to break and eliminate the cycle (more aborts are possible, and can happen in some mechanisms, but unnecessary for serializability). The probability of cycle generation is typically low, but nevertheless, such a situation is carefully handled, typically with a considerable overhead, since correctness is involved. Transactions aborted due to serializability violation prevention are restarted and executed again immediately. Serializability enforcing mechanisms typically do not maintain a precedence graph as a data structure, but rather prevent or break cycles implicitly (e.g., SS2PL below).

Common mechanism - SS2PL Strong strict two phase locking (SS2PL) is a common mechanism utilized in database systems since their early days in the 1970s (the "SS" in the name SS2PL is newer though) to enforce both conflict serializability and strictness (a special case of recoverability which allows effective database recovery from failure) of a schedule. In this mechanism each datum is locked by a transaction before accessing it (any read or write operation): The item is marked by, associated with a lock of a certain type, depending on operation (and the specific implementation; various models with different lock types exist; in some models locks may change type during the transaction's life). As a result access by another transaction may be blocked, typically upon a conflict (the lock delays or completely prevents the conflict from being materialized and be reflected in the precedence graph by blocking the conflicting operation), depending on lock type and the other transaction's access operation type. Employing an SS2PL mechanism means that all locks on data on behalf of a transaction are released only after the transaction has ended (either committed or aborted). SS2PL is the name of the resulting schedule property as well, which is also called rigorousness. SS2PL is a special case (proper subset) of both Two-phase locking (2PL) and Commitment ordering (CO; see Other enforcing techniques below). Mutual blocking between transactions results in a deadlock, where execution of these transactions is stalled, and no completion can be reached. Thus deadlocks need to be resolved to complete these transactions' execution and release related computing resources. A deadlock is a reflection of a potential cycle in the precedence graph, that would occur without the blocking when conflicts are materialized. A deadlock is resolved by aborting a transaction involved with such potential cycle, and breaking the cycle. It is often detected using a wait-for graph (a graph of conflicts blocked by locks from being materialized; it can be also defined as the graph of non-materialized conflicts; conflicts not materialized are not reflected in the precedence graph and do not affect serializability), which indicates which transaction is "waiting for" lock release by which transaction, and a cycle means a deadlock. Aborting one transaction per cycle is sufficient to break the cycle. Transactions aborted due to deadlock resolution are restarted and executed again immediately. Other enforcing techniques Other known mechanisms include:

Precedence graph (or Serializability graph, Conflict graph) cycle elimination Two-phase locking (2PL) Timestamp ordering (TO) (Local) commitment ordering (CO) Serializable snapshot isolation (SerializableSI) The above (conflict) serializability techniques in their general form do not provide recoverability. Special enhancements are needed for adding recoverability. Optimistic versus pessimistic techniques Concurrency control techniques are of two major types:

1. Pessimistic: In Pessimistic concurrency control a transaction blocks data access operations of other transactions upon conflicts, and conflicts are non-materialized until blocking is removed. This to ensure that operations that may violate serializability (and in practice also recoverability) do not occur. 2. Optimistic: In Optimistic concurrency control data access operations of other transactions are not blocked upon conflicts, and conflicts are immediately materialized. When the transaction reaches the ready state, i.e., its running state has been completed, possible serializability (and in practice also recoverability) violation by the transaction's operations (relatively to other running transactions) is checked: If violation has occurred, the transaction is typically aborted (sometimes aborting another transaction to handle serializability violation is preferred). Otherwise it is committed.

Schedule classes containment: An arrow from class A to class B indicates that class A strictly contains B; a lack of a directed path between classes means that the classes are incomparable. A property is inherently-blocking, if it can be enforced only by blocking transactions data access operations until certain events occur in other transactions (Raz 1992). The main difference between the two types is the way conflicts are handled. A pessimistic method blocks a transaction operation upon conflict, and generates a nonmaterialized conflict, while an optimistic method does not block, and generates a materialized conflict. At any type, conflicts (either materialized or non-materialized) are generated by the way transaction operations are scheduled, independently of the type. A cycle of committed transactions (with materialized conflicts) in the precedence

graph (conflict graph) represents a serializability violation, and should be avoided for maintaining serializability. A cycle of (non-materialized) conflicts in the wait-for graph represents a deadlock, that should be resolved by breaking the cycle. Both cycle types result from conflicts, and should be broken. At any type conflicts should be detected and considered, with similar overhead for both materialized and nonmaterialized (typically using mechanisms like locking, while either blocking for locks, or not blocking for materialized conflicts). In a blocking method typically a context switching occurs upon conflict, with (additional) incurred overhead. Otherwise blocked transactions' related computing resources remain idle, unutilized, which may be a worse alternative. When conflicts do not occur frequently, optimistic methods typically have an advantage. With different transactions loads (mixes of transaction types) one technique type (i.e., either optimistic or pessimistic) may provide better performance than the other. Some mechanisms mix blocking in certain situations (and thus they are pessimistic) with not blocking in other situations, and have been referred to as semi-optimistic. Such mechanisms employ both materialized and non-materialized conflicts (e.g., Strict CO (SCO)). In this case cycles in the graph which is the union of the (regular) conflict graph and the (reversed edge) wait-for graph are used to characterize serializability violations and deadlocks (see the Augmented conflict graph in Commitment ordering and below in the section Commitment ordering and how it works in a distributed environment). Serializable multi-version concurrency control Multi-version concurrency control (MVCC) is a common way today to increase concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations of several last relevant versions (of each object), depending on scheduling method. MVCC can be combined with all the serializability techniques listed above (except SerializableSI which is originally MVCC based). It is utilized in most general-purpose DBMS products. MVCC is especially popular nowadays through the relaxed serializability (see above) method Snapshot isolation (SI) which provides better performance than most known serializability mechanisms (at the cost of possible serializability violation in certain cases). SerializableSI, which is an efficient enhancement of SI to make it serializable, is intended to provide an efficient serializable solution. SerializableSI has been analyzed[11] [12] via a general theory of MVCC[13] developed for properly defining Multi-version Commitment ordering (MVCO). This theory (different from previous MVCC theories)

includes single-version concurrency control theory (the usual) as a special case when the maximum number of versions of each database object is limited to one. Multi-version Commitment ordering (MVCO), the multi-version variant of Commitment ordering (CO), is another multi-version serializability technique (e.g., combining MVCO with SI results in COSI; performance of COSI has not yet been compared with that of SerializableSI) with the advantage of providing efficient Distributed serializability (see below; unlike SerializableSI for which efficient distribution is unknown and probably does not exist, since it is not MVCO compliant which is a necessary condition for guaranteeing global serializability acrossautonomous transactional objects). Distributed serializability Distributed serializability is the serializability of a schedule of a transactional distributed system (e.g., a distributed database system). Such system is characterized by distributed transactions (also called global transactions), i.e., transactions that span computer processes (a process abstraction in a general sense, depending on computing environment; e.g., operating system's thread) and possibly network nodes. A distributed transaction comprises more than one local subtransactions that each has states as described above for a database transaction. A local sub-transaction comprises a single process, or more processes that typically fail together (e.g., in a single processor core). Distributed transactions imply a need in Atomic commitment protocol to reach consensus among its local sub-transactions on whether to commit or abort. Such protocols can vary from a simple (one-phase) hand-shake among processes that fail together, to more sophisticated protocols, like Two-phase commit, to handle more complicated cases of failure (e.g., process, node, communication, etc. failure). Distributed serializability is a major goal of distributed concurrency control for correctness. With the proliferation of the Internet, Cloud computing, Grid computing, and small, portable, powerful computing devices (e.g., smartphones) the need for effective distributed serializability techniques to ensure correctness in and among distributed applications seems to increase. Distributed serializability is achieved by implementing distributed versions of the known centralized techniques. Typically all such distributed versions require utilizing conflict information (either of materialized or non-materialized conflicts, or equivalently, transaction precedence or blocking information; conflict serializability is usually utilized) that is not generated locally, but rather in different processes, and remote locations. Thus information distribution is needed (e.g., precedence relations, lock information,

timestamps, or tickets). When the distributed system is of a relatively small scale, and message delays across the system are small, the centralized concurrency control methods can be used unchanged, while certain processes or nodes in the system manage the related algorithms. However, in a large-scale system (e.g., Grid and Cloud), due to the distribution of such information, substantial performance penalty is typically incurred, even when distributed versions of the methods (Vs. centralized) are used, primarily due to computer and communication latency. Also, when such information is distributed, related techniques typically do not scale well. A well known example with scalability problems is a distributed lock manager, which distributes lock (non-materialized conflict) information across the distributed system to implement locking techniques. The only known exception, which needs only local information for its distributed version (to reach distributed serializability), and thus scales with no penalty and avoids concurrency control information distribution delays, is Commitment ordering (CO; including its many variants, e.g., SS2PL). CO can provide unbounded scalability also in new experimental distributed database systems architectures, where local sub-transactions in each processor core are one-threaded and serial (i.e., with no concurrency), but the overall (distributed) schedule is serializable (e.g., as in H-Store[14] and VoltDB; see also Hypothetical Multi Single-Threaded Core (MuSiC) environment). Distributed serializability and Commitment ordering Commitment ordering(CO; or commit ordering, or commit-order-serializability) is a serializability technique, both centralized and distributed. CO is also the name of the resulting schedule property, defining a broad subclass of the Conflict serializability class of schedules. The most significant aspects of CO that make it a uniquely effective general distributed serializability solution are: 1. Seamless, low overhead integration with any concurrency control mechanism, with neither changing or blocking any transaction's operation scheduling (thus allowing and keeping optimistic implementations), nor adding any new operation (like "Take timestamp" or "Take ticket"); but CO is also a standalone serializability mechanism. 2. No need of conflict or equivalent information distribution (e.g., local precedence relations, locks, timestamps, or tickets). 3. Automatic distributed deadlock resolution, and 4. Scalability.

All these aspects, except the first, are also possessed by the popular SS2PL (see above), which is a special case of CO, but blocking and constrained. Under certain general conditions distributed CO can be used effectively for guaranteeing distributed serializability without paying the penalty of distributing conflict information. This is a major distinguishing characteristic of distributed CO from other distributed serializability techniques. CO's net effect may be some commit delays (but no more added delay than that with its special cases, e.g., SS2PL, and on the average less). Instead of distributing conflict information, distributed CO utilizes (unmodified) messages of an atomic commitment protocol (e.g., the Two-phase commit protocol (2PC)), which are used at any case, also without CO. Such protocol is used to coordinate atomicity ofdistributed transactions, and is an essential component of any distributed transaction environment. CO can be applied to many distributed transactional systems for guaranteeing distributed serializability. Three conditions should be met (which can be enforced in the design of most distributed transactional systems): 1. Data partition: Recoverable data (transactional data, i.e., data under transactions' control (not to be confused with the Recoverability property of a schedule)) are partitioned among (possibly distributed) transactional data managers (also called resource managers) that the distributed system comprises, i.e., each recoverable datum (data item) is controlled by a single data manager (e.g., as common in a Shared nothing architecture). 2. Participants in atomic commitment protocol: These data managers are the participants of the system's atomic commitmentprotocol(s) (this requirement is not necessarily met in general, but is quite common and not difficult to be imposed), and 3. CO compliance: Each such data manager guarantees CO locally (i.e., has a CO compliant local schedule, which can be quite easilyachieved, possibly alongside with any relevant concurrency control mechanism; nesting is possible: each data manager may be distributed with its own private, separate atomic commitment protocol). These are conditions applied managers (resource managers). to the distributed system's transactional data

Distributed CO utilizes the (unmodified) atomic commitment protocol messages without any additional information communicated. This applies also to SS2PL (see above), which is a locking based special case of CO, typically utilized also for distributed serializability,

and thus, can be implemented in a distributed transactional system without a distributed lock manager, a fact that in many cases has been overlooked. An important side-benefit of distributed CO is that distributed deadlocks (deadlocks which span each two or more transactional data managers) are automatically resolved by atomic commitment (including the special case of a completely SS2PL based distributed system; also this has not been noticed in any research article except the CO articles until today (2009)). Global serializability and Commitment ordering Guaranteeing distributed serializability in a heterogeneous system that comprises several transactional objects (i.e., with states controlled by atomic transactions; e.g., database systems) with different concurrency controls has been considered a difficult problem. In such system distributed serializability is usually called Global serializability (or Modular serializability: Also each object maintains serializability). For example, in a federated database system or any other more loosely defined multidatabase system, which are typically distributed in a communication network, transactions span multiple (and possibly distributed) databases. The database systems involved may utilize different concurrency control mechanism for serializability. However, even if every local schedule of a single database is serializable, the global schedule of the whole system is not necessarily serializable. The massive communication exchanges of conflict information needed between database systems to reach conflict serializability would lead to unacceptable performance, primarily due to computer and communicationlatency. The problem of achieving global serializability effectively in such heterogeneous systems had been characterized as open until the introduction of Commitment ordering (CO) in 1991. Even years after CO's introduction the problem has been considered unsolvable, due to misunderstanding of the CO solution (see Global serializability). On the other hand, later, for years CO has been extensively utilized in research articles (which is evident by the simple definition of CO) without mentioning the name "Commitment ordering", and without providing satisfactory correctness proofs (i.e., why global serializability is guaranteed; e.g., see Grid computing, Cloud computing, and Commitment ordering below). CO provides an effective solution since when each database system in a heterogeneous multidatabase system is CO compliant, the entire system obeys the three conditions for distributed CO given in the previous section. Thus global serializability is guaranteed, as

well as automatic global deadlock (a deadlock over two or more transactional objects) resolution. SS2PL implies CO, and any SS2PL-compliant database can participate in multidatabase systems that utilize the CO solution for global serializability without any modification or addition of a CO (2PC) protocol since the eighties, and SS2PL in conjunction with 2PC is the de facto standard to reach global serializability across (the common, SS2PLcompliant) databases. A algorithm component. As a matter of fact global serializability has been achieved in all-SS2PL multidatabase environments with the Two-phase commitlso the following is true: any CO-compliant database system can transparently join any such existing SS2PL based solution for global serializability. Grid computing, Cloud computing, and Commitment ordering Commitment ordering (CO; including its variants, e.g., SS2PL), being an attractive solution for Distributed serializability in general and Global serializability in particular, seems to be an ideal solution for guaranteeing serializability in Grid computing and Cloud computing. Serializability is a must for many applications within such environments. CO being a necessary condition for guaranteeing global serializability across autonomous transactional objects is actually the only general solution for distributed serializability in such environments, which typically include many autonomous transactional objects (without CO serializability is very likely to be quickly violated, even if a single autonomous object, uncompliant with CO, participates). To support CO, the Grid and Cloud infrastructures need to provide Atomic commitment protocol (e.g., 2PC) services only. Each participating transactional object (e.g., a database system) needs to support some variant of CO by itself, in order to enjoy global serializability while inter-operating with other transactional objects (all CO variants transparently inter-operate). Each object can choose any proper CO variant, in order to optimize performance, from implementing CO with Strictness (e.g., SCO and SS2PL), to Optimistic CO, generic (optimistic) ECO, and MVCO (together with any local variant of recoverability). Thus the infrastructure support for CO (i.e., atomic commitment protocol) is minimal and already common. Utilizing CO in an environment with Grid and Cloud characteristics is described in . Re:GRIDiTis a proposed approach to support transaction management with data replication in the Grid and the Cloud. Its concurrency control is based on CO. This approach extends the DSGT protocol, which has been proposed for Transactional processes earlier and utilizes CO as well. Re:GRIDit utilizes an optimistic version of CO.

It uses internal system transactions for replication, which makes replication for high availability transparent to a user. The approach does not suggest to use an external atomic commitment protocol, but rather uses an integrated solution, which must include some form of atomic commitment protocol to achieve atomicity of distributed transactions. No benefit of an integrated atomic commitment protocol seems to exist. Correctness arguments regarding global serializability are unsatisfactory: Correctness arguments, which must use voting deadlock resolution to eliminate unavoidable global cycles in the precedence graph, are not given. Such arguments are detailed by the CO publications which are referenced by neither the Re:GRIDiT nor the DSGT articles (see also explanation about CO below). Also, no concurrency control alternatives for different transactional objects (which exist in the general CO solution) are suggested by Re:GRIDiT. See more details on Re:GRIDiT in The History of Commitment Ordering. Commitment ordering and how it works in a distributed environment A schedule has the Commitment ordering (CO) property, if the order in time of its transactions' commitment events is compatible with the precedence (partial) order of the respective transactions, as determined by their local conflict graph (precedence graph, serializability graph). Any conflict serializable schedule can be made a CO compliant one, without aborting any transaction in the schedule, by delaying commitment events to comply with the needed partial order. Enforcing the CO property in each local schedule is an effective way to enforce conflict serializability globally: CO is a broad special case of conflict serializability, and enforcing it locally in each local schedule also enforces it, and hence serializability, globally. The only needed communication between the databases for this purpose is that of the atomic commitment protocol (such as the Two-phase commit protocol(2PC)), which exists in most distributed environments and already must be utilized for the atomicity of each distributed transaction, independently of concurrency control and CO. Thus CO incurs no communication overhead. In each single database, a local CO algorithm can run alongside any local concurrency control mechanism (serializability enforcing mechanism) without interfering with its resource access scheduling strategy, and without adding any access operations to transactions (like acquiring timestamps or tickets) which reduces performance. As such CO provides a general, high performance, fully distributed solution. Neither central processing component nor central data structure are needed. Moreover, CO works also in heterogeneous environments with different database system types and other multipletransactional objects that may employ different serializability mechanisms. The CO solution scales up effectively with network size and

the number of databases without any negative impact on performance (assuming the statistics of a single distributed transaction, e.g., the average number of databases involved with such transaction, are unchanged). This makes CO instrumental for global concurrency control. CO implementation by itself is not sufficient as a concurrency control mechanism, since it lacks the important recoverability property. The commitment event of a distributed transaction is always generated by some atomic commitment protocol, utilized to reach consensus among its processes on whether to commit or abort it. This procedure is always carried out for distributed transactions, independently of CO. The atomic commitment protocol plays a central role in the distributed CO algorithm. In case of incompatible local commitment orders in two or more databases (no global partial order can embed the respective local partial orders together), which implies a global cycle (a cycle that spans two or more databases) in the global conflict graph, CO generates a voting-deadlock for the atomic commitment protocol (resulting in missing votes), and the protocol resolves that deadlock by aborting a transaction with a missing vote on the cycle and breaking the cycle. Furthermore: with CO, the global augmented conflict graph provides a complete characterization of votingdeadlocks. In this graph also being blocked by a lock to prevent a conflict from being materialized is represented by an edge as a materialized conflict. It is the union of the regular precedence graph with the (reversed edge, for time-order compatibility of conflicting operations) regular wait-for graph, and reflects both materialized and nonmaterialized conflicts. This graph is a (reversed edge) wait-for graph for voting (an edge from a first transaction to a second transaction indicates that either the voting or local commit of the first is waiting to the second to end), and a global cycle means a votingdeadlock. Thus also global deadlocks due to locking (when at least one edge for lock blocking exists on a global cycle) generate voting-deadlocks and are resolved automatically by the same mechanism (such locking based global deadlocks are resolved automatically also in the common, completely SS2PL based environments, but no research article besides the CO articles is known to notice this fact). No implementation of such global graph is needed, and it is used only to explain the behavior of CO and its effectiveness in both guaranteeing global serializability and resolving locking-based global deadlocks. A deadlock is a situation where in two or more competing actions are each waiting for the other to finish, and thus neither ever does. It is often seen in a paradox like the "chicken or the egg". The concept of a Catch 22 is similar.

When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.

In computer science, Coffman deadlock refers to a specific condition when two or more processes are each waiting for each other to release a resource, or more than two processes are waiting for resources in a circular chain (see Necessary conditions). Deadlock is a common problem in multiprocessing where many processes share a specific type of mutually exclusive resource known as a software lock orsoft lock. Computers intended for the timesharing and/or real-time markets are often equipped with a hardware lock (or hard lock) which guarantees exclusive access to processes, forcing serialized access. Deadlocks are particularly troubling because there is no generalsolution to avoid (soft) deadlocks. This situation may be likened to two people who are drawing diagrams, with only one pencil and one ruler between them. If one person takes the pencil and the other takes the ruler, a deadlock occurs when the person with the pencil needs the ruler and the person with the ruler needs the pencil to finish his work with the ruler. Neither request can be satisfied, so a deadlock occurs. The telecommunications description of deadlock is weaker than Coffman deadlock because processes can wait for messages instead of resources. A deadlock can be the result of corrupted messages or signals rather than merely waiting for resources. For example, a dataflow element that has been directed to receive input on the wrong link will never proceed even though that link is not involved in a Coffman cycle. Examples An example of a deadlock which may occur in database products is the following. Client applications using the database may require exclusive access to a table, and in order to gain exclusive access they ask for a lock. If one client application holds a lock on a table and attempts to obtain the lock on a second table that is already held by a second client application, this may lead to deadlock if the second application then attempts to obtain the lock that is held by the first application. (But this particular type of deadlock is easily prevented, e.g., by using an all-ornone resource allocation algorithm.) Another example might be a text formatting program that accepts text sent to it to be processed and then returns the results, but does so only after receiving "enough" text to work on (e.g. 1KB). A text editor program is written that sends the formatter some text and then waits for the results. In this case a deadlock may occur on the last block of text. Since the formatter may not have sufficient text for processing, it will suspend itself while waiting for the additional text, which

will never arrive since the text editor has sent it all of the text it has. Meanwhile, the text editor is itself suspended waiting for the last output from the formatter. This type of deadlock is sometimes referred to as a deadly embrace (properly used only when only two applications are involved) or starvation. However, this situation, too, is easily prevented by having the text editor send a forcing message (e.g. EOF, (End Of File)) with its last (partial) block of text, which will force the formatter to return the last (partial) block after formatting, and not wait for additional text. In communications, corrupted messages may cause computers to go into bad states where they are not communicating properly. The network may be said to be deadlocked even though no computer is waiting for a resource. This is different than a Coffman deadlock.

Necessary conditions There are four necessary and sufficient conditions for a Coffman deadlock to occur, known as the Coffman conditions from their first description in a 1971 article by E. G. Coffman. 1. Mutual exclusion condition: a resource that cannot be used by more than one process at a time 2. Hold and wait condition: processes already holding resources may request new resources held by other processes 3. No preemption condition: No resource can be forcibly removed from a process holding it, resources can be released only by the explicit action of the process. The first three conditions are necessary but not sufficient for a deadlock to exist. For deadlock to actually take place, a fourth condition is required: 1. Circular wait condition: two or more processes form a circular chain where each process waits for a resource that the next process in the chain holds. When circular waiting is triggered by mutual exclusion operations it is sometimes called lock inversion. Prevention Removing the mutual exclusion condition means that no process may have exclusive access to a resource. This proves impossible for resources that cannot be spooled, and even with spooled resources deadlock could still occur. Algorithms that avoid mutual exclusion are called non-blocking synchronization algorithms.

The "hold and wait" conditions may be removed by requiring processes to request all the resources they will need before starting up (or before embarking upon a particular set of operations); this advance knowledge is frequently difficult to satisfy and, in any case, is an inefficient use of resources. Another way is to require processes to release all their resources before requesting all the resources they will need. This too is often impractical. (Such algorithms, such as serializing tokens, are known as the all-or-none algorithms.)

A "no preemption" (lockout) condition may also be difficult or impossible to avoid as a process has to be able to have a resource for a certain amount of time, or the processing outcome may be inconsistent or thrashing may occur. However, inability to enforce preemption may interfere with a priority algorithm. (Note: Preemption of a "locked out" resource generally implies a rollback, and is to be avoided, since it is very costly in overhead.) Algorithms that allow preemption include lock-free and wait-free algorithms and optimistic concurrency control.

The circular wait condition: Algorithms that avoid circular waits include "disable interrupts during critical sections", and "use a hierarchy to determine a partial ordering of resources" (where no obvious hierarchy exists, even the memory address of resources has been used to determine ordering) and Dijkstra's solution.

Avoidance Deadlock can be avoided if certain information about processes is available in advance of resource allocation. For every resource request, the system sees if granting the request will mean that the system will enter an unsafe state, meaning a state that could result in deadlock. The system then only grants requests that will lead to safe states. In order for the system to be able to figure out whether the next state will be safe or unsafe, it must know in advance at any time the number and type of all resources in existence, available, and requested. One known algorithm that is used for deadlock avoidance is the Banker's algorithm, which requires resource usage limit to be known in advance. However, for many systems it is impossible to know in advance what every process will request. This means that deadlock avoidance is often impossible. Two other algorithms are Wait/Die and Wound/Wait, each of which uses a symmetry-breaking technique. In both these algorithms there exists an older process (O) and a younger process (Y). Process age can be determined by a timestamp at process creation time. Smaller time stamps are older processes, while larger timestamps represent younger processes. Wait/Die Wound/Wait

O needs a resource held by Y O waits Y needs a resource held by O Y dies

Y dies Y waits

It is important to note that a process may be in an unsafe state but would not result in a deadlock. The notion of safe/unsafe states only refers to the ability of the system to enter a deadlock state or not. For example, if a process requests A which would result in an unsafe state, but releases B which would prevent circular wait, then the state is unsafe but the system is not in deadlock. Detection Often, neither avoidance nor deadlock prevention may be used. Instead deadlock detection and process restart are used by employing an algorithm that tracks resource allocation and process states, and rolls back and restarts one or more of the processes in order to remove the deadlock. Detecting a deadlock that has already occurred is easily possible since the resources that each process has locked and/or currently requested are known to the resource scheduler or OS. Detecting the possibility of a deadlock before it occurs is much more difficult and is, in fact, generally undecidable, because the halting problem can be rephrased as a deadlock scenario. However, in specific environments, using specific means of locking resources, deadlock detection may be decidable. In the general case, it is not possible to distinguish between algorithms that are merely waiting for a very unlikely set of circumstances to occur and algorithms that will never finish because of deadlock. Deadlock detection techniques include, but is not limited to, Model checking. This approach constructs a Finite State-model on which it performs a progress analysis and finds all possible terminal sets in the model. These then each represent a deadlock. Distributed deadlock Distributed deadlocks can occur in distributed systems when distributed transactions or concurrency control is being used. Distributed deadlocks can be detected either by constructing a global wait-for graph, from local wait-for graphs at a deadlock detector or by a distributed algorithm like edge chasing. In a Commitment ordering based distributed environment (including the Strong strict two-phase locking (SS2PL, or rigorous) special case) distributed deadlocks are resolved automatically by the atomic commitment protocol (e.g. two-phase commit (2PC)), and no global wait-for graph or other resolution mechanism are needed. Similar automatic global deadlock resolution occurs also

in environments that employ 2PLthat is not SS2PL (and typically not CO; see Deadlocks in 2PL). However 2PL that is not SS2PL is rarely utilized in practice. Phantom deadlocks are deadlocks that are detected in a distributed system due to system internal delays, but no longer actually exist at the time of detection. Distributed deadlock prevention Lets consider the "When two trains approach each other at a crossing" example defined above. Just-in-time Prevention works like having a person standing at the crossing (the crossing guard) with a switch that will let only one train onto "super tracks" which runs above and over the other waiting train(s). Before we look into threads using Just-in-time Prevention, lets look into the conditions which already exist for regular locking. For non-recursive locks, this lock may be entered only once (where a single thread entering twice without unlocking will cause a deadlock, or throw an exception to enforce circular wait prevention).

For recursive locks, only one thread is allowed to pass through a lock. If any other threads enter the lock, they must wait until the initial thread that passed through completes n number of times it has entered.

So the issue with the first one is it does no deadlock prevention at all. The second doesn't do Distributed deadlock prevention. But the 2nd one is redefined to prevent a deadlock scenario the first one doesn't address. And the only other scenario I am aware of that may cause deadlocks is when two or more lockers lock on each other. So why not expand the definition above one more time? Well, we can, if we use add a variable to the recursive lock condition which guarantees that at least one thread runs among all locksdistributed deadlock prevention. And just like having a super track in the train example, I use "super thread" in this locking example. Recursively, only one thread is allowed to pass through a lock. If other threads enter the lock, they must wait until the initial thread that passed through completes n number of times. But if the number of threads that enter locking equal the number that are locked, assign one thread as the super-thread, and only allow it to run (tracking the number of times it enters/exits locking) until it completes.

After a super-thread is finished, the condition changes back to using the logic from the recursive lock, and the exiting super-thread 1. sets itself as not being a super-thread

2. notifies the locker that other locked, waiting threads need to re-check this condition If a deadlock scenario exists, set a new super-thread and follow that logic. Otherwise, resume regular locking. Issues not addressed above A lot of confusion revolves around the halting problem. But this logic in-no-way solves the halting problem. This is because we know and control the conditions in which locking occurs, giving us a specific solution (instead of the otherwise required general solution the halting problem requires). Still this locker prevents all deadlocked! Well, it does when only considering locks using this logic. But if it is use with other locking mechanisms, a lock that is started never unlocks (e.g. exception thrown jumping out without unlocking, looping indefinitely within a lock, or coding error forgetting to call unlock), deadlocking is very much possible. And to increase our condition to include these would require solving the halting issue, since we would be dealing with conditions we know nothing about and are unable to change. Another issue is that this doesn't address the temporary deadlocking issue (not really a deadlock, but a performance killer), where two or more threads lock on each other while another unrelated threads is running. These temporary deadlocks could have a thread running exclusively within them, increasing parallelism. But because of how the distributed deadlock detection works for all locks, and not subsets therein, the unrelated running thread must complete before performing the super-thread logic to remove the temporary deadlock. I hope you see the temporary live-lock scenario in the above. If another unrelated running thread begins before the first unrelated thread exits, another duration of temporary deadlocking will occur. And if this happens continuously (extremely rare), the temporary deadlock can be extended until right before the program exits, when the other unrelated threads are guaranteed to finish (because of the guarantee that one thread will always run to completion). Further expansion This can be further expanded to involve additional logic to increase parallelism where temporary deadlocks might otherwise occur. But for each step of adding more logic, we add more overhead.

A couple of examples include: expanding distributed super-thread locking mechanism to consider each subset of existing locks; Wait-For-Graph (WFG) algorithms, which tracks all cycles that cause deadlocks (including temporary deadlocks); and heuristics algorithms which don't necessarily increase parallelism in 100% of the places that temporary deadlocks are possible, but instead compromise by solving them in enough places that performance/overhead vs parallelism is acceptable (e.g. for each processor available, work towards finding deadlock cycles less than the number of processors + 1 deep). Livelock A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing. Livelock is a special case of resource starvation; the general definition only states that a specific process is not progressing. A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time. Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be avoided by ensuring that only one process (chosen randomly or by priority) takes action. A distributed data store is a network in which a user stores his or her information on a number of peer network nodes. The user also usually reciprocates and allows users to use his or her computer as a storage node as well. Information may or may not be accessible to other users depending on the design of the network. Most of the peer-to-peer networks do not have distributed data stores in that the user's data is only available when their node is on the network. However, this distinction is somewhat blurred in a system such as BitTorrent, where it is possible for the originating node to go offline but the content to continue to be served. Still, this is only the case for individual files requested by the redistributors, as contrasted with a network such as Freenet where all computers are made available to serve all files. Distributed data stores typically use an error detection and correction technique. Some distributed data stores (such as Parchive over NNTP) use forward error correction techniques to recover the original file when parts of that file are damaged or unavailable. Others try again to download that file from a different mirror.

Unit 5 Computer systems, both software and hardware, consist of modules, or components. Each component is designed to operate correctly, i.e., to obey to or meet certain consistency rules. When components that operate concurrently interact by messaging or by sharing accessed data (in memory or storage), a certain component's consistency may be violated by another component. The general area of concurrency control provides rules, methods, design methodologies, and theories to maintain the consistency of components operating concurrently while interacting, and thus the consistency and correctness of the whole system. Introducing concurrency control into a system means applying operation constraints which typically result in some performance reduction. Operation consistency and correctness should be achieved with as good as possible efficiency, without reducing performance below reasonable.

Concurrency control in databases Comments: 1. This section is applicable to all transactional systems, i.e., to all systems that use database transactions (atomic transactions; e.g., transactional objects in Systems management and in networks of smartphones), not only database management systems (DBMSs). 2. DBMSs need to deal also with concurrency control issues not typical just to database transactions but rather to operating systems in general. These issues (e.g., see Concurrency control in operating systems below) are out of the scope of this section. Concurrency control in Database management systems (DBMS; e.g., Bernstein et al. 1987, Weikum and Vossen 2001), other transactionalobjects, and related distributed applications (e.g., Grid computing and Cloud computing) ensures that database transactions are performedconcurrently without violating the data integrity of the respective databases. Thus concurrency control is an essential element for correctness in any system where two database transactions or more, executed with time overlap, can access the same data, e.g., virtually in any general-purpose database system. Consequently a vast body of related research has been accumulated since database systems have emerged in the early 1970s. A well established

concurrency control theory exists for database systems: serializability theory, which allows to effectively design and analyze concurrency control methods and mechanisms. To ensure correctness, a DBMS usually guarantees that only serializable transaction schedules are generated, unless serializability isintentionally relaxed to increase performance, but only in cases that application correctness is not harmed. For maintaining correctness in cases of failed (aborted) transactions (which can always happen for many reasons) schedules also need to have the recoverability property. A DBMS also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the related database. Overall transaction characterization is usually summarized by the ACID rules below. As databases become distributed, or cooperate in distributed environments (e.g., Cloud computing), the effective distribution of concurrency control mechanisms receives special attention. Database transaction and the ACID rules The concept of a database transaction (or atomic transaction) has evolved in order to enable both a well understood database system behavior in a faulty environment where crashes can happen any time, and recovery from a crash to a well understood database state. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring lock, etc.), an abstraction supported in database and also other systems. Each transaction has well defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands). Every database transaction obeys the following rules (by support in the database system; i.e., a database system is designed to guarantee them for the transactions it runs): Atomicity - Either the effects of all or none of its operations remain ("all or nothing" semantics) when a transaction is completed (committed or aborted respectively). In other words, to the outside world a committed transaction appears (by its effects) to be indivisible, atomic, and an aborted transaction does not leave effects at all, as if never existed.

Consistency - Every transaction must leave the database in a consistent (correct) state, i.e., maintain the predetermined integrity rules of the database (constraints upon and among the database's objects). A transaction must transform a database from one consistent state to another consistent state (it is the responsibility of the transaction's programmer to make sure that the transaction itself is correct, i.e., performs correctly what it intends to perform while maintaining the integrity rules). Thus since a database can be normally changed only by

transactions, all the database's states are consistent. An aborted transaction does not change the state. Isolation - Transactions cannot interfere with each other. Moreover, usually the effects of an incomplete transaction are not visible to another transaction. Providing isolation is the main goal of concurrency control.

Durability - Effects of successful (committed) transactions must persist through crashes (typically by recording the transaction's effects and its commit event in a non-volatile memory).

Why is concurrency control needed? If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction concurrency exists. However, if concurrent transactions with interleaving operations are allowed in an uncontrolled manner, some unexpected, undesirable result may occur. Here are some typical examples: 1. The lost update problem: A second transaction writes a second value of a dataitem (datum) on top of a first value written by a first concurrent transaction, and the first value is lost to other transactions running concurrently which need, by their precedence, to read the first value. The transactions that have read the wrong value end with incorrect results. 2. The dirty read problem: Transactions read a value written by a transaction that has been later aborted. This value disappears from the database upon abort, and should not have been read by any transaction ("dirty read"). The reading transactions end with incorrect results. 3. The incorrect summary problem: While one transaction takes a summary over the values of all the instances of a repeated data-item, a second transaction updates some instances of that data-item. The resulting summary does not reflect a correct result for any (usually needed for correctness) precedence order between the two transactions (if one is executed before the other), but rather some random result, depending on the timing of the updates, and whether certain update results have been included in the summary or not. Concurrency control mechanisms Types of mechanisms The main categories of concurrency control mechanisms are:

Optimistic - Delay the checking of whether a transaction meets the isolation and other integrity rules (e.g., serializability andrecoverability) until its end, without blocking any of its (read, write) operations ("...and be optimistic about the rules being met..."), and then abort a transaction to prevent the violation, if the desired rules are to be violated upon its commit. An aborted transaction is immediately restarted and re-executed, which incurs an obvious overhead (versus executing it to the end only once). If not too many transactions are aborted, then being optimistic is usually a good strategy.

Pessimistic - Block an operation of a transaction, if it may cause violation of the rules, until the possibility of violation disappears. Blocking operations is typically involved with performance reduction.

Semi-optimistic - Block operations in some situations, if they may cause violation of

some rules, and do not block in other situations while delaying rules checking to transaction's end, as done with optimistic. Different categories provide different performance, i.e., different average transaction completion rates (throughput), depending on transaction types mix, computing level of parallelism, and other factors. If selection and knowledge about trade-offs are available, then category and method should be chosen to provide the highest performance. Many methods for concurrency control exist. Most of them can be implemented within either main category above. The major methods, which have each many variants, and in some cases may overlap or be combined, are: 1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned to the data. Access of a transaction to a data item (database object) locked by another transaction may be blocked (depending on lock type and access operation type) until lock release. 2. Serialization graph checking (also called Serializability, or Conflict, or Precedence graph checking) - Checking for cycles in the schedule's graph and breaking them by aborts. 3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or checking access to data by timestamp order. 4. Commitment ordering (or Commit ordering; CO) - Controlling or checking transactions' order of commit events to be compatible with their respective precedence order.

Other major concurrency control types that are utilized in conjunction with the methods above include: Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations of several last relevant versions (of each object) depending on scheduling method.

Index concurrency control - Synchronizing access operations to indexes, rather than to user data. Specialized methods provide substantial performance gains.

The most common mechanism type in database systems since their early days in the 1970s has been Strong strict Two-phase locking(SS2PL; also called Rigorous scheduling or Rigorous 2PL) which is a special case (variant) of both Two-phase locking (2PL) andCommitment ordering (CO). It is pessimistic. In spite of its long name (for historical reasons) the idea of the mechanism is simple: "Release all locks applied by a transaction only after the transaction has ended." SS2PL (or Rigorousness) is also the name of the set of all schedules that can be generated by this mechanism, i.e., these are SS2PL (or Rigorous) schedules, have the SS2PL (or Rigorousness) property. Major goals Concurrency control mechanisms are usually designed to achieve some of, or all the following goals: Serializability For correctness, a common major goal of most concurrency control mechanisms is generating schedules with the Serializability property. Without serializability undesirable phenomena may occur, e.g., money may disappear from accounts, or be generated from nowhere. Serializability of a schedule means equivalence (in the resulting database values) to some serial schedule with the same transactions (i.e., in which transactions are sequential with no overlap in time, and thus completely isolated from each other: No concurrent access by any two transactions to the same data is possible). Serializability is considered the highest level of isolation among database transactions, and the major correctness criterion for concurrent transactions. In some cases compromised, relaxed forms of serializability are allowed for better performance (e.g., the popular Snapshot isolation mechanism), if application's correctness is not violated by the relaxation. Almost all implemented concurrency control mechanisms achieve serializability by providing Conflict serializablity, a broad special case of serializability (i.e., it covers, enables

most serializable schedules, and does not impose significant additional delay-causing constraints) which can be implemented efficiently. Recoverability Concurrency control typically also ensures the Recoverability property of schedules for maintaining correctness in cases of aborted transactions (which can always happen for many reasons). Recoverability means that no committed transaction in a schedule has read data written by an aborted transaction. Such data disappear from the database (upon the abort) and are parts of an incorrect database state. Reading such data violates the consistency rule of ACID. Unlike Serializability, Recoverability cannot be compromised, relaxed at any case, since any relaxation results in quick database integrity violation upon aborts. The major mechanisms listed above are serializability mechanisms. None of them in its general form automatically provides recoverability, and special considerations and mechanism enhancements are needed to support recoverability. A commonly utilized special case of recoverability is Strictness, which allows efficient database recovery from failure (but excludes optimistic implementations; e.g, Strict CO (SCO) does not not allow an optimistic implementation, but allows semi-optimistic). Comment: Note that the Recoverability property is needed even if no database failure occurs and no database recovery from failure is needed. It is rather needed to correctly automatically handle transaction aborts, which may be unrelated to database failure and recovery from it. Distribution: Distributed serializability and Commitment ordering As database systems become distributed, or cooperate in distributed environments (e.g., in Grid computing, Cloud computing, and networks with smartphones), transactions may become distributed. A distributed transaction means that the transaction spans processes, and may span computers and geographical sites. This generates a need in effective distributed concurrency control mechanisms. Achieving the Serializability property of a distributed system's schedule (see Distributed serializability and Global serializability (Modular serializability)) effectively poses special challenges typically not met by most of the regular serializability mechanisms, originally designed to operate locally. This is especially due to a need in costly distribution of concurrency control information amid communication and computer latency. The only known general effective technique for distribution is Commitment ordering (Commit ordering, CO; Raz 1992), which was disclosed publicly in 1991 (after being patented). CO does not require the distribution of

concurrency control information and provides a general effective solution (reliable, high-performance, and scalable) for both distributed and global serializability, also in a heterogeneous environment with database systems (or other transactional objects) with different (any) concurrency control mechanisms. CO is indifferent to which mechanism is utilized, since it does not interfere with any transaction operation scheduling (which most mechanisms control), and only determines the order of commit events. Thus, CO enables the efficient distribution of all other mechanisms, and also the distribution of a mix of different (any) local mechanisms, for achieving distributed and global serializability. The existence of such a solution has been considered "unlikely" until 1991, and by many experts also later, due to misunderstanding of the CO solution (see Quotations in Global serializability). An important side-benefit of CO is automatic distributed deadlock resolution. Contrary to CO, virtually all other techniques (without CO) are prone todistributed deadlocks (also called global deadlocks) which need special handling. CO is also the name of the resulting schedule property: A schedule has the CO property if the chronological order of its transactions' commit events is compatible with the respective transactions'precedence (partial) order. SS2PL mentioned above is a variant (special case) of CO and thus also effective to achieve distributed and global serializability. It also provides automatic distributed deadlock resolution (a fact overlooked in the research literature even after CO's publication), as well as Strictness and thus Recoverability. Possessing these desired properties together with known efficient locking based implementations explains SS2PL's popularity. SS2PL has been utilized to efficiently achieve Distributed and Global serializability since the 1980, and has become the de-facto standard for it. However, SS2PL is blocking and constraining (pessimistic), and with the proliferation of distribution and utilization of systems different from traditional database systems (e.g., as in Cloud computing), less constraining types of CO (e.g.,Optimistic CO) may be needed for better performance. Comments: 1. The Distributed conflict serializability property in its general form is difficult to achieve efficiently, but it is achieved efficiently via its special case Distributed CO: Each local component (e.g., a local DBMS) needs both to provide some form of CO, and enforce a special voting strategy for the Twophase commit protocol (2PC: utilized to commit distributed transactions).

Unlike Serializability,Distributed recoverability and Distributed strictness can be achieved efficiently in a straightforward way (Raz 1992, page 307), similarly to the way Distributed CO is achieved (applied locally with similar voting strategies). Differently from the general Distributed CO,Distributed SS2PL exists automatically when all local components are SS2PL based (in each component CO exists, implied, and the voting strategy is now met automatically). This fact has been known and utilized since the 1980s (i.e., that SS2PL exists globally, without knowing about CO) for efficient Distributed SS2PL, which implies Distributed serializability and strictness (e.g., see Raz 1992, page 293; it is also implied in Bernstein et al. 1987, page 78). Less constrained Distributed serializability and strictness can be efficiently achieved by Distributed Strict CO (SCO), or by a mix of SS2PL based and SCO based local components. 2. About the references and Commitment ordering: (Bernstein et al. 1987) was published before the discovery of CO in 1990. CO is described in (Weikum and Vossen 2001, pages 102, 700), but the description is partial and misses CO's essence. (Raz 1992) was the first refereed and accepted for publication article about CO. Other CO articles followed. The Online Certificate Status Protocol (OCSP) is an Internet protocol used for obtaining the revocation status of an X.509 digital certificate. It is described in RFC 2560 and is on the Internet standards track. It was created as an alternative to certificate revocation lists (CRL), specifically addressing certain problems associated with using CRLs in a public key infrastructure (PKI). Messages communicated via OCSP are encoded in ASN.1 and are usually communicated over HTTP. The "request/response" nature of these messages leads to OCSP serversbeing termed OCSP responders.

Comparison to CRLs Since an OCSP response contains less information than a typical CRL (certificate revocation list), OCSP can feasibly provide more timely information regarding the revocation status of a certificate without burdening the network. However, the greater number of requests and connection overhead may overwhelm this benefit if the client does not cache responses.

Using OCSP, clients do not need to parse CRLs themselves, saving client-side complexity. However, this is balanced by the practical need to maintain a cache. In practice, such considerations are of little consequence, since most applications rely on thirdparty librariesfor all X.509 functions.

CRLs may be seen as analogous to a credit card company's "bad customer list" an unnecessary public exposure.

OCSP discloses to the requester that a particular network host used a particular certificate at a particular time. OCSP does not mandate encryption, so this information also may be intercepted by other parties.

Basic PKI implementation 1. Alice and Bob have public key certificates issued by Ivan, the Certificate Authority (CA). 2. Alice wishes to perform a transaction with Bob and sends him her public key certificate. 3. Bob, concerned that Alice's private key may have been compromised, creates an 'OCSP request' that contains Alice's certificate serial number and sends it to Ivan. 4. Ivan's OCSP responder reads the certificate serial number from Bob's request. The OCSP responder uses the certificate serial number to look up the revocation status of Alice's certificate. The OCSP responder looks in a CA database that Ivan maintains. In this scenario, Ivan's CA database is the only trusted location where a compromise to Alice's certificate would be recorded. 5. Ivan's OCSP responder confirms that Alice's certificate is still OK, and returns a signed, successful 'OCSP response' to Bob. 6. Bob cryptographically verifies Ivan's signed response. Bob has stored Ivan's public key sometime before this transaction. Bob uses Ivan's public key to verify Ivan's response. 7. Bob completes the transaction with Alice.

Protocol details An OCSP responder may return a signed response signifying that the certificate specified in the request is 'good', 'revoked' or 'unknown'. If it cannot process the request, it may return an error code.

The OCSP request format supports additional extensions. This enables extensive customization to a particular PKI scheme. OCSP can be resistant to replay attacks, where a signed, 'good' response is captured by a malicious intermediary and replayed to the client at a later date after the subject certificate may have been revoked. OCSP overcomes this by allowing a nonce to be included in the request that must be included in the corresponding response. However, the replay attack, while a possibility, is not a major threat to validation systems. This is due to the steps it takes to actually exploit this weakness. The attacker would have to be in a position to 1. 2. capture the traffic and subsequently replay that traffic, capture the status of a certificate whose status is about to change and

3. conduct a transaction requiring the status of that certificate within the time frame of the validity of the response. Since it is not common for revoked certificates to be unrevoked (only if it is suspended is it even possible) a person would have to capture a good response and wait until the certificate was revoked then replay it. OCSP can support more than one level of CA. OCSP requests may be chained between peer responders to query the issuing CA appropriate for the subject certificate, with responders validating each other's responses against the root CA using their own OCSP requests. An OCSP responder may be queried for revocation information by delegated path validation (DPV) servers. OCSP does not, by itself, perform any DPV of supplied certificates. The key that signs a response need not be the same key that signed the certificate. The certificate's issuer may delegate another authority to be the OCSP responder. In this case, the responder's certificate (the one that is used to sign the response) must be issued by the issuer of the certificate in question, and must include a certain extension that marks it as an OCSP signing authority (more precisely, an extended key usage extension with the OID {iso(1) identifiedorganization(3) dod(6) internet(1) security(5) mechanisms(5) pkix(7) keyPurpose(3) ocspSigning(9)})