
Comparison of Oracle, MySQL and PostgreSQL DBMS
in the context of ALICE needs
Wiktor Peryt, Warsaw University of Technology, Faculty of Physics

We took the following approach: first, we determined which DBMS features are important from the point of view of such a large experiment.

We chose the following features:

Elementary features
- basic data types
- SQL language features
- declarative integrity constraints
- programming abstractions
- automatic generation of identifiers
- national characters support

Transactions and multi-user access
- transactions
- locks
- multi-user access

Programming in database
- stored procedures
- triggers

Elements of database administration
- access control
- backup copies
- data migration

Portability and scalability
- portability of DBMS
- scalability

Performance and VLDB (Very Large Databases)
- query optimization
- structures supporting query optimization
- support for analytical processing
- allocation of disk space
- data size limits
- VLDB implementations

Distributed databases
- access to multiple databases
- heterogeneous systems support

Special data types
- large objects in database
- post-relational extensions
- support for special data types

Application development and interfaces
- embedded SQL
- standard interfaces, additional interfaces
- interoperability with Web technology
- XML, CASE


Reliability
- failure recovery

Commercial issues
- prices
- technical support available
- market position

Having completed step one, we carried out the subsequent work in 3 subgroups, each dealing with one DBMS. The members of each subgroup had their own practical experience with the DBMS investigated by that subgroup. This procedure gave us the possibility of verifying the information contained in manuals and other available documentation (for instance, on the Internet). As a result, 3 extended documents devoted to Oracle, MySQL and PostgreSQL were created.

Project team (undergraduate students, Ph.D. students and staff of the Faculty of Electronics and Information Technology, the Faculty of Mathematics and Information Sciences, and the Faculty of Physics):

Konrad Bohuszewicz
Maciej Czyzowicz
Michal Janik
Dawid Jarosz
Piotr Mazan
Marcin Mierzejewski
Mikolaj Olszewski
Wiktor S. Peryt
Sylwester Radomski
Piotr Szarwas
Tomasz Traczyk
Dominik Tukendorf
Jacek Wojcieszuk

About the Comparison

- the comparison was discussed by all people involved in this task
- the compilation was made by Dr. Tomasz Traczyk
- the compilation circulated within the whole group a few times to make sure we avoided omissions or mistakes
- this version of the document is accepted by all co-authors
- we consider it a quite comprehensive and objective comparison
- it also contains "weights", which we call "importance", differentiated for the Central database and for Lab-participants
- the Central database should be a kind of data warehouse at CERN, containing all the data, including data transferred periodically from Lab-participants
- the term "Lab-participants" denotes the smaller databases in labs involved in the ALICE experiment preparation
- a few explanations of terminology used in the database domain are also included to make this document easy to comprehend for non-specialists

Summary

Importance and assessment grades range from A (highest/best) to D (lowest/worst).

Problem (by category)                    Importance:        Assessment:
                                         Central  Labs      MySQL  Oracle8  PostgreSQL
Elementary features
  Basic data types                          C       C         B       C         A
  SQL                                       B       B         C       B         B
  Declarative constraints                   B       B         C       A         A
  Programming abstractions                  A       C         D       A         C
  Generation of ids                         C       C         C       A         A
  National chars                            B       C         B       A         B
Transactions
  Transactions                              A       C         D       A         A
  Locks                                     A       C         D       A         A
  Multiuser access                          A       D         C       A         C
Programming in DB
  Stored procedures and triggers            B       C         D       A         A
Administration
  Access control                            B       D         A       A         B
  Backup                                    A       C         C       A         C
  Data migration                            C       C         A       B         A
Portability and scalability
  Portability                               B       C         B       A         B
  Scalability                               A       C         B       A         C
Performance and VLDB
  Query optimization                        A       C         B       A         B
  Structures supporting optimization        B       D         D       A         B
  Support for OLAP                          B       D         D       A         D
  Allocation of the disk space              A       C         C       A         C
  Size limits                               A       B         B       A         C
  VLDB implementations                      A       C         D       A         B
Distributed databases
  Access to multiple databases              C       D         C       A         C
  Heterogeneous systems support             B       D         D       B         D
Special data types
  Large objects                             B       B         B       A         C
  Post-relational extensions                C       C         D       A         B
  Support for special data types            C       C         D       A         C
Application development and interfaces
  Embedded SQL                              C       C         D       A         B
  Standard interfaces                       B       C         B       A         B
  Additional interfaces                     A       A         A       A         A
  Web technology                            A       A         B       A         B
  XML                                       B       C         D       A         D
  CASE                                      B       C         D       A         D
Reliability
  Recovery                                  A       B         C       A         C
Commercial issues
  Prices                                    C       A         A       D         A
  Technical support                         A       B         C       B         D
  Position on the market                    A       C         D       A         D

Our preliminary conclusions:

For the Central Data Repository for ALICE at CERN:
- only Oracle can be taken into account seriously

For Lab-participants (mainly for production phase databases):
- Oracle is also the best, but using MySQL or PostgreSQL is possible
- the choice between them is not obvious at the moment

Some extended tests of MySQL and PostgreSQL performance, stability etc. with real data for the STAR SSD are still in progress in Warsaw. They will be published in 1-2 weeks on the website:

http://ITS_DB_ALICE.if.pw.edu.pl

The full document, Comparison of Oracle, MySQL and PostgreSQL DBMS, will be available in the same place.

Questions for ALICE

- How to start with databases for ALICE, and how to manage the project?
- General concept of the system architecture
- Databases in the production phase
- Recommended software technologies
- DBMS platform choice
- How to proceed?

Database types for ALICE

The following main categories of information should go into databases:
- production and assembly phase measurements and descriptive data -> ProdPhase database
- calibration data -> Calibration database
- configuration data -> Configuration database
- detector condition data -> Condition database
- run log data -> RunLog database
- geometry data (?) -> Geometry database, or part of the Calibration DB (?)
- some others? ... to be defined later, during "phase one" work

Database contents (1)

ProdPhase database
- all information coming from test-beds, from manufacturers and assembly processes, on object flow between manufacturers and labs, etc.

RunLog database
- stores the summary information describing the contents of an experimental run and points to the locations where the detailed information associated with the run is stored

An example of a Web-based interface developed for STAR by Sylwester Radomski (undergraduate student, Faculty of Physics, WUT) can be seen at http://www.star.bnl.gov -> Computing, first item in the "New" table.

Why a database in the production phase?

The environment in which the archive facility operates is composed of many sources of information. We have to deal with data:
- produced by various test-bench systems
- entered manually by operators
- submitted by collaborating institutes and companies

Usually there are a number of distinct data formats, and files are stored in many locations.

Consequently, without a database it is not only hard to locate the right piece of information, but also to ensure the safety and good quality of the data.

Goals for the production phase database

- secure archiving of all the test results in a repository
- easy availability of information on the location of objects (in the geographic sense: manufacturers, labs), which makes arranging the assembly easier
- the possibility of automatic assignment of quality attributes according to well-defined criteria
- statistical analysis of quality that can be done easily and at any time
- preparing data for future on-line use by slow control, DCS and DAQ
- easy access to all data during the production and assembly phase
- in the future: easy access to all data during the experiment run

Production phase DB

Basic requirements:
- data should be stored in a central repository to make management and maintenance easy and reliable
- access to the data should be assured for everybody who participates in tests during the production phase, i.e. software allowing the use of Web browsers is necessary
- object registration should be possible both manually (by an operator with suitable privileges) and automatically, for example from a LabVIEW application or other software (see the sketch after this list)
- the software should allow even inexperienced users to create (SQL) queries to the database
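To make the "automatic registration" requirement concrete, here is a minimal sketch of how a test application could register a measured object through ROOT's TSQLServer interface. The connection string, table name (tested_objects) and column names are hypothetical, not part of any existing ALICE schema.

    // Sketch: automatic object registration from a test-bench application.
    // Host, database, credentials and table layout are assumptions.
    #include <TSQLServer.h>
    #include <TSQLResult.h>
    #include <TString.h>

    void RegisterObject(const char* objectId, const char* labName, double result)
    {
       // Connect to the production-phase database (a MySQL server here).
       TSQLServer* db = TSQLServer::Connect("mysql://dbhost/prod_db",
                                            "writer", "secret");
       if (!db) return;  // connection failed

       // Insert one row describing the tested object; a LabVIEW wrapper or
       // an operator GUI would supply the arguments.
       TString sql;
       sql.Form("INSERT INTO tested_objects (object_id, lab, test_result) "
                "VALUES ('%s', '%s', %f)", objectId, labName, result);
       delete db->Query(sql.Data());  // free the returned TSQLResult, if any
       db->Close();
       delete db;
    }

The same call could equally well target a PostgreSQL or Oracle server: only the connection string changes, which is one reason the TSQLServer abstraction fits the multi-DBMS situation described in this document.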

From the point of view of domain experts ... (1)

- there is an ever-increasing demand for centralized storage of data and for consistent, easy-to-use search and retrieval facilities
- experts want to be able to retrieve and analyze the information in a user-friendly way, regardless of its origin
- they do not want to be forced to perform several queries just because the data in question was taken by different data acquisition systems

From the point of view of domain experts ... (2)

- they wish to do statistics on data sets spanning months (and more) without having to browse tens of subdirectories on backup storage devices
- they usually prefer to use industry-standard, versatile software tools to process and analyze data
- they certainly would not mind being able to automate their routine, everyday tasks

Their task is to look at the information, not to look for it.

Requirements addressed to software developers

- we should address those issues by providing a modular framework for archiving and for platform-independent retrieval of data in a heterogeneous, distributed computing environment
- our database system must be open enough to follow the inevitable evolution of the information gathering systems related to the development of the particular detectors
- we should be able to keep pace with fast-evolving Internet technologies in order to take full advantage of the facilities they provide

DB for STAR - software technologies used

- PHP4 software running on the server side
- C/C++ for the API
- JAVA + Swing & JDBC for applications requiring more interactivity (JDBC = Java DataBase Connectivity)

This seems to be the right choice of tools for client-side software development.

"

On-the-fly plot creation ...

From SQL query to plot ...

Generation of plots and histograms from the database and publication on the Web. Attempt made by S. Radomski.

Data chain:
- the HTTP server (Apache + Tomcat) calls a servlet (dbPlot) with an SQL query as a parameter
- the servlet connects through a socket to a ROOT-based server and sends the query; "ROOT server" means a ROOT script which handles connections and drives the dbPlot class
- dbPlot::Init() reuses an existing connection to the database, or creates a new one if no old connection exists
- dbPlot::TakeData() sends the query to the DB and fetches the data using the TSQLServer class
- dbPlot::TakeData() then takes the data from the TSQLResult and puts it into a TNtuple (a sketch follows this list); this function can recognise and parse the 'private' format of data stored in BLOBs
- dbPlot::PlotData() calls TTree->Draw() with the proper parameters
- dbPlot::Style() sets colors and labels
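As an illustration of the TakeData() step described above, the following is a hedged sketch, not the actual dbPlot code, of moving rows from a TSQLResult into a TNtuple with ROOT; a single numeric result column is assumed.

    // Sketch of the query -> TSQLResult -> TNtuple step used by dbPlot.
    #include <cstdlib>
    #include <TSQLServer.h>
    #include <TSQLResult.h>
    #include <TSQLRow.h>
    #include <TNtuple.h>

    TNtuple* TakeData(TSQLServer* db, const char* query)
    {
       TSQLResult* res = db->Query(query);
       if (!res) return nullptr;

       TNtuple* nt = new TNtuple("nt", "query result", "x");
       while (TSQLRow* row = res->Next()) {
          nt->Fill(atof(row->GetField(0)));  // fields come back as strings
          delete row;
       }
       delete res;
       return nt;  // the caller can now call nt->Draw("x")
    }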

Performance and problems ...

- one histogram takes about 1-2 s; the slowest element in the chain is convert
- convert makes use of GhostScript; creating PostScript and then converting it to PNG is overcomplicated, and a rather simple TGraph with ~758 lines takes ~10 s
- ROOT cannot generate GIF in -b (batch) mode
- problems with memory deallocation in ROOT: after about 100 plots ROOT crashes
- modification of Draw() in TTree: for 1-D histograms Draw() always makes 100 bins; if the data has its own grid (measurement precision), plots look terrible, especially when histogramming integers; a small modification in TTreePlayer makes it possible to recognize whether the data are gridded and to set the number of bins accordingly (an alternative workaround is sketched below)
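The binning problem can also be worked around without patching TTreePlayer, by steering the histogram explicitly with TTree::Draw()'s ">>hname(nbins,min,max)" syntax; the branch name "adc" and the range below are placeholders.

    // Given a TTree* tree with an integer-valued branch "adc" (placeholder
    // name), request one bin per value 0..63 instead of the default 100 bins:
    tree->Draw("adc>>h(64,-0.5,63.5)");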

Typical architecture for a local site/lab, i.e. a Lab-participant

- measurements are performed on a dedicated computer
- data are transferred over the local Ethernet network to the database
- users can access the measurements by means of Java applets or PHP applications
- the graphical user interface makes the construction of complex queries easy, even for users with no database experience
- another capability of this applet is the visualisation of selected data
- clearly, using Java, JDBC and PHP allows access to the database over the Internet or a local network with the user's favourite browser

[Figure: local-lab architecture. Java applets and PHP applications talk to the DB server (daemon); ROOT or AliROOT and LabVIEW applications feed the repository from the DUT (device under test).]

Production phase database for ALICE


[Figure: production phase database for ALICE. Labs 1..n, somewhere in Europe, each run a local MySQL server (daemon) with a MySQL repository, fed from the DUT by ROOT or AliROOT and LabVIEW applications and accessed through Java applets and PHP applications; data flows to CERN, where a Data Archive Server Library and an ORACLE server (daemon), together with AliROOT, provide application services and data services over the central DATA repository (DBMS).]

Central data warehouse at CERN


Three-tier architecture.

[Figure: client layer modules and applications (filters, a generic data loader, a custom data loader, and interactive software: WWW browsers, Java applets, PHP, HTML, command line utilities etc.) sit on top of the Data Archive Server Library and the ORACLE server (with AliROOT), which provide application services and data services over the DATA repository (DBMS).]

One can easily distinguish the three logical tiers, in line with present tendencies: the client layer, the application services layer and the data services layer. Each layer contains several components (not all are shown in the picture):
- the top level is the layer containing client applications, responsible for transferring data into the database and for visualisation
- the middle layer is composed of application services; this layer knows the logical structure and physical locations of the data
- the bottom layer contains the data and the Database Management System

How to start with databases for ALICE?

- the project should be managed in a few phases
- the project is large, so I strongly suggest applying a methodology proven in a "commercial environment"

How to manage the project? (1)

Phase 1: strategy (or planning) for the WHOLE project:
- determination of the scope of the project
- partitioning into subsystems (the natural way: by subdetectors, but not only)
- creation of a list of actors/participants
- a time schedule for the particular tasks
- formulation of general models which could be applied
- an approximate initial choice of software technologies

How to manage the project? (2)

Successive phases should be performed in a "spiral cycle". This simply means that the particular subsystems are elaborated successively. Each subsystem must go through the following phases:
- analysis/conceptual design
- software design
- development (programming)
- implementation

Improvements and corrections in earlier completed subsystems must be continued during the work on successive subsystems. Simultaneous work on several subsystems is a good practice.

How to manage the project? (3)

- work on a pilot project, in parallel to the main one and using the same software technology; it should contain the most urgent things
- efficiency tests: creation of "simulated data" with volumes similar to the expected ones
- creation of "conceptual models" during the analysis phase is necessary before the design of subsystems; an appropriate formalism and CASE tools are needed for that; for Linux, UML (Unified Modelling Language) is an appropriate option
- elaboration, for the whole project, of such standards as: the system of keys, terminology, security, access rights, etc.

First steps ...

- start to formally organize a central database group for ALICE
- after that: begin phase 1 of the project, i.e. the strategy for the WHOLE project/experiment

Partial, highest-priority tasks for this group:
- determination of the scope of the project
- formulation of general models which could be applied
- creation of a list of actors/participants (including 1-2 representatives from each subdetector!)
- initial choice of software technologies which could be used
- partitioning into subsystems
- analysis/conceptual design
