
Name: Pankaj L. Chowkekar

Application ID: 6808

Class: S.Y. M.C.A. (Sem 4)

Academic Year: 2017-2018

Subject: Advanced Database Techniques

Group: A

Q. 1) Write short note on the following (attempt any Four)

a) KDD process
Ans :

Meaning
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding
knowledge in data, and emphasizes the "high-level" application of particular data mining methods. It is
of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial
intelligence, knowledge acquisition for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.

It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge,
according to the specifications of measures and thresholds, using a database along with any required
preprocessing, subsampling, and transformations of that database.

An Outline of the Steps of the KDD Process

The overall process of finding and interpreting patterns from data involves the repeated application of
the following steps:
a) Developing an understanding of
i) the application domain
ii) the relevant prior knowledge
iii) the goals of the end-user

b) Creating a target data set: selecting a data set, or focusing on a subset of variables, or data
samples, on which discovery is to be performed.
c) Data cleaning and preprocessing.
i) Removal of noise or outliers.
ii) Collecting necessary information to model or account for noise.
iii) Strategies for handling missing data fields.
iv) Accounting for time sequence information and known changes.
d) Data reduction and projection.
i) Finding useful features to represent the data depending on the goal of the task.
ii) Using dimensionality reduction or transformation methods to reduce the effective
number of variables under consideration or to find invariant representations for the
data.
e) Choosing the data mining task.
i) Deciding whether the goal of the KDD process is classification, regression, clustering,
etc.
f) Choosing the data mining algorithm(s).
i) Selecting method(s) to be used for searching for patterns in the data.
ii) Deciding which models and parameters may be appropriate.
iii) Matching a particular data mining method with the overall criteria of the KDD process.
g) Data mining.
i) Searching for patterns of interest in a particular representational form or a set of such
representations as classification rules or trees, regression, clustering, and so forth.
h) Interpreting mined patterns.
i) Consolidating discovered knowledge.
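To make these steps concrete, here is a minimal sketch of the pipeline on a toy numeric dataset using scikit-learn; the choice of clustering as the mining task, the random data, and the specific preprocessing and reduction methods are assumptions made purely for illustration.

```python
# A minimal sketch of the KDD steps on a toy dataset (illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler   # step (c): cleaning/preprocessing
from sklearn.decomposition import PCA              # step (d): data reduction/projection
from sklearn.cluster import KMeans                 # steps (e)-(f): task and algorithm choice

# Step (b): create/select a target data set (here, random samples x variables)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_clean = StandardScaler().fit_transform(X)              # step (c): normalise the variables
X_reduced = PCA(n_components=2).fit_transform(X_clean)   # step (d): reduce dimensionality

model = KMeans(n_clusters=3, n_init=10).fit(X_reduced)   # step (g): data mining
print(model.cluster_centers_)                            # step (h): interpret mined patterns
```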

b) Distributed catalog management


Ans :

Efficient catalog management in distributed databases is critical to ensure satisfactory performance related to site autonomy, view management, and data distribution and replication. Catalogs are databases themselves, containing metadata about the distributed database system.

Three popular management schemes for distributed catalogs are centralized catalogs, fully replicated catalogs, and partially replicated catalogs. The choice of scheme depends on the database itself as well as on the access patterns of the applications to the underlying data.

Centralized Catalogs. In this scheme, the entire catalog is stored in one single site. Owing to its central
nature, it is easy to implement. On the other hand, the advantages of reliability, availability, autonomy,
and distribution of processing load are adversely impacted. For read operations from noncentral sites,
the requested catalog data is locked at the central site and is then sent to the requesting site. On
completion of the read operation, an acknowledgement is sent to the central site, which in turn unlocks
this data. All update operations must be processed through the central site. This can quickly become a
performance bottleneck for write-intensive applications.

Fully Replicated Catalogs. In this scheme, identical copies of the complete catalog are present at each
site. This scheme facilitates faster reads by allowing them to be answered locally. However, all updates
must be broadcast to all sites. Updates are treated as transactions and a centralized two-phase commit
scheme is employed to ensure catalog consistency. As with the centralized scheme, write-intensive
applications may cause increased network traffic due to the broadcast associated with the writes.

Partially Replicated Catalogs. The centralized and fully replicated schemes restrict site autonomy since
they must ensure a consistent global view of the catalog. Under the partially replicated scheme, each
site maintains complete catalog information on data stored locally at that site. Each site is also
permitted to cache entries retrieved from remote sites. However, there are no guarantees that these
cached copies will be the most recent and updated. The system tracks catalog entries for sites where the
object was created and for sites that contain copies of this object. Any changes to copies are propagated
immediately to the original (birth) site. Retrieving updated copies to replace stale data may be delayed
until an access to this data occurs. In general, fragments of relations across sites should be uniquely
accessible. Also, to ensure data distribution transparency, users should be allowed to create synonyms
for remote objects and use these synonyms for subsequent referrals.
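As a rough illustration of the partially replicated scheme, the sketch below models each site with a local catalog for objects born at that site plus a cache of remote entries, and propagates updates to the birth site. The class and method names (CatalogSite, lookup, update) are hypothetical and not part of any real DDBMS.

```python
# Illustrative sketch of a partially replicated catalog (hypothetical API).

class CatalogSite:
    def __init__(self, name):
        self.name = name
        self.local = {}   # entries for objects created (born) at this site
        self.cache = {}   # possibly stale copies of remote entries

    def create(self, obj, meta):
        # This site becomes the birth site of the object.
        self.local[obj] = meta

    def lookup(self, obj, remote_sites=()):
        if obj in self.local:
            return self.local[obj]
        if obj in self.cache:                 # may be stale; no freshness guarantee
            return self.cache[obj]
        for site in remote_sites:             # fetch from the birth site and cache it
            if obj in site.local:
                self.cache[obj] = site.local[obj]
                return self.cache[obj]
        return None

    def update(self, obj, meta, birth_site):
        # Changes to a copy are propagated immediately to the birth site.
        birth_site.local[obj] = meta
        self.cache[obj] = meta

site_a, site_b = CatalogSite("A"), CatalogSite("B")
site_a.create("EMP", {"columns": ["id", "name"]})
print(site_b.lookup("EMP", remote_sites=[site_a]))   # cached copy at site B
```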

c) Search engines
Ans :

Introduction
A search engine maintains a huge database of internet resources such as web pages, newsgroups, programs, images, etc. It helps to locate information on the World Wide Web.
Users can search for any information by passing a query in the form of keywords or a phrase. The engine then searches for relevant information in its database and returns it to the user.

Search Engine Components


Generally there are three basic components of a search engine as listed below:
1. Web Crawler
2. Database
3. Search Interfaces

Web crawler
It is also known as a spider or bot. It is a software component that traverses the web to gather information.

Database
All the information gathered from the web is stored in a database, which consists of huge web resources.

Search Interfaces
This component is the interface between the user and the database. It helps the user to search through the database.

Search Engine Working


The web crawler, the database and the search interface are the major components that actually make a search engine work. Search engines make use of the Boolean operators AND, OR and NOT to restrict or widen the results of a search. The following steps are performed by the search engine:
● The search engine looks for the keyword in the index of its predefined database instead of going directly to the web to search for the keyword.
● It then uses software to search for the information in the database. This software component is known as the web crawler.
● Once the web crawler finds the pages, the search engine shows the relevant web pages as a result. These retrieved web pages generally include the title of the page, the size of the text portion, the first several sentences, etc.
These search criteria may vary from one search engine to another. The retrieved information is ranked according to various factors such as frequency of keywords, relevancy of information, links, etc.
● The user can click on any of the search results to open it.

Search Engine Processing


Indexing Process
The indexing process comprises the following three tasks:
● Text acquisition
● Text transformation
● Index creation

TEXT ACQUISITION
It identifies and stores documents for indexing.

TEXT TRANSFORMATION
It transforms documents into index terms or features.

INDEX CREATION
It takes the index terms created by the text transformations and creates data structures to support fast searching.

Query Process
The query process comprises the following three tasks:
● User interaction
● Ranking
● Evaluation

USER INTERACTION
It supports the creation and refinement of the user query and displays the results.

RANKING
It uses the query and the indexes to create a ranked list of documents.

EVALUATION
It monitors and measures effectiveness and efficiency. It is done offline.
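To make the indexing and query stages concrete, here is a toy sketch: documents are transformed into terms, an inverted index is created, and results are ranked by a simple summed term frequency. The documents and the scoring rule are made up for illustration and are far simpler than what production engines use.

```python
# Toy inverted index and ranked search (illustrative only).
from collections import defaultdict

docs = {
    1: "data mining extracts knowledge from data",
    2: "search engines index web pages",
    3: "web crawlers gather pages for the index",
}

# Text transformation + index creation: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query):
    # Ranking: score each document by the summed frequency of the query terms
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("web index"))   # e.g. [(2, 2), (3, 2)]
```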

d) Neural network
Ans:

Meaning
The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a neural network as: "...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs" (in "Neural Network Primer: Part I" by Maureen Caudill, AI Expert, Feb. 1989).

ANNs are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the mammalian cerebral cortex, but on much smaller scales.

A large ANN might have hundreds or thousands of processor units, whereas a mammalian brain has billions of neurons, with a corresponding increase in the magnitude of their overall interaction and emergent behavior. Although ANN researchers are generally not concerned with whether their networks accurately resemble biological systems, some have pursued that goal.

For example, researchers have accurately simulated the function of the retina and modeled the eye rather well. Although the mathematics involved with neural networking is not a trivial matter, a user can rather easily gain at least an operational understanding of their structure and function.

The Basics Of Neural Network


Neural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes', each of which contains an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates with one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is produced.

Most ANNs contain some form of 'learning rule' which modifies the weights of the connections according to the input patterns that the network is presented with. In a sense, ANNs learn by example as do their biological counterparts; a child learns to recognize dogs from examples of dogs.

Although there are many different kinds of learning rules used by neural networks, this discussion is concerned with only one: the delta rule. The delta rule is often utilized by the most common class of ANNs, called 'backpropagational neural networks' (BPNNs). Backpropagation is an abbreviation for the backwards propagation of error.
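A minimal sketch of the delta rule for a single linear unit is shown below on a tiny made-up dataset; the learning rate, the target function and the variable names are assumptions chosen only to illustrate the weight-update idea (full backpropagation over hidden layers is not shown).

```python
# Delta rule for one linear unit on toy data (illustrative only).
import random

# Toy data: learn y = 2*x1 + 1*x2 (no bias term, for brevity)
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0)]

weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
lr = 0.1  # learning rate

for epoch in range(200):
    for x, target in data:
        output = sum(w * xi for w, xi in zip(weights, x))
        error = target - output
        # Delta rule: adjust each weight in proportion to the error and its input
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]

print(weights)  # converges towards [2.0, 1.0]
```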

What Applications Should Neural Networks Be Used For?

Neural networks are universal approximators, and they work best if the system you are using them to model has a high tolerance to error. One would therefore not be advised to use a neural network to balance one's cheque book! However, they work very well for: capturing associations or discovering regularities within a set of patterns; problems where the volume, number of variables or diversity of the data is very great; problems where the relationships between variables are vaguely understood; or problems where the relationships are difficult to describe adequately with conventional approaches.

e) OODBMS
Ans:

Meaning
Object oriented databases or object databases incorporate the object data model to define data
structures on which database operations such as Create, View, Update and Delete can be performed.
They store objects rather than data such as integers and strings.

The relationship between various data is implicit to the object and manifests as object attributes and
methods. Object Oriented Database Management Systems (OODBMSs) actually extend the object
programming language with the database concepts like transparently persistent data, concurrency
control, data recovery, associative queries, and other database capabilities.
Simply, an Object Oriented Database System should satisfy two criteria: it should be a DBMS, and it
should be an object-oriented system.

Thus OODB implements OO concepts such as object identity, polymorphism, encapsulation and
inheritance to provide access to persistent objects using any Object Oriented Programming Language.
Some OODBMSs – db4o, Intersystems, Versant.

Advantages of Object Oriented DBMS (OODBMS)

● Enhanced modeling capabilities – it is easy to model real-world objects closely, as in Object Oriented Programming.

● Extensibility ​– Support for new data types - Unlike traditional DBMS products where the basic
data types are hard-coded in the DBMS and are unchangeable by the users, with an ODBMS the
user can encode any type of structure that is necessary and the ODBMS will manage that type.

● Support for long-duration transactions – processing object data involves increased complexity, hence support for long-duration transactions is needed.

● Applicability to advanced database applications – the enhanced modeling capabilities of an OODBMS make it usable for applications like computer-aided design (CAD), computer-aided software engineering (CASE), office information systems (OISs), multimedia systems, and many more.

● Improved performance​ – improved performance in the case of object based applications.

● Reusability – the code can be reused. Inheritance, method support, etc. enable the possibility of reusing code. An OODBMS can be programmed with small procedural differences without affecting the entire system.

● OODBs reduce the need for joins – the capability of navigating through object structures and the resulting path expressions in object attributes gives us a new perspective on the issue of joins in OODBs. The relational join is a mechanism that correlates two relations on the basis of the values of a corresponding pair of attributes in the relations. Since two classes in an OODB may have corresponding pairs of attributes, the relational join (or explicit join) may still be necessary in OODBs.

Q. 2) a) What are multidimensional cubes? Explain how the slice and dice operations are performed.
Ans :

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into information through fast, consistent, and interactive access to it. The following covers multidimensional cubes and the OLAP operations performed on them.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many OLAP servers use two levels of data storage representation to handle dense and sparse data sets.

A cube is a multidimensional structure that contains information for analytical purposes; the main
constituents of a cube are dimensions and measures. Dimensions define the structure of the cube that
you use to slice and dice over, and measures provide aggregated numerical values of interest to the end
user. As a logical structure, a cube allows a client application to retrieve values of measures as if they were contained in cells in the cube; cells are defined for every possible summarized value. A cell in the cube is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection.

CONCEPT HIERARCHIES

In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides users
with the flexibility to view data from different perspectives.

For example, if we have attributes such as day, temperature and humidity, we can group values into subsets and name these subsets, thus obtaining a set of concept hierarchies.

OLAP OPERATIONS

OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP data cube
operations exist to materialize different views of data, allowing interactive querying and analysis of the
data.

Below are some of the operations on dimensional data:


Slicing

Slice performs a selection on one dimension of the given cube, thus resulting in a subcube. For example, in the cube described above, if we make the selection temperature = cool, we obtain the subcube containing only the cells where the temperature is cool.

Dicing

The dice operation defines a subcube by performing a selection on two or more dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube, we get a subcube (still two-dimensional) restricted to those days and temperatures, as sketched in the example below.
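Since the original figures are not reproduced here, the small pandas sketch below mimics them on a toy day/temperature/humidity cube with a 'count' measure (all values made up); it shows slice as a selection on one dimension and dice as a selection on two or more.

```python
# Slice and dice on a toy cube using pandas (illustrative only).
import pandas as pd

cube = pd.DataFrame({
    "day":         [1, 1, 2, 3, 3, 4],
    "temperature": ["cool", "hot", "mild", "cool", "hot", "cool"],
    "humidity":    ["normal", "high", "normal", "high", "normal", "normal"],
    "count":       [2, 1, 3, 1, 2, 4],   # the measure stored in each cell
})

# Slice: selection on ONE dimension (temperature = cool) gives a subcube
slice_cool = cube[cube["temperature"] == "cool"]

# Dice: selection on TWO OR MORE dimensions
dice = cube[cube["day"].isin([3, 4]) & cube["temperature"].isin(["cool", "hot"])]

print(slice_cool)
print(dice)
```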

2) b) What is apriori property? Describe an algorithm for finding frequent item sets. Explain
application of data mining in various sectors.
Ans :

Meaning
It expresses the monotonic decrease of an evaluation criterion (such as support) as a pattern grows, and it is used to efficiently discover all frequent patterns. The Apriori property states that the values of the evaluation criterion of a pattern are smaller than or equal to those of its subpatterns; consequently, every subset of a frequent itemset must itself be frequent.

Mining Frequent itemsets – Apriori Algorithm

The Apriori algorithm is an algorithm for frequent item set mining and association rule learning over transaction databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the
database. The frequent item sets determined by Apriori can be used to determine association rules
which highlight general trends in the database.

Definition of Apriori Algorithm


● The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules.
● Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
● Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).
Key Concepts
● Frequent itemsets: the sets of items that have at least the minimum support (the set of frequent k-itemsets is denoted Lk).
● Apriori property: any subset of a frequent item set must be frequent.
● Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

Steps to perform Apriori Algorithm


1. Scan the transaction database to get the support S of each 1-itemset, compare S with min_sup, and obtain the set of frequent 1-itemsets.
2. Use a join to generate a set of candidate k-itemsets. Use the Apriori property to prune the infrequent k-itemsets from this set.
3. Scan the transaction database to get the support S of each candidate k-itemset in the resulting set, compare S with min_sup, and obtain the set of frequent k-itemsets.
4. If the candidate set is NULL, then for each frequent itemset l, generate all nonempty subsets of l.
5. For every nonempty subset s of l, output the rule "s => (l - s)" if the confidence C of the rule "s => (l - s)" is at least min_conf.
6. If the candidate set is not NULL, go to step 2.
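The compact sketch below follows these steps on a toy transaction database; min_sup is treated as an absolute count and the rule-generation steps (4-5) are omitted for brevity, so it should be read as an illustration rather than a complete implementation.

```python
# Apriori frequent-itemset mining on a toy transaction database (illustrative only).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
min_sup = 2   # absolute minimum support count

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
frequent = list(L)

k = 2
while L:
    # Step 2: join L(k-1) with itself, prune candidates with an infrequent subset
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
    # Step 3: count support and keep the frequent k-itemsets
    L = [c for c in candidates if support(c) >= min_sup]
    frequent.extend(L)
    k += 1

print([set(f) for f in frequent])
```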

Applications of Data Mining in various sectors


Data Mining is primarily used today by companies with a strong consumer focus — retail, financial,
communication, and marketing organizations, to “drill down” into their transactional data and
determine pricing, customer preferences and product positioning, impact on sales, customer satisfaction
and corporate profits. With data mining, a retailer can use point-of-sale records of customer purchases
to develop products and promotions to appeal to specific customer segments.

Here is the list of 14 other important areas where data mining is widely used:
● Future Healthcare
Data mining holds great potential to improve health systems. It uses data and analytics to identify best
practices that improve care and reduce costs. Researchers use data mining approaches like
multi-dimensional databases, machine learning, soft computing, data visualization and statistics. Mining
can be used to predict the volume of patients in every category. Processes are developed that make sure
that the patients receive appropriate care at the right place and at the right time. Data mining can also
help healthcare insurers to detect fraud and abuse.

● Corporate Surveillance
Corporate surveillance is the monitoring of a person or group’s behaviour by a corporation. The data
collected is most often used for marketing purposes or sold to other corporations, but is also regularly
shared with government agencies. It can be used by the business to tailor their products desirable by
their customers. The data can be used for direct marketing purposes, such as the targeted
advertisements on Google and Yahoo, where ads are targeted to the user of the search engine by
analyzing their search history and emails.

● Fraud Detection
Billions of dollars have been lost to fraud. Traditional methods of fraud detection are time consuming and complex. Data mining aids in providing meaningful patterns and turning data into information. Any information that is valid and useful is knowledge. A perfect fraud detection system should protect the information of all users. A supervised method includes the collection of sample records. These records are classified as fraudulent or non-fraudulent. A model is built using this data, and the algorithm is trained to identify whether a record is fraudulent or not.

● Financial Banking
With computerised banking everywhere, a huge amount of data is generated with every new transaction. Data mining can contribute to solving business problems in banking and finance by finding patterns, causalities, and correlations in business information and market prices that are not immediately apparent to managers, because the volume of data is too large or is generated too quickly for experts to screen. Managers may use this information for better segmenting, targeting, acquiring, retaining and maintaining profitable customers.

● Lie Detection
Apprehending a criminal is easy whereas bringing out the truth from him is difficult. Law enforcement
can use mining techniques to investigate crimes, monitor communication of suspected terrorists. This
field also includes text mining. The process seeks to find meaningful patterns in data, which is usually unstructured text. The data samples collected from previous investigations are compared and a model for lie detection is created. With this model, processes can be created according to the necessity.

● Bioinformatics
Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich. Mining biological
data helps to extract useful knowledge from massive datasets gathered in biology, and in other related
life sciences areas such as medicine and neuroscience. Applications of data mining to bioinformatics
include gene finding, protein function inference, disease diagnosis, disease prognosis, disease treatment
optimization, protein and gene interaction network reconstruction, data cleansing, and protein
subcellular location prediction.

Q. 3) a) Discuss how the scanning, sorting and join operations can be parallelized using data partition
technique.
Ans :

Relational operations work on relations containing large sets of tuples, so we can parallelize the operations by executing them in parallel on different subsets of each relation.
The number of tuples in a relation can be large, so the degree of parallelism is potentially enormous. Hence, we can say that intraoperation parallelism is natural in a database system.
The parallel versions of some common relational operations are as follows:

1) Parallel Sort
For example, suppose we want to sort a relation that resides on n disks D0, D1, ..., Dn-1.
If the relation is range-partitioned on the sort attributes, then each partition can be sorted separately and the results concatenated to get the fully sorted relation.
Because the tuples are partitioned across the n disks, the time required to read the entire relation is reduced by the parallel access.

If the relation is partitioned in any other way, it can be sorted in either of the following ways:
1. Range-partition it on the sort attributes and then sort each partition separately.
2. Use the parallel version of the external sort-merge algorithm.

Range-partitioning sort
It basically works in two steps: first, range-partition the relation, and second, sort each partition separately.
When we sort the relation, it is not necessary to range-partition it on the same set of processors or disks as those on which the relation is stored.
The range-partitioning should be done with a good range-partition vector so that each partition has approximately the same number of tuples.

Parallel External Sort-Merge


It is an alternative to range partitioning.
Suppose a relation has already been partitioned among the disks D0,D1,....Dn-1.
The parallel sort-merge will work in the following manner:
1. Each processor Pi will locally sort the data on the disk Di.
2. To get the final sorted output the system merges the sorted runs on each processor.
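An illustrative sketch of this idea: each "processor" sorts its own partition concurrently via a process pool, and the sorted runs are then merged into the final output. The toy partitions and the use of Python's multiprocessing are assumptions for the example.

```python
# Parallel external sort-merge, simplified (illustrative only).
import heapq
from multiprocessing import Pool

partitions = [          # tuples already spread across n "disks"
    [42, 7, 19],
    [3, 88, 11],
    [25, 1, 60],
]

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        sorted_runs = pool.map(sorted, partitions)   # step 1: local sorts in parallel
    result = list(heapq.merge(*sorted_runs))         # step 2: merge the sorted runs
    print(result)
```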

2) Parallel Join
The join operation tests the pairs of tuples to see whether they satisfy the join condition and if they do
the system adds the pair to the join output.
The parallel join algorithms attempt to split the pairs that are to be tested over several processors.
Each processor then checks part of the join locally.
After this, the system collects the results from each of the processors to produce the final result.

The types of joins are:


Partitioned join
Fragment and Replicate join
Partitioned Parallel Hash join
Parallel Nested-Loop join

Other relational operators

● Selection
If the selection is on the partitioning attribute, it can be performed only at those processors whose partitions may contain matching tuples; otherwise it is performed in parallel at all processors.
● Duplicate elimination
Duplicates can be eliminated by sorting, using either of the parallel sort techniques. Duplicate elimination can also be parallelized by partitioning the tuples and eliminating duplicates locally at each processor.
● Projection
Projection without duplicate elimination can be performed as the tuples are read from disk in parallel. To eliminate duplicates, either of the above techniques can be used.
● Aggregation
The operation can be parallelized by partitioning the relation on the grouping attributes and computing the aggregate values locally at each processor. Either hash partitioning or range partitioning can be used.

The partitioning techniques are as follows:

Round Robin
● It scans the relation in any order and sends the ith tuple to disk number Di mod n.
● The scheme ensures an even distribution of tuples across disks; that is, each disk has
approximately the same number of tuples as the others.

Hash partitioning
● It is a declustering strategy that designates one or more attributes from the given relation’s
schema as the partitioning attributes.
● A hash function is chosen whose range is {0, 1, . . . , n - 1}.
● Each tuple of the original relation is hashed on the partitioning attributes.
● If i is returned by the hash function, then the tuple is placed on disk Di.

Range partitioning
● It distributes the tuples by assigning contiguous attribute-value ranges to each disk.
● It selects a partitioning attribute, A, and a partitioning vector [v0, v1, ..., vn-2], such that if i < j then vi < vj.
● The relation is partitioned as follows: consider a tuple t such that t[A] = x. If x < v0, then t goes on disk D0. If x >= vn-2, then t goes on disk Dn-1. If vi <= x < vi+1, then t goes on disk Di+1.
● For example, with three disks numbered 0, 1, and 2, we may assign tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater than 40 to disk 2.
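The small sketch below applies the three techniques for n = 3 disks to a handful of integer tuples keyed on a single partitioning attribute; the modulo operator standing in for a real hash function and the vector [5, 40] mirror the examples above.

```python
# Round-robin, hash and range partitioning across n = 3 disks (illustrative only).
n = 3
tuples = [3, 7, 12, 41, 45, 8, 27]

# Round robin: the i-th tuple goes to disk (i mod n)
round_robin = {d: [] for d in range(n)}
for i, t in enumerate(tuples):
    round_robin[i % n].append(t)

# Hash partitioning: hash the partitioning attribute into {0, ..., n-1}
hash_part = {d: [] for d in range(n)}
for t in tuples:
    hash_part[t % n].append(t)       # t % n stands in for the hash function

# Range partitioning with vector [v0, v1] = [5, 40]:
# x < 5 -> disk 0, 5 <= x < 40 -> disk 1, x >= 40 -> disk 2
vector = [5, 40]
range_part = {d: [] for d in range(n)}
for t in tuples:
    disk = sum(t >= v for v in vector)   # how many boundaries t has passed
    range_part[disk].append(t)

print(round_robin, hash_part, range_part, sep="\n")
```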

Comparison of Partitioning Techniques


● A relation can be retrieved in parallel by using all the disks once a relation has been partitioned
among several disks.
● Similarly, when a relation is being partitioned, it can be written to multiple disks in parallel.
● The transfer rates for reading or writing an entire relation are much faster with I/O parallelism than without it.
● However, reading an entire relation, or scanning a relation, is only one kind of access to data.

3) Parallel Scan: access to data can be classified as follows:


● The entire relation is scanned.
● A tuple is located associatively (example, employee name = “Pooja”); these queries, also known
as point queries, seek tuples that have a specified value for a specific attribute.
● Locating all tuples for which the value of a given attribute lies within a specified range (example,
10000 < salary < 20000); these queries are called range queries.

Q. 3) b) Explain Bell-Lapadula Model.


Ans:

The Bell-LaPadula model is a state machine model that describes a set of access control rules which use security labels on objects and clearances for subjects.

Security labels range from the most sensitive, like Top Secret, through Secret and Confidential, down to the least sensitive, like Public or Unclassified. For example, TCS associates create documents that are affixed with a classification depending on the data contained in the document.

The model stresses data confidentiality of classified objects, with the classification affixed to them. The concept of a state machine is introduced: the entities in a computer system are divided into subjects and objects, and it can be formally proven that each state transition preserves security by moving from one secure state to another secure state. For example, if a document is confidential, it is moved only between TCS and the associated party, as directed by the NDA (non-disclosure agreement).

The model also introduces the concept of a state machine with a set of allowable states in a computer system, defining a mathematical model of computation used to design both computer programs and sequential logic circuits. The state machine is deemed secure if it complies with the security policy of the company. If access to a document is allowed only to the associated person on the same project, i.e. the clearance of a subject is compared to the classification of the object, then the transition is secure. These clearance rules are said to be lattice-based.

The Bell-LaPadula model rests on three properties, namely: 1. no read-up (the simple security property), 2. no write-down (the *-property), and 3. the Discretionary Security Property.

Property 1: no read-up

This is a property which says an associate cannot read any documents prepared by his/her higher
officials. The documents are highly confidential or may be strategic and cannot be disclosed to lower
level officials for e.g., Annual Income Statement.

Property 2: no write-down
Suppose we have a log manager in the network which collects logs from all devices. Obviously this log
manager would be of great importance during network crisis. Hence the log manager would be branded
as a system HIGH. Now a network may have many processes which are supposed to be of less
importance and hence termed as system LOW, which in this case will not be able to send logs to the log
manager. Incidentally the whole picture of the network activity would be lost since we loose logs from
those processes branded as LOW processes not giving the actual picture of the network. To avoid this
we have "no write-down" property.

Property 3: the Discretionary Security Property


This is an access control based on the identity of the subjects. If an associate (subject) has a certain type of access to an object, he/she can transfer those rights to another associate (subject) of their choice.

The Bell-LaPadula confidentiality model is a multi-level security model which formally specifies a kind of MAC (Mandatory Access Control) policy. For multi-level security, the policy needs a trusted subject that transfers information from a higher-level document to a lower-level document. Obviously, the trusted subject must 1. be aligned with the security policy of the company, otherwise it ceases to be trusted, and 2. be exempt from the *-property ("no write-down") that is otherwise required to maintain system security in an automated environment.
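As a toy illustration of the no read-up and no write-down rules, the sketch below orders a few security levels and checks read and write requests against them; the level names and function names are assumptions, and the discretionary (identity-based) checks are omitted.

```python
# Simple security property and *-property checks (illustrative only).
LEVELS = ["unclassified", "confidential", "secret", "top secret"]

def rank(level):
    return LEVELS.index(level)

def can_read(subject_clearance, object_classification):
    # Simple security property: no read-up
    return rank(subject_clearance) >= rank(object_classification)

def can_write(subject_clearance, object_classification):
    # *-property: no write-down
    return rank(subject_clearance) <= rank(object_classification)

print(can_read("confidential", "secret"))   # False: reading up is denied
print(can_write("secret", "confidential"))  # False: writing down is denied
```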

Q. 4) a) Discuss Deadlock detection in distributed database. Explain centralized, hierarchical, and time
out approach.

Deadlock detection in distributed database


The deadlock detection and removal approach runs a deadlock detection algorithm periodically and
removes deadlock in case there is one. It does not check for deadlock when a transaction places a
request for a lock. When a transaction requests a lock, the lock manager checks whether it is available. If
it is available, the transaction is allowed to lock the data item; otherwise the transaction is allowed to
wait.

Deadlocks are allowed to occur and are removed if detected. The system does not perform any checks
when a transaction places a lock request. For implementation, global wait-for-graphs are created.
The existence of a cycle in the global wait-for-graph indicates a deadlock. However, it is difficult to spot deadlocks, since transactions wait for resources across the network.

Alternatively, deadlock detection algorithms can use timers. Each transaction is associated with a timer
which is set to a time period in which a transaction is expected to finish. If a transaction does not finish
within this time period, the timer goes off, indicating a possible deadlock.

Another tool used for deadlock handling is a deadlock detector. In a centralized system, there is one deadlock detector. In a distributed system, there can be more than one deadlock detector. A deadlock detector can find deadlocks for the sites under its control. There are three alternatives for deadlock detection in a distributed system, namely:

● Centralized Deadlock Detector​ − One site is designated as the central deadlock detector.

● Hierarchical Deadlock Detector​ − A number of deadlock detectors are arranged in hierarchy.



● Distributed Deadlock Detector​ − All the sites participate in detecting deadlocks and removing
them.

The timeout-based approach is primarily suited for systems with a low transaction load and where fast response to lock requests is needed.
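To make the wait-for-graph idea concrete, here is a minimal sketch that detects a cycle (and hence a possible deadlock) in a small made-up global wait-for graph; in a real distributed system the detector would first have to assemble this graph from the local graphs of the participating sites.

```python
# Deadlock detection via cycle search in a wait-for graph (illustrative only).

# transaction -> set of transactions it is waiting for
wait_for = {
    "T1": {"T2"},
    "T2": {"T3"},
    "T3": {"T1"},   # T3 waits for T1 -> cycle T1 -> T2 -> T3 -> T1
}

def has_deadlock(graph):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(t) for t in graph if t not in visited)

print(has_deadlock(wait_for))  # True
```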

Q. 4) b) Explain the relationship of data warehouse with ERP and CRM.


Ans :

Relationship of data warehouse with ERP

A sure way for a company to increase its profit is to reduce its expenses. Reducing the number of employees by automating some of the jobs previously held by them is one way of doing this. Enterprise resource planning (ERP) is management software that enables companies to actualize this automation as part of business intelligence. Data warehousing, on the other hand, is a system used to report on and analyze data as part of business intelligence.

How data warehousing improves the efficiency of ERP systems

The objectives of data warehousing create a conducive environment for ERP systems to work at their optimum levels. This can be seen from the objectives of data warehousing, which include:

Integrate data from different sources into a single database or model
For ERP systems to successfully achieve automation, company data is needed. Data warehousing therefore integrates the needed data into central storage for the ERP systems to use and achieve full automation of the set processes.

Keep data history


For the ERP systems to work, data on past trends of the company’s transactions and other operations
will be needed. Data warehousing provides the ERP systems with this data and makes the automation
process a lot easier.

Integrate data from different branches of a company to create a central view of the company state.
To automate any process in a company, an analysis of the entire company data will be crucial. Data
warehousing collects and links all of the company’s data that is then used by the ERP systems. A clear
understanding of the links between different departments of the company will ensure that the
automation process flows smoothly.

To successfully automate certain processes in a company, ERP systems rely on data warehousing to collect all the data and information needed to make it possible. Data warehousing provides centrally stored data for the ERP systems, provides the company's historical data, and provides data linking the different company departments.

Relationship of data warehouse with CRM

Customer Relationship Management (CRM) is an integral part of every organization's success story, and with the right source of information at the right time it can change the dynamics of the organization. CRM is used primarily by the business, while its underlying data, infrastructure and application are managed by IT. Most of the time, the CRM application is the most downstream application in the enterprise architecture. CRM gets data from various source systems: Point of Sale (PoS) systems (retail, online, direct channel), various data enrichment streams from big data (e.g. clickstream data, web logs, social networking likes and comments, 3rd party files, customer review comments), and common touch points (customer complaints, call-back information, prospect capture). Some CRM applications help in capturing new prospects through the direct channel and are sometimes used for updating existing customer contact details along with notes as well.

As stated, since CRM relies on data from various touch points, it is prudent to have an enterprise data warehouse as the source for the CRM application whenever possible. An isolated source of information for CRM abstracts the data away from the rest of the business, makes it less flexible, and makes life difficult from a support and maintenance point of view. Before sponsors embark on building their CRM system to gain short-term benefits using different sources of data in isolation, it is prudent to build the enterprise data warehouse first, with conformed dimensions and facts representing the individual business processes at different grains.

An enterprise data warehouse built following the Kimball methodology is a virtual data warehouse consisting of separate logical data warehouses under the same physical database, tied together by conformed dimensions. The idea of the enterprise data warehouse is to extract, clean, conform and transform data
whenever possible to make the information more meaningful, and then load it into physical star schemas. Different data marts, such as Retail, Telecom, Stock, Insurance, or a Single Customer View for any CPG, telco or retail organization, can co-exist in the enterprise data warehouse. For organizations with a business presence in different countries, it is worth having the country-wide data available in the same data warehouse to make it a single version of truth for reporting and for the CRM system.

Q.5) a) Explain ETL process in data warehousing.


Ans:

ETL (Extract-Transform-Load)
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a process of how
the data are loaded from the source system to the data warehouse. Currently, the ETL encompasses a
cleaning step as a separate step. The sequence is then Extract-Clean-Transform-Load. Let us briefly
describe each step of the ETL process.

Process

● Extract
The Extract step covers the data extraction from the source system and makes it accessible for further
processing. The main objective of the extract step is to retrieve all the required data from the source
system with as little resources as possible. The extract step should be designed in a way that it does not
negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
● Update notification - if the source system is able to provide a notification that a record has been
changed and describe the change, this is the easiest way to get the data.
● Incremental extract - some systems may not be able to provide notification that an update has
occurred, but they are able to identify which records have been modified and provide an extract
of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
● Full extract - some systems are not able to identify which data has been changed at all, so a full
extract is the only way one can get the data out of the system. The full extract requires keeping
a copy of the last extract in the same format in order to be able to identify changes. Full extract
handles deletions as well.

When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.

Clean
The cleaning step is one of the most important as it ensures the quality of the data in the data
warehouse. Cleaning should perform basic data unification rules, such as:
● Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not
Available are translated to standard Male/Female/Unknown)
● Convert null values into standardized Not Available/Not Provided value
● Convert phone numbers, ZIP codes to a standardized form
● Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
● Validate address fields against each other (State/Country, City/State, City/ZIP code, City/Street).

● Transform
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension (i.e. conformed dimension) using the
same units so that they can later be joined. The transformation step also requires joining data from
several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated
values, and applying advanced validation rules.

● Load
During the load step, it is necessary to ensure that the load is performed correctly and with as little
resources as possible. The target of the Load process is often a database. In order to make the load
process efficient, it is helpful to disable any constraints and indexes before the load and enable them
back only after the load completes. The referential integrity needs to be maintained by ETL tool to
ensure consistency.

Managing ETL Process


The ETL process seems quite straightforward, but as with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process with fail-recovery in mind.

Staging
It should be possible to restart, at least, some of the phases independently from the others. For
example, if the transformation step fails, it should not be necessary to restart the Extract step. We can
ensure this by implementing proper staging. Staging means that the data is simply dumped to the
location (called the Staging Area) so that it can then be read by the next processing phase. The staging
area is also used during ETL process to store intermediate results of processing. This is ok for the ETL
process which uses for this purpose. However, the staging area should is be accessed by the load ETL
process only. It should never be available to anyone else; particularly not to end users as it is not
intended for data presentation to the end-user.may contain incomplete or
in-the-middle-of-the-processing data.

Q. 5) b) Explain the architecture of parallel database.


Ans:

Parallel Database Architecture


Today everybody is interested in storing the information they have. Even small organizations collect data and maintain mega databases. Though databases eat space, they are really helpful in many ways; for example, they help in taking decisions through a decision support system. Handling such voluminous data through a conventional centralized system is a bit complex, meaning that even simple queries become time-consuming. The solution is to handle those databases through parallel database systems, where a table / database is distributed among multiple processors, possibly equally, to perform the queries in parallel. Such a system, which shares resources to handle massive data in order to increase the performance of the whole system, is called a parallel database system.

We need a suitable architecture to handle the above; that is, we need architectures which can handle data through data distribution and parallel query execution, and thereby produce good throughput of queries or transactions. Figures 1, 2 and 3 show the different architectures proposed and successfully implemented in the area of parallel database systems. In the figures, P represents processors, M represents memory, and D represents disks/disk setups.

1. Shared Memory Architecture



Figure 1 - Shared Memory Architecture

In the shared memory architecture, a single memory is shared among many processors, as shown in Figure 1. Several processors are connected through an interconnection network to the main memory and the disk setup. Here the interconnection network is usually a high speed network (it may be a bus, mesh, or hypercube) which makes data sharing (transporting) easy among the various components (processor, memory, and disk).

Advantages:
● Simple implementation
● Establishes effective communication between processors through a single memory address space.
● This in turn leads to less communication overhead.

Disadvantages:
● Higher degree of parallelism (more number of concurrent operations in different processors)
cannot be achieved due to the reason that all the processors share the same interconnection
network to connect with memory. This causes Bottleneck in interconnection network
(Interference), especially in the case of Bus interconnection network.
● The addition of a processor would slow down the existing processors.
● Cache-coherency should be maintained. That is, if any processor tries to read the data used or
modified by other processors, then we need to ensure that the data is of latest version.
● Degree of Parallelism is limited. More number of parallel processes might degrade the
performance.

2. Shared Disk Architecture



Figure 2 - Shared Disk Architecture

In Shared Disk architecture, single disk or single disk setup is shared among all the available processors
and also all the processors have their own private memories as shown in Figure 2.

Advantages:
● Failure of any processors would not stop the entire system (Fault tolerance)
● Interconnection to the memory is not a bottleneck. (It was bottleneck in Shared Memory
architecture)
● Support larger number of processors (when compared to Shared Memory architecture)

Disadvantages:
● Interconnection to the disk is bottleneck as all processors share common disk setup.
● Inter-processor communication is slow. The reason is that all the processors have their own memory; hence, communication between processors requires reading data from other processors’ memory, which needs additional software support.
Example Real Time Shared Disk Implementation
● DEC clusters (VMScluster) running Rdb

3. Shared Nothing Architecture



Figure 3 - Shared Nothing Architecture

In the shared nothing architecture, every processor has its own memory and disk setup. This setup may be considered as a set of individual computers connected through a high speed interconnection network, using regular network protocols and switches, for example, to share data between computers. (This architecture is also used in distributed database systems.) In a shared nothing parallel database system implementation, we insist on the use of similar nodes, that is, homogeneous systems. (In a distributed database system we may use heterogeneous nodes.)

Advantages:
● The number of processors used here is scalable; that is, the design is flexible enough to add more computers.
● Unlike in the other two architectures, only data requests which cannot be answered by the local processor need to be forwarded through the interconnection network.

Disadvantages:
● Non-local disk accesses are costly. That is, if one server receives a request and the required data is not available locally, the request must be routed to the server where the data is available, which is slightly complex.
● Communication cost is involved in transporting data among computers.

Q. 6) a) Explain why recovery in a distributed DBMS is more complicated than in centralized system.

Ans:
Introduction
Centralized database is a database in which data is stored and maintained in a single location. This is the
traditional approach for storing data in large enterprises. Distributed database is a database in which
data is stored in storage devices that are not located in the same physical location but the database is
controlled using a central Database Management System (DBMS).

Centralized Database
In a centralized database, all the data of an organization is stored in a single place such as a mainframe
computer or a server. Users in remote locations access the data through the Wide Area Network (WAN)
using the application programs provided to access the data.

The centralized database (the mainframe or the server) should be able to satisfy all the requests coming
to the system, therefore could easily become a bottleneck. But since all the data reside in a single place
it easier to maintain and back up data. Furthermore, it is easier to maintain data integrity, because once
data is stored in a centralized database, outdated data is no longer available in other places.

Distributed Database
In a distributed database, the data is stored in storage devices that are located in different physical
locations. They are not attached to a common CPU but the database is controlled by a central DBMS.
Users access the data in a distributed database by accessing the WAN. To keep a distributed database up
to date, it uses the replication and duplication processes.

The replication process identifies changes in the distributed database and applies those changes to
make sure that all the distributed databases look the same. Depending on the number of distributed
databases, this process could become very complex and time consuming. The duplication process
identifies one database as a master database and duplicates that database. This process is not as complicated as the replication process, but it makes sure that all the distributed databases have the same data.

Difference between Distributed Database and Centralized Database


While a centralized database keeps its data in storage devices that are in a single location connected to a
single CPU, a distributed database system keeps its data in storage devices that are possibly located in
different geographical locations and managed using a central DBMS. A centralized database is easier to
maintain and keep updated since all the data are stored in a single location.

Furthermore, it is easier to maintain data integrity and avoid the requirement for data duplication.
Keeping the data up to date in distributed database system requires additional work, therefore increases
the cost of maintenance and complexity and also requires additional software for this purpose.
Furthermore, designing databases for a distributed database is more complex than the same for a
centralized database.

Recovery mechanism
In a centralized system, all the requests to access data are processed by a single entity, such as a single mainframe, which could easily become a bottleneck and a single point of failure. With distributed databases this bottleneck can be avoided, since the databases are parallelized and the load is balanced between several servers. However, recovery then becomes more complicated: a transaction may have updated data at several sites, so the failure of any one site or of the network must be handled, and all sites must agree on whether the transaction commits or aborts.

Q. 6) b) What is K-mean clustering algorithm? Explain with example.


Ans :

K-means clustering
Clustering is the process of partitioning a group of data points into a small number of clusters. For
instance, the items in a supermarket are clustered in categories (butter, cheese and milk are grouped in
dairy products). Of course this is a qualitative kind of partitioning. A quantitative approach would be to
measure certain features of the products, say percentage of milk and others, and products with high
percentage of milk would be grouped together. In general, we have n data points xi,i=1...n that have to
be partitioned in k clusters. The goal is to assign a cluster to each data point. K-means is a clustering
method that aims to find the positions μi,i=1...k of the clusters that minimize the distance from the data
points to the cluster. K-means clustering solves

argmin over c of Σi=1..k Σx∈ci d(x, μi) = argmin over c of Σi=1..k Σx∈ci ∥x − μi∥²

where ci is the set of points that belong to cluster i. K-means clustering uses the square of the Euclidean distance, d(x, μi) = ∥x − μi∥². This problem is not trivial (in fact it is NP-hard), so the K-means algorithm only hopes to find the global minimum, possibly getting stuck in a different solution.

K-means algorithm
Lloyd's algorithm, mostly known as the k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters k. Then:
1. Initialize the centers of the clusters: μi = some value, i = 1, ..., k.
2. Attribute the closest cluster to each data point.
3. Set the position of each cluster to the mean of all data points belonging to that cluster: μi = (1/|ci|) Σ xj over all xj in ci.
4. Repeat steps 2-3 until convergence.
Notation: |c| = the number of elements in c.

The algorithm eventually converges to a point, although it is not necessarily the minimum of the sum of
squares. That is because the problem is non-convex and the algorithm is just a heuristic, converging to a
local minimum. The algorithm stops when the assignments do not change from one iteration to the
next.
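
As a rough illustration of these steps, here is a minimal NumPy sketch of Lloyd's algorithm. It assumes Forgy initialization and a fixed iteration cap; the function name kmeans, its parameters and the stopping test are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1 (Forgy initialization): pick k distinct observations as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Step 2: assign each point to the closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Stop when the centers (and hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, kmeans(X, 4) on an (n, 2) array of points would return the four estimated centers together with the cluster index assigned to each point.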

Deciding the number of clusters


The number of clusters should match the data; an incorrect choice of the number of clusters will
invalidate the whole process. An empirical way to find the best number of clusters is to run K-means
clustering with different numbers of clusters and measure the resulting within-cluster sum of squares.
The most curious can look at this paper for a benchmarking of 30 procedures for estimating the number
of clusters.
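
A hedged sketch of this empirical approach (often called the elbow method) is given below. It reruns the kmeans function from the previous sketch for several candidate values of k and records the within-cluster sum of squares; the dataset X and the range of k are made up purely for illustration.

```python
import numpy as np

# Toy 2-D dataset with four visible groups (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5], [5, 0])])

def within_cluster_ss(X, centers, labels):
    # Total squared distance from each point to its assigned cluster center.
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))

# Re-run k-means (the kmeans() sketch above) for several candidate values of k
# and look for the "elbow" where the decrease in the sum of squares levels off.
scores = {}
for k in range(1, 9):
    centers, labels = kmeans(X, k)
    scores[k] = within_cluster_ss(X, centers, labels)
print(scores)
```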

Initializing the position of the clusters


It is really up to you! Here are some common methods:
● Forgy: set the positions of the k clusters to k observations chosen randomly from the dataset.
● Random partition: assign a cluster randomly to each observation and compute means as in step
3.
Since the algorithm stops in a local minimum, the initial position of the clusters is very important; a small sketch of both schemes follows.
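
A minimal sketch of the two schemes, assuming NumPy and the same (n, d) data array X as before; the function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng()

def forgy_init(X, k):
    # Forgy: use k observations chosen at random (without replacement)
    # as the initial cluster centers.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k):
    # Random partition: assign each observation to a random cluster, then
    # take each cluster's mean as its initial center (step 3 of the algorithm).
    # Assumes n is large enough that no cluster ends up empty.
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```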

Example:

Consider a set of two-dimensional data points clustered at 4 distinct positions, and run the k-means
algorithm on it after choosing an initialization method and a number of clusters. More often than not,
the algorithm converges to the best solution. However, over enough runs, some initializations of the
clusters lead to a "bad" local minimum. If the wrong number of clusters is chosen, the drastic effect on
the result of the algorithm becomes clearly visible.

Q. 7) Differentiate between the following:

a) Synchronous Vs Asynchronous replication


Ans:
When a technician or server operator needs to create copies of data, be it over a SAN, LAN or WAN,
there are two main methods of doing this: synchronous and asynchronous replication. Each has its own
benefits and drawbacks, so the right option depends on the requirements of the workload.

Synchronous Replication
Essentially, synchronous replication writes the data to both the primary and the secondary site at the
same time. In doing this, the data remains completely current and identical at both locations. The process
works quickly and there is an extremely small margin of error. Because of this, it is ideal for disaster
recovery and is the method preferred for projects that require absolutely no data loss.

Asynchronous replication
Asynchronous replication also writes data to both a primary and a secondary site; however, with this
process there is a delay when data is copied from one to the other. Experts call this approach to data
backup "store and forward". With this type of replication, the data is first written to the primary array and
then committed for replication to a secondary store, either memory- or disk-based. Finally, the data is
copied to the target at scheduled intervals. This method can work over longer distances than
synchronous replication, so at times it may be the only option.
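
The two write paths can be contrasted with a small, purely illustrative Python sketch; the class and method names are hypothetical and do not correspond to any real replication product's API. The synchronous path acknowledges a write only after both copies are durable, while the asynchronous path acknowledges after the primary write and ships buffered changes to the replica at intervals, in the "store and forward" style described above.

```python
import queue
import threading
import time

class SynchronousReplicator:
    """Write to primary and replica before acknowledging (zero data loss)."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica

    def write(self, key, value):
        # The caller waits for both copies, so latency is bounded by the
        # slower (often remote) write.
        self.primary[key] = value
        self.replica[key] = value
        return "ack"

class AsynchronousReplicator:
    """Acknowledge after the primary write; ship changes to the replica later."""
    def __init__(self, primary, replica, interval=1.0):
        self.primary, self.replica = primary, replica
        self.buffer = queue.Queue()  # the "store and forward" log
        threading.Thread(target=self._ship, args=(interval,), daemon=True).start()

    def write(self, key, value):
        self.primary[key] = value
        self.buffer.put((key, value))  # replication happens later, in batches
        return "ack"

    def _ship(self, interval):
        # Periodically drain buffered writes to the replica; anything still
        # in the buffer at failover time is the potential data loss.
        while True:
            time.sleep(interval)
            while not self.buffer.empty():
                key, value = self.buffer.get()
                self.replica[key] = value
```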
Synchronous vs. Asynchronous Comparison Table

Type of Replication | Synchronous | Asynchronous
Recovery Point Objective | Zero | 15 minutes to a few hours
Distance Limitations | Best if both SANs are in the same datacenter | Anywhere with a good data connection
Cost | Most expensive type of SAN solution | Not as expensive as synchronous, but more expensive than basic SANs

Apart from the differences above, synchronous and asynchronous replication also differ in the way data is
written to the replica. Most synchronous replication products write data to primary storage and to the
replica simultaneously; as a result, the primary copy and the replica should always remain synchronized.

In contrast, asynchronous replication products copy the data to the replica after the data is already
written to the primary storage. Although the replication process may occur in near-real-time, it is more
common for replication to occur on a scheduled basis. For instance, write operations may be
transmitted to the replica in batches on a periodic basis (for example, every one minute). In case of a
failover event, some data loss may occur.

b) Semi Joins Vs Bloom Joins


Ans:

Semi join and Bloom join are two joining methods used in query processing for distributed databases.
When processing queries in distributed databases, data needs to be transferred between databases
located in different sites.

Semi Join
Semi join is a method used for efficient query processing in distributed database environments.
Consider a situation where an Employee database (holding information such as the employee's name, the
number of the department she works for, etc.) is located at site 1 and a Department database (holding
information such as department number, department name, location, etc.) is located at site 2. Suppose we
want to obtain each employee's name together with the name of the department she works for (only for
departments located in "New York") by executing a query at a query processor located at site 3. There are
several ways that data could be transferred between the three sites to achieve this task. When transferring
data, however, it is important to note that it is not necessary to transfer the whole database between the
sites; only the attributes (or tuples) that are required for the join need to be shipped for the query to be
executed efficiently. Semi join is a method that can be used to reduce the amount of data shipped between
the sites. In a semi join, only the join column is transferred from one site to the other, and that transferred
column is then used to reduce the size of the relations shipped between the other sites. For the above
example, we can transfer just the department number and department name of the tuples with
location = "New York" from site 2 to site 1, perform the join at site 1, and transfer the final relation back to
site 3 (as sketched below).
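
A hedged sketch of this reduction is shown below, with the three sites simulated as plain Python data structures; the relations, column names and sample values are made up for illustration and do not come from any real DDBMS API.

```python
# Site 2: Department(dept_no, dept_name, location) -- made-up sample data.
departments = [
    (10, "Sales", "New York"),
    (20, "Research", "Boston"),
    (30, "Marketing", "New York"),
]

# Site 1: Employee(emp_name, dept_no) -- made-up sample data.
employees = [
    ("Alice", 10),
    ("Bob", 20),
    ("Carol", 30),
]

# Step 1 (at site 2): project only the join-relevant columns of the qualifying
# tuples, i.e. department number and name where location = "New York".
ny_departments = [(dno, dname) for dno, dname, loc in departments
                  if loc == "New York"]

# Step 2: ship this small projection to site 1 and perform the join there.
dept_lookup = dict(ny_departments)
joined = [(ename, dept_lookup[dno]) for ename, dno in employees
          if dno in dept_lookup]

# Step 3: ship the (already reduced) final relation to site 3, the query site.
print(joined)   # [('Alice', 'Sales'), ('Carol', 'Marketing')]
```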

Bloom Join
As mentioned earlier, bloom join is another method used to avoid transferring unnecessary data
between sites when executing queries in a distributed database environment. In a bloom join, rather
than transferring the join column itself, a compact representation of the join column is transferred
between the sites. Bloom join uses a bloom filter, which employs a bit vector to execute membership
queries. First, a bloom filter is built over the join column and transferred between the sites, and then the
joining operations are performed on the tuples that pass the filter (see the sketch below).
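
A minimal sketch of the bit-vector idea follows, reusing the same made-up employee/department data as the semi join sketch. A single call to Python's built-in hash stands in for the several independent hash functions a real bloom filter would use, and the vector size is arbitrary.

```python
M = 64  # size of the bit vector (arbitrary for this sketch)

def bloom_add(bits, key):
    # Set the bit selected by a hash of the join-column value.
    return bits | (1 << (hash(key) % M))

def bloom_maybe_contains(bits, key):
    # A zero bit proves absence; a one bit only means "possibly present"
    # (false positives are possible), so a final exact join is still needed.
    return bits & (1 << (hash(key) % M)) != 0

# Site 2 builds a filter over its join column (numbers of the "New York"
# departments) and ships only this compact bit vector to site 1.
dept_filter = 0
for dept_no in (10, 30):
    dept_filter = bloom_add(dept_filter, dept_no)

# Site 1 uses the filter to drop employee tuples that cannot possibly join,
# and ships only the surviving tuples onward for the actual join.
employees = [("Alice", 10), ("Bob", 20), ("Carol", 30)]
candidates = [e for e in employees if bloom_maybe_contains(dept_filter, e[1])]
print(candidates)   # Bob's tuple is filtered out before any shipping
```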

Difference Between Semi Join and Bloom Join


Semi join and bloom join are joining methods used in query processing for distributed databases. In a
distributed database, data has to be transferred between databases that are usually located at different
sites in order to process queries. To keep the cost of such operations low, queries are optimized so that
the minimum amount of data needs to be transferred, and this is where these two methods come into
the picture.

As an example, suppose some of the information about an employee is stored at site 1 while the rest is
stored at site 2, and the complete information is to be obtained from site 3 by executing a query. It is not
necessary to transfer the entire database; only the attributes required for the join need to be shipped for
the query to be executed successfully. A semi join reduces the amount of data transferred between the
sites because only the join column is transferred.

In a bloom join, a compact representation of the join column, rather than the join column itself, is
transferred between the remote sites. This representation is created using a bloom filter, which employs
a bit vector for executing membership queries. A bloom join is generally more efficient than a semi join
because the amount of data transferred is far smaller.

c) OODBMS Vs ORDBMS
Ans:
Comparing the ORDBMS with the OODBMS:

1. OODBMSs and ORDBMSs both support user-defined ADTs, structured types, object identity and
reference types, and inheritance

2. They both support a query language for manipulating collection types.


3. ORDBMS support an extended form of SQL, and OODBMSs support ODL/OQL.

4. ORDBMSs consciously try to add OODBMS features to an RDBMS, and OODBMS in their turn
have developed query languages based on relational query languages.

5. Both OODBMSs and ORDBMS provide DBMS functionality such as concurrency control and
recovery.

6. OODBMSs try to add DBMS functionality to a programming language, whereas ORDBMSs try to
add richer data types to a relational DBMS.

7. OODBMSs put more emphasis on the role of the client side, which can benefit long, process-intensive
transactions. In ORDBMSs, SQL is still the language for data definition, manipulation and query.

8. OODBMSs have been optimized to directly support object-oriented applications and specific OO
languages. ORDBMSs are supported by most of the database vendors in the DBMS marketplace; most
third-party database tools are written for the relational model and will therefore be compatible with
SQL3. ORDBMSs search, access and manipulate complex data types in the database with standard SQL3,
without breaking the rules of the relational data model.

9. OODBMSs suit applications that retrieve relatively few (generally physically large) highly complex
objects and work on them for long periods of time, whereas ORDBMSs suit applications that process a
large number of short-lived (generally ad-hoc query) transactions on data items that can be complex in
structure.

d) OLAP Vs OLTP
Ans :

Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on the Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
