
Data Provenance for Historical Queries in Relational Database

Asma Rani, Lecturer, Birla Institute of Technology & Science, Pilani, India. +919468646534, asma.rani@pilani.bits-pilani.ac.in
Navneet Goyal, Professor, Birla Institute of Technology & Science, Pilani, India. +919929095379, goel@pilani.bits-pilani.ac.in
Shashi K. Gadia, Associate Professor, Iowa State University, United States. (515) 294-2253, gadia@iastate.edu

ABSTRACT
Capturing, modeling, and querying data provenance in databases has gained considerable importance in the last decade. Applications built on top of databases now collect provenance for various purposes, such as establishing the trustworthiness of data, managing updates, and measuring quality. For these purposes, provenance information must be efficiently captured, stored, and queried for current as well as historical queries executed on the database. Most existing provenance models, such as DB-Notes, MONDRIAN, Perm, Orchestra, TRIO, and GProM, are suitable for capturing and querying provenance in relational databases. All of these models can capture provenance only for currently executing queries, except for TRIO and GProM, which can also capture and query provenance for historical queries, but at very high time and space cost. In this paper, we propose a framework, Data Provenance for Historical Queries (DPHQ), which efficiently captures and queries provenance for queries, including historical queries. The proposed model also supports provenance for updates. Our model uses a Zero Information Loss Database [2], built on the concept of nested relations, to execute historical queries at any point in time. A graph database is used for storing and subsequently querying the provenance information.

CCS Concepts
Information Systems → Data Provenance
Information Systems → Query Languages

Keywords
Data Provenance; DPHQ; ZILD; Query Inversion; Provenance Querying; Graph Database; Neo4j; TPC-H.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
Compute 2015, October 29-31, 2015, Ghaziabad, India
© 2015 ACM. ISBN 978-1-4503-3650-5/15/10 $15.00
DOI: http://dx.doi.org/10.1145/2835043.2835047

1. INTRODUCTION
As data in the world grows at an unprecedented rate, data provenance has become an important topic of research. Data provenance is used to determine the veracity [11] and the quality [5] of data. The provenance of a piece of data includes information such as its origin, the date and process of its creation, and the date of its last modification [19]. Capturing, storing, and querying provenance data is of paramount importance, as it supports the trustworthiness [10], reliability, reputability, accountability, privacy, and quality [5] of data. In the context of scientific experiments, data provenance can be used to reproduce experiments [18] and to assess the quality of the work. It supports auditing and helps in view maintenance [7] and update propagation [13] without executing the query again. In addition, it enables data analysis with less effort in terms of time and the volume of data to be searched [11]. In a business domain, it makes it easier to trust data transferred between partners [21] and helps in decision making and data analysis. One of the most common examples of provenance information is data citation [3], where a reference to a previous publication is given.

The granularity of provenance information can be defined at two levels: coarse grained and fine grained. The first level tells us which sequence of activities or operators was executed to generate a dataset, while the second gives information about which source tuples contribute to a piece of data in the result set [21]. In databases, data provenance is captured at the fine-grained level, as it is more significant and explanatory. Fine-grained provenance can further be classified into three categories: Where-Provenance [4], Why-Provenance [4], and How-Provenance [14]. Where-Provenance identifies the sources from which a value is copied to the result set; it tells us only about the cells the value comes from, not about the source tuples that are sufficient to regenerate the result tuple by executing the query again on the provenance information. Why-Provenance captures why a result tuple has been derived [4]; it gives the source tuples that contributed to the result tuple and that are sufficient to reproduce it by executing the same query again on the provenance information, but it provides no information about the derivation process, i.e., how these tuples contributed to the result. How-Provenance captures the complete derivation history of a result tuple in the form of a provenance polynomial [14][17]; it is a superset of Where- and Why-Provenance. An example showing why- and how-provenance is given in Figure 1.
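To make the distinction concrete, the following is an illustrative sketch (ours, not part of the paper's implementation) that computes why- and how-provenance for the join query of Figure 1 over its toy Part and Partsupp tables; the tuple IDs P1, PS3, etc. follow the figure, and the in-memory representation is hypothetical.

```python
# Toy tables from Figure 1: (tid, pid, pname) and (tid, pid, sid).
part = [("P1", 1, "Mouse"), ("P2", 2, "HDD"), ("P3", 3, "Kindle")]
partsupp = [("PS1", 1, 1), ("PS2", 2, 1), ("PS3", 1, 2)]

# select Pname from Part p join Partsupp ps on p.Pid = ps.Pid where ps.Sid = 2
why, how = {}, {}
for p_tid, p_pid, pname in part:
    for ps_tid, ps_pid, sid in partsupp:
        if p_pid == ps_pid and sid == 2:
            # Why-Provenance: witness sets of source tuples per result value.
            why.setdefault(pname, []).append((p_tid, ps_tid))
            # How-Provenance: '*' records the join combination in a polynomial.
            how.setdefault(pname, []).append(f"{p_tid}*{ps_tid}")

print(why)                                        # {'Mouse': [('P1', 'PS3')]}
print({k: "+".join(v) for k, v in how.items()})   # {'Mouse': 'P1*PS3'}
```

Alternative derivations of the same result value would appear as additional witness pairs in the why-provenance and as '+'-separated terms in the how-provenance polynomial.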
Part                      Partsupp
Tid   Pid   Pname         Tid   Pid   Sid
P1    1     Mouse         PS1   1     1
P2    2     HDD           PS2   2     1
P3    3     Kindle        PS3   1     2

Query: select Pname from Part p join Partsupp ps on p.Pid=ps.Pid where ps.Sid=2
Query Result: Mouse
Why-Provenance = <P1,PS3>; How-Provenance = <P1*PS3>

Figure 1. Example: Why- and How-Provenance

In this work, we present a provenance framework, Data Provenance for Historical Queries (DPHQ), which captures detailed provenance information for every result tuple of a query, including historical queries. To capture provenance information for historical queries, we need to efficiently maintain the complete history of all updates in a database. For this purpose, we use a Zero Information Loss Database (ZILD) [2], implemented on top of a relational database. ZILD is space efficient: on an update it stores only the updated value in a nested table, without copying the values of the other attributes into a new row, unlike relational databases that use Type II changes (the new-record approach).

A result tuple of a query can be generated in multiple ways. DPHQ generates a provenance polynomial that includes all derivations of a given result tuple. The DPHQ architecture and the details of the ZILD implementation are given in Section 3, and DPHQ query support is presented in Section 3.1. Provenance information is captured using the query inversion approach [12] and stored in a provenance table; details of capturing and storing provenance are also presented in Section 3.1. Although storing provenance data in a relational database is simple, querying or tracking it is computationally expensive due to the large number of joins required. Therefore, to query provenance information efficiently, our framework also stores it in a graph database in the form of key-value pairs [22]. Details of how provenance is stored in the graph database and queried are given in Section 3.2. The salient features of DPHQ are:

- It captures provenance information for every result tuple of a query, including historical queries, using a zero information loss database.
- It manages updates efficiently using nested tables.

2. RELATED WORK
In the literature, there are two broad approaches to data provenance: the annotation-based approach [21][19][6] and the query-inversion-based approach [21][19][12].

In the annotation-based approach, zero or more annotations are appended to every cell value in the source tuples, and these propagate from source to result data during querying. Laura Chiticariu et al. [6] proposed DB-Notes, a system based on where-provenance that applies zero or more annotations to each cell, which propagate from source to result data while querying. They developed a provenance query language, pSQL (an extension of SQL), to store and propagate annotations transparently to the user. Floris Geerts et al. [8] presented an annotation-oriented data model, MONDRIAN, for provenance capture and querying. It applies annotations to sets of cell values, in the form of blocks, instead of to each individual cell, and uses different colors for different annotations on blocks; a Color Query Language, based on a Color Algebra, was proposed for querying both data and provenance. Both DB-Notes and MONDRIAN use the annotation approach to capture provenance, but assigning annotations to each cell or block is complex and causes storage overhead.

In the query inversion approach, provenance information is captured using query rewriting: transformations are inverted to determine the source tuples that contributed to the result set. Boris Glavic et al. [9][12] proposed Perm (Provenance Extension of the Relational Model), a model for provenance in relational databases based on why-provenance, and presented PI-CS (Perm Influence Contribution Semantics) to generate provenance that works on both sets and bags. Perm is built on PostgreSQL and captures provenance by rewriting the original query so that it annotates the result dataset with provenance information. Perm provides an SQL language extension, SQL-PLE, to let users issue provenance queries.

Todd J. Green et al. [13][15] proposed the Orchestra data model, motivated by the need for collaborative data sharing between different users and efficient update exchange. Orchestra also enforces trust levels by capturing provenance information at every step of transformation by any user: as the updates performed by a user are translated, they are filtered based on trust conditions that use the provenance of the data in the updates. The authors implemented a query language, ProQL [16], over the provenance graph, based on provenance semirings [14], to trace the derivation of each transformed update. However, the Orchestra data model is applicable only to small datasets; as the volume of data increases, the communication overhead also increases with the growing number of updates.

DB-Notes, MONDRIAN, Orchestra, and Perm are suitable for capturing provenance information only for the currently executing query, and they support provenance querying only at capture time. To analyze data generated in the past, however, provenance must also be captured for historical queries, and querying support for the stored provenance is needed for applications such as quality measurement, reputability assessment, and error analysis.

Jennifer Widom [23] proposed the Trio model for database systems, aimed primarily at data accuracy and data lineage. She defined view data lineage as the maximal set of tuples from the source tables that produced a data item in a materialized warehouse view. Trio uses inverse queries [21] instead of annotation propagation for provenance generation; the inverse queries are recorded at the granularity of a tuple and stored in a special lineage table, Lineage(tupleID, derivation-type, time, how-derived, lineage-data). Widom et al. also proposed the TriQL query language [20] for provenance querying in Trio; a TriQL query is converted into an SQL query, after which post-processing is done for accuracy, confidence, and lineage queries. The Trio model is suitable for capturing lineage for past queries (historical lineage) from the expired portion of the database, i.e., data that is no longer valid, but its lineage table, which holds complete lineage information for every tuple, creates considerable storage overhead. Bahareh Sadat Arab et al. [1] presented GProM
(Generic Database Provenance Middleware), which captures provenance using query rewriting. GProM can also capture provenance for concurrent database transactions: using the audit log, it traces transaction provenance and provenance for past queries. However, capturing the provenance of past queries from the audit log requires complex query rewriting, the audit log may not always capture all the necessary information, and GProM does not store provenance information explicitly for later querying.

Marcin Wylot et al. [24][25] presented TripleProv, which co-locates lineage L (an annotation) with RDF triples using layouts such as SLPO and SPOL. TripleProv is suitable for web data but not for relational data, since propagating annotations from the source to the result set increases overhead, as in [6][8].

In this paper, we present an approach that generates a ZILD on top of any relational database, manages all updates efficiently, and is suitable for capturing the provenance of currently executing as well as historical queries. A provenance polynomial [17], which includes the complete derivation, is generated using the query inversion approach [12] and stored in a Neo4j graph database so that provenance information can be queried efficiently [22].

3. PROPOSED PROVENANCE FRAMEWORK
Figure 2 shows the complete data flow in the proposed DPHQ (Data Provenance for Historical Queries) framework. The user issues queries on the relational database; these are automatically rewritten and executed on a Zero Information Loss Database (ZILD). Queries are executed, and provenance is captured in the provenance table and also in a graph database. We have implemented ZILD on top of a relational database using object-relational database concepts such as user-defined data types and nested tables.

Figure 2. DPHQ Framework (data flow: user query → Query and Provenance Capturing Module (QPCM) → ZILD; the query table, relational database, and provenance table feed the graph database, which serves the provenance graph and query results)

A complete flowchart for creating a ZILD schema from a given relational schema is given in Figure 3. For each table of the classical database, the columns validfrom, validto, and tupleid are added to the table creation. Each column is then analyzed: a non-updatable column is added to the table as-is, while for an updatable column an object type (TYPE1) containing the column together with validfrom and validto members is created, a nested-table type (TYPE2) of TYPE1 is defined, and a column of TYPE2 is added in place of the original column.

Figure 3. ZILD Schema Design (flowchart summarized above)

In this paper, we use the following TPC-H schema [27]:

PART(partkey, name, mfgr, brand, type, size, container, retailprice, comment)
SUPPLIER(suppkey, name, address, nationkey, phone, acctbal, comment)
NATION(nationkey, name, regionkey, comment)
PARTSUPP(partkey, suppkey, availqty, supplycost, comment)
REGION(regionkey, name, comment)
CUSTOMER(custkey, name, address, nationkey, phone, acctbal, mktsegment, comment)
ORDERS(orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment)
LINEITEM(orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment)

Nested tables are created for all updatable attributes in the schema, to capture every update to those attributes. For example, in the Region table, the region name and the comment are updatable. To store all updates to the name, we create a nested table for name within the Region table, as shown below:

create or replace type R_Name as object (
    eR_Name varchar2(60),
    valid_from timestamp,
    valid_to timestamp);
create or replace type Reg_Name as table of R_Name;

In ZILD, the following table corresponding to Region is created:

create table Zero_Region(
    R_RegionKey number(38) primary key,
    r_Name Reg_Name,
    r_Comment Reg_Comment,
    r_valid_from timestamp,
    r_valid_to timestamp,
    R_tID varchar2(10))
NESTED TABLE r_Name store as rReg_Name_tab,
NESTED TABLE r_Comment store as rReg_Comment_tab;
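The versioning behavior of such a nested table can be sketched as follows. This is an illustrative Python model (the function and field names are ours, not the paper's API): each updatable attribute holds a nested list of (value, valid_from, valid_to) versions, an update closes the current version and appends a new one, and the other attributes of the row are never copied, unlike a Type II new-record approach.

```python
from datetime import datetime

def zild_update(row, attr, new_value, now=None):
    """Close the current version of `attr` and append a new one."""
    now = now or datetime.now()
    history = row[attr]              # nested table for this attribute
    history[-1]["valid_to"] = now    # current version expires now
    history.append({"value": new_value, "valid_from": now, "valid_to": None})

def value_on(row, attr, at):
    """Return the value of `attr` that was valid at time `at`."""
    for v in row[attr]:
        if v["valid_from"] <= at and (v["valid_to"] is None or at <= v["valid_to"]):
            return v["value"]
    return None

t0 = datetime(2015, 1, 1)
region = {"r_regionkey": 2,
          "r_name": [{"value": "ASIA", "valid_from": t0, "valid_to": None}]}
zild_update(region, "r_name", "ASIA-PACIFIC", now=datetime(2015, 4, 22))

print(value_on(region, "r_name", datetime(2015, 3, 1)))   # ASIA
print(value_on(region, "r_name", datetime(2015, 5, 1)))   # ASIA-PACIFIC
```

A historical query with valid_on 'date' then amounts to evaluating value_on at that date, while valid_on_now selects the version whose valid_to is NULL.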
q72 (issued by User1): select r_name from zero_nation n join zero_region r on r.r_regionkey=n.n_regionkey

Rewritten Query:
select rr.er_name, replace(wm_concat(rtrim(nvl2(r.r_tid, r.r_tid||'*', '')||
       nvl2(n.n_tid, n.n_tid||'*', ''), '*')), ',', '+') provenance
from zero_nation n, zero_region r, table(r.r_name) rr
where r.r_regionkey=n.n_regionkey
  and n_valid_to is null and r_valid_to is null and rr.valid_to is null
group by rr.er_name;

Query Table:
QID | QUERYNAME                                                                           | USER  | TIME
q70 | update zero_region set r_name='BHARAT' where r_name='INDIA'                         | asma  | 22-apr-2015 03:33:50 pm
q71 | select p_name from zero_part, zero_ps where p_partkey=ps_partkey and ps_suppkey=5   | User1 | 24-apr-2015 09:01:42 am
q72 | select r_name from zero_nation n join zero_region r on r.r_regionkey=n.n_regionkey  | User1 | 15-may-2015 12:04:31 pm

Query Result:
TupleID | R_Name     | PROVENANCE
q72t0   | AMERICA    | R2*N3+R2*N2+R2*N25+R2*N18+R2*N4
q72t1   | ASIA       | R3*N9+R3*N22+R3*N19+R3*N13+R3*N10
q72t2   | EUROPE     | R4*N7+R4*N24+R4*N23+R4*N20+R4*N8
q72t3   | MIDDLEEAST | R5*N5+R5*N21+R5*N14+R5*N12+R5*N11
q72t4   | africa     | R1*N6+R1*N1+R1*N17+R1*N16+R1*N15

Provenance Table:
RESULTID | PROVENANCE
q72t0    | R2*N3+R2*N2+R2*N25+R2*N18+R2*N4
q72t1    | R3*N9+R3*N22+R3*N19+R3*N13+R3*N10
q72t2    | R4*N7+R4*N24+R4*N23+R4*N20+R4*N8
q72t3    | R5*N5+R5*N21+R5*N14+R5*N12+R5*N11
q72t4    | R1*N6+R1*N1+R1*N17+R1*N16+R1*N15

Figure 4. DPHQ: Querying and Provenance Generation for Example 2
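The way such a polynomial is assembled during a join with GROUP BY can be sketched in a few lines. This is an illustrative Python sketch (ours, not DPHQ's implementation); the sample tuple IDs loosely follow Figure 4, where each surviving join combination contributes a '*'-term and alternative derivations of the same result value are summed with '+'.

```python
# Toy source tuples: (tid, regionkey, name) and (tid, regionkey).
region = [("R2", 1, "AMERICA"), ("R3", 2, "ASIA")]
nation = [("N3", 1), ("N2", 1), ("N9", 2), ("N22", 2)]

result = {}  # result value -> list of '*'-terms (one per derivation)
for r_tid, r_key, r_name in region:
    for n_tid, n_key in nation:
        if r_key == n_key:                     # join predicate survives
            result.setdefault(r_name, []).append(f"{r_tid}*{n_tid}")

# GROUP BY on the result value merges alternative derivations with '+',
# mirroring the wm_concat/replace trick in the rewritten SQL query.
provenance = {name: "+".join(terms) for name, terms in result.items()}
print(provenance)  # {'AMERICA': 'R2*N3+R2*N2', 'ASIA': 'R3*N9+R3*N22'}
```

In the actual framework this aggregation is done inside the rewritten SQL itself, so the polynomial is produced in the same pass that computes the query result.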
To store data that is no longer valid, we add two attributes, valid_from and valid_to, to each tuple in a relation, where the valid_to attribute is NULL for currently valid tuples.

The Querying and Provenance Capturing Module (QPCM) is developed on top of ZILD. In the proposed provenance framework, the complexity introduced by ZILD is completely abstracted from the user: queries posed by users are parsed against the ZILD schema and rewritten by the QPCM, which passes them to ZILD for execution. A complete worked-out example is given in Section 3.1. All queries executed on the system are stored in a table in the relational database with the schema QUERYTABLE(QueryID, Query, USER, Time). This also provides some provenance information, such as which users are accessing the database, when, and what kinds of queries they are executing. The captured provenance information is stored in a table with the schema PROVENANCETABLE(ResultID, PROVENANCE) in the relational database, where ResultID is the concatenation of QueryID and TupleID. To support efficient provenance querying, the provenance information is also stored in a graph database in the form of key-value pairs.

3.1 Provenance Generation in DPHQ
In this section, we give details of provenance generation in DPHQ with some suitably chosen examples. Unlike a conventional relational database, DPHQ allows a user to query data that has expired or been modified, using the qualifiers valid_from and valid_to. DPHQ also allows users to regenerate the results of queries that were executed in the past: ZILD gives the same result as was obtained when the query was originally executed. To retrieve historical or currently valid data, we extend the SQL query with two constructs, instance and valid_on_now or valid_on 'date', as explained in Examples 1 and 2.

We now present a few examples of queries and the corresponding provenance that is generated.

Example 1 (Q1): Display the part names supplied by supplier 'Supplier#000000001' on 15/04/2015.

User query: select instance p_name from part p, partsupp ps, supplier s where p.p_partkey=ps.ps_partkey and s.s_suppkey=ps.ps_suppkey and s.s_name='Supplier#000000001' valid_on 15-Apr-2015

This query is passed to the QPCM, which automatically rewrites it into a query that can be executed on ZILD.
Rewritten query is given below:
Rewritten Query:
select instance pn.ep_name, replace(wm_concat(rtrim(nvl2(p.p_tid, p.p_tid||'*', '')||
       nvl2(ps.ps_tid, ps.ps_tid||'*', '')||nvl2(s.s_tid, s.s_tid||'*', ''), '*')), ',', '+') provenance
from zero_part p, table(p.p_name) pn, zero_partsupp ps, zero_supplier s, table(s.s_name) sn
where p.p_partkey=ps.ps_partkey and s.s_suppkey=ps.ps_suppkey
  and sn.es_name='Supplier#000000001'
  and '15-Apr-2015' between p_valid_from and p_valid_to
  and '15-Apr-2015' between ps_valid_from and ps_valid_to
  and '15-Apr-2015' between s_valid_from and s_valid_to
  and '15-Apr-2015' between pn.valid_from and pn.valid_to
  and '15-Apr-2015' between sn.valid_from and sn.valid_to
group by pn.ep_name;

The above query displays all part names supplied by Supplier#000000001 on 15-Apr-2015. The corresponding provenance, in the form of a provenance polynomial, is also generated.

Example 2 (q72): Display the region name where nation key = 1.

User Query: select instance r_name from region r join nation n on n.n_regionkey=r.r_regionkey where n.n_nationkey=1 valid_on_now

Rewritten Query:
select instance rr.er_name, replace(wm_concat(rtrim(nvl2(r.r_tid, r.r_tid||'*', '')||
       nvl2(n.n_tid, n.n_tid||'*', ''), '*')), ',', '+') provenance
from zero_nation n, zero_region r, table(r.r_name) rr
where r.r_regionkey=n.n_regionkey and n.n_nationkey=1
  and n_valid_to is null and r_valid_to is null and rr.valid_to is null
group by rr.er_name;

The above query displays the region name corresponding to the nation with nationkey = 1. The complete worked-out Example 2 and the corresponding provenance polynomial are shown in Figure 4.

Provenance information includes the provenance polynomial as well as the query execution time; the query execution time gives the validity time of the source tuples, and we store it in the query table. The captured provenance shown in Figure 4 is stored in the graph database, where edges are directed from source tuples to result tuples. Result-tuple nodes in the provenance graph carry an additional property holding the query execution time, so that the complete provenance information is stored, as stated earlier. A snapshot of the provenance graph in Neo4j for Example 2 is shown in Figure 5. It shows the different derivations of tuple q72t3 in the result set: N5*R5 + N11*R5 + N14*R5 + N21*R5 + N12*R5. In this polynomial, * signifies multiplication (the join operator) and + signifies alternative derivations (the OR operator).

Figure 5. Subset of Provenance Graph for the Query in Figure 4 (source tuples N5, N11, N12, N14, and N21 each join (*) with R5, and the alternatives combine (+) into output tuple q72t3)

3.2 Querying Provenance Data
The data structure commonly used to store provenance information is the Directed Acyclic Graph (DAG). Although storing provenance information in a relational database is simple, querying or tracking it is computationally inefficient [22] due to the large number of joins, as shown in Figure 4. Therefore, to query provenance information efficiently, our framework stores it in a graph database in the form of key-value pairs; for this purpose, we have used the Neo4j graph database [26].

Thus, DPHQ provides two interfaces for provenance querying: via the relational database using SQL, and via the graph database. Users can also perform forward and backward traversals of the provenance graph, as explained in Examples 3 and 4.

Example 3: Find all tuples that are derived from the tuple with TupleID R5 (assuming an error in the input tuple with TupleID R5).

Query: match (tupleid:tuple {name:'R5'})-[*]->(b) return tupleid, b

The above query returns the subtree rooted at node R5, giving all the result tuples it contributed to.

Example 4: Find all tuples that contribute to deriving the output tuple with TupleID q72t3 (to assess the quality or trustworthiness of this result).

Query: match (tupleid:tuple {name:'q72t3'})<-[*]-(b) return tupleid, b

The above query performs backward tracking in the provenance graph and returns all the tuples that contributed to q72t3.

4. CONCLUSION AND FUTURE WORK
We have presented a framework to capture Data Provenance for Historical Queries (DPHQ). In DPHQ, a zero information loss database is implemented on top of a relational database, which allows us to capture provenance information for the currently executing query as well as past queries. For this, our framework creates nested tables only for those attributes of the relations that are updatable, and thus manages all updates with low storage requirements. DPHQ captures provenance information using the query inversion approach. The captured provenance information is also stored in the form of a DAG in a graph database for efficient querying, and the provenance graph is updated every time new provenance is captured. Currently, DPHQ supports simple queries involving the join and union operators; in future work, we will extend it to capture provenance for complex queries and for nested queries. At present, DPHQ captures provenance information for relational databases; it will be further extended to capture provenance information for Big Data.
5. REFERENCES
[1] Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., and Glavic, B. 2014. A Generic Provenance Middleware for Queries, Updates, and Transactions. In TaPP '14: 6th USENIX Workshop on the Theory and Practice of Provenance.
[2] Bhargava, G., and Gadia, S. K. 1993. Relational Database Systems with Zero Information Loss. IEEE Transactions on Knowledge and Data Engineering, vol. 5, issue 1, pages 76-87.
[3] Buneman, P., Khanna, S., and Tan, W. C. 2000. Data Provenance: Some Basic Issues. In Proceedings of Foundations of Software Technology and Theoretical Computer Science, pages 87-93.
[4] Buneman, P., Khanna, S., and Tan, W. C. 2001. Why and Where: A Characterization of Data Provenance. In ICDT, Lecture Notes in Computer Science, pages 316-330.
[5] Buneman, P., and Davidson, S. B. 2010. Data Provenance: The Foundation of Data Quality. Technical report, September.
[6] Chiticariu, L., Tan, W. C., and Vijayvargiya, G. 2005. DBNotes: A Post-It System for Relational Databases based on Provenance. In SIGMOD, pages 942-944.
[7] Cui, Y., Widom, J., and Wiener, J. L. 2000. Tracing the Lineage of View Data in a Warehousing Environment. ACM TODS, volume 25, issue 2, pages 179-227.
[8] Geerts, F., and Kementsietsidis, A. 2006. MONDRIAN: Annotating and Querying Databases through Colors and Blocks. In ICDE, pages 82-91.
[9] Glavic, B., and Alonso, G. 2009. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE, pages 174-185.
[10] Glavic, B., and Miller, R. J. 2011. Reexamining Some Holy Grails of Data Provenance. In TaPP '11: 3rd USENIX Workshop on the Theory and Practice of Provenance, pages 1-6.
[11] Glavic, B. 2012. Big Data Provenance: Challenges and Implications for Benchmarking. Springer LNCS 8163, pages 72-80.
[12] Glavic, B., Alonso, G., and Miller, R. J. 2013. Using SQL for Efficient Generation and Querying of Provenance Information. Springer LNCS 8000, pages 291-320.
[13] Green, T. J., Karvounarakis, G., Ives, Z. G., and Tannen, V. 2007. Update Exchange with Mappings and Provenance. In VLDB, pages 675-686.
[14] Green, T. J., Karvounarakis, G., and Tannen, V. 2007. Provenance Semirings. In PODS, pages 31-40.
[15] Green, T. J., Karvounarakis, G., Ives, Z. G., and Tannen, V. 2010. Provenance in Orchestra. IEEE Data Eng. Bull., 33(3), pages 9-16.
[16] Karvounarakis, G., Tannen, V., and Ives, Z. G. 2010. Querying Data Provenance. In SIGMOD, pages 951-962.
[17] Karvounarakis, G., and Green, T. J. 2012. Semiring-Annotated Data: Queries and Provenance. SIGMOD Record, volume 41, issue 3, pages 5-14.
[18] Korolev, V., and Joshi, A. 2014. PROB: A Tool for Tracking Provenance and Reproducibility of Big Data Experiments. In Proceedings of the Workshop on Reproducible Research Methodologies (REPRODUCE '14).
[19] Rani, A., and Thalia, S. 2014. Knowledge Driven Decision Support System for Provenance Models in Relational Database. In Proc. ICDSE, pages 68-75.
[20] Sarma, A. D., Theobald, M., and Widom, J. 2008. Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. In ICDE, pages 1023-1032.
[21] Simmhan, Y. L., Plale, B., and Gannon, D. 2005. A Survey of Data Provenance in e-Science. SIGMOD Record, vol. 34, pages 31-36.
[22] Vicknair, C. 2010. A Comparison of a Graph Database and a Relational Database: A Data Provenance Perspective. In Proceedings of the 48th Annual Southeast Regional Conference, ACM SE '10, article 42.
[23] Widom, J. 2005. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In CIDR, pages 262-276.
[24] Wylot, M., Cudre-Mauroux, P., and Groth, P. 2014. TripleProv: Efficient Processing of Lineage Queries in a Native RDF Store. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14, pages 455-466.
[25] Wylot, M., Cudre-Mauroux, P., and Groth, P. 2015. Executing Provenance-Enabled Queries over Web Data. In Proceedings of the 24th International Conference on World Wide Web, WWW '15.
[26] Neo4j Graph Database. http://neo4j.com/developer/get-started/
[27] The TPC-H Benchmark. http://www.tpc.org/tpch/spec/tpch2.7.0.pdf
