NOR HIDAYAH BINTI ABDUL MANAF, AIDA HAZIRAH ABDUL HAMID, NUR
SUHADA NUZAIFA ISMAIL, SITI AYUNAZURA JAMALI@JAMANI, NURUL IDA
FARHANA ABDULL HADI
Contents
Topic 2: Big Data Quality
Introduction
Conclusion
References
Topic 2: Big Data Quality
Introduction
Data quality refers to the overall utility of a dataset(s) as a function of its ability to be
easily processed and analyzed for other uses, usually by a database, data warehouse, or data
analytics system. Quality data is useful data. To be of high quality, data must be consistent
and unambiguous. Data quality issues are often the result of database merges or
systems/cloud integration processes in which data fields that should be compatible are not
due to schema or format inconsistencies. Data that is not high quality can undergo data
cleansing to raise its quality.
There are four dimensions of big data quality: accuracy, timeliness, consistency and
completeness. Accuracy refers to the degree to which data are equivalent to their
corresponding real values (Ballou and Pazer, 1985). This dimension can be assessed by comparing values with external values that are known to be (or considered to be) correct
(Redman, 1996). A simple example would be a data record in a customer relationship
management system, where the street address for a customer in the system matches the street
address where the customer currently resides. In this case, the accuracy of the street address value in the system could be assessed by validating it against the shipping address on the most recent customer order. No problem context or value judgment of the data is needed: the value is either accurate or not, and its accuracy stands on its own.
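Where a verified external source is available, accuracy can be expressed as the share of records whose stored value matches the reference value. The following Python sketch illustrates this idea; the function name accuracy_ratio, the field name street_address, and the dictionary layout are hypothetical assumptions rather than anything taken from the sources cited above.

```python
def accuracy_ratio(system_records, reference_records, field="street_address"):
    """Share of records whose stored value matches an externally verified value.

    Both arguments are dicts keyed by customer ID; only IDs present in both
    sources are compared. Returns None when no comparison is possible.
    """
    shared_ids = set(system_records) & set(reference_records)
    if not shared_ids:
        return None
    matches = sum(
        1 for cid in shared_ids
        if system_records[cid].get(field) == reference_records[cid].get(field)
    )
    return matches / len(shared_ids)


# Hypothetical example: CRM addresses validated against the latest shipping addresses.
crm = {"C001": {"street_address": "12 Oak Lane"}, "C002": {"street_address": "7 Elm St"}}
orders = {"C001": {"street_address": "12 Oak Lane"}, "C002": {"street_address": "9 Elm St"}}
print(accuracy_ratio(crm, orders))  # 0.5
```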
Timeliness refers to the degree to which data are up-to-date. Research suggests that
timeliness can be further decomposed into two dimensions: (1) currency, or the length of time since the record's last update, and (2) volatility, which describes the frequency of updates (Blake and Mangiameli, 2011; Pipino et al., 2002; Wand and Wang, 1996). Data that are
correct when assessed, but updated very infrequently, may still hamper efforts at effective
managerial decision making (e.g., errors that occur in the data may be missed more often than
not with infrequent record updating, preventing operational issues in the business from being
detected early). A convenient example measure for calculating timeliness using values for
currency and volatility can be found in Ballou et al. (1998), p. 468, where currency is
calculated using the time of data delivery, the time it was entered into the system, and the age
of the data at delivery (which can differ from input time). Together, currency and volatility
measures are used to calculate timeliness.
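To make the measure concrete, the Python sketch below follows one common reading of the Ballou et al. (1998) formulation: currency is the age of the data plus the delay between system input and delivery, and timeliness is max(1 − currency/volatility, 0) raised to a sensitivity exponent. The parameter names, the time unit, and the default sensitivity of 1 are illustrative assumptions; the exact definitions should be checked against the original paper (p. 468).

```python
def currency(delivery_time, input_time, age_when_obtained=0.0):
    """Currency: how old the data are at the moment of delivery.

    Computed as the delay between system input and delivery plus the age the
    data already had when first obtained. All times share one unit (e.g. days).
    """
    return (delivery_time - input_time) + age_when_obtained


def timeliness(currency_value, volatility, sensitivity=1.0):
    """Timeliness in [0, 1]: max(1 - currency/volatility, 0) ** sensitivity.

    Volatility is how long the data remain valid; the sensitivity exponent
    controls how quickly usefulness decays as the data age.
    """
    return max(1.0 - currency_value / volatility, 0.0) ** sensitivity


# Hypothetical example: data already 1 day old at input, delivered 2 days later,
# and typically valid for 10 days.
c = currency(delivery_time=3.0, input_time=1.0, age_when_obtained=1.0)
print(timeliness(c, volatility=10.0))  # 0.7
```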
Consistency refers to the degree to which related data records match in terms of
format and structure. Ballou and Pazer (1985) define consistency as when the representation
of the data value is the same in all cases (p. 153). Batini et al. (2009) develop the notion of
both intra-relation and inter-relation constraints on the consistency of data. Intra-relation
consistency assesses the adherence of the data to a range of possible values (Coronel et al.,
2011), whereas inter-relation assesses how well data are presented using the same structure.
An example of this would be that a person, currently alive, would have for year of birth a possible value range of 1900–2013 (intra-relation constraint), while that person's record in two different datasets would, in both cases, have a field for birth year, and both fields would intentionally represent the person's year of birth in the same format (inter-relation constraint).
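A minimal Python sketch of the two constraint types described above: an intra-relation check that a birth-year value falls in an allowed range, and an inter-relation check that two datasets carry the field in the same four-digit format. The field names, range bounds, and regular expression are assumptions for illustration.

```python
import re


def intra_relation_ok(birth_year, low=1900, high=2013):
    """Intra-relation constraint: the value lies within the allowed range."""
    return low <= birth_year <= high


def inter_relation_ok(record_a, record_b, field="birth_year", pattern=r"^\d{4}$"):
    """Inter-relation constraint: both records carry the field and represent it
    in the same (four-digit year) format."""
    a, b = record_a.get(field), record_b.get(field)
    return (
        a is not None and b is not None
        and re.match(pattern, str(a)) is not None
        and re.match(pattern, str(b)) is not None
    )


# Hypothetical example: the same person in two different datasets.
dataset_a = {"name": "A. Customer", "birth_year": "1984"}
dataset_b = {"name": "A. Customer", "birth_year": "1984"}
print(intra_relation_ok(int(dataset_a["birth_year"])))  # True
print(inter_relation_ok(dataset_a, dataset_b))          # True
```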
Completeness refers to the degree to which data are full and complete in content, with
no missing data. This dimension can describe a data record that captures the minimally
required amount of information needed (Wand and Wang, 1996), or data that have had all
values captured (Gomes et al., 2007). Several types of completeness have been reported by Emran (2015), and methods to measure completeness have also been proposed by the author (Emran et al., 2013; Emran et al., 2014). Every field in the
data record is needed to paint the complete picture of what the record is attempting to
represent in the real world. For example, if a customer record includes a name and street
address, but no state, city, and zip code, then that record is considered incomplete. The
minimum amount of data needed for a correct address record is not present. A simple ratio of
complete versus incomplete records can then form a potential measure of completeness.
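The simple ratio mentioned above can be sketched in a few lines of Python; the required address fields and the function names are assumptions chosen to match the customer-record example.

```python
REQUIRED_FIELDS = ("name", "street_address", "city", "state", "zip_code")


def is_complete(record, required=REQUIRED_FIELDS):
    """A record is complete when every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in required)


def completeness_ratio(records, required=REQUIRED_FIELDS):
    """Simple ratio of complete records to all records; None for an empty input."""
    if not records:
        return None
    return sum(is_complete(r, required) for r in records) / len(records)


# Hypothetical example: one complete and one incomplete customer record.
customers = [
    {"name": "A. Customer", "street_address": "12 Oak Lane",
     "city": "Springfield", "state": "IL", "zip_code": "62701"},
    {"name": "B. Customer", "street_address": "7 Elm St"},  # missing city, state, zip
]
print(completeness_ratio(customers))  # 0.5
```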
Assessing data quality along these dimensions helps to ensure that the records in the dataset are as accurate, timely, complete, and consistent as is practical.
Data quality dimension | Description | Supply chain example
Consistency | Are the data presented in the same format? | All requested delivery dates are entered in a DD/MM/YY format
Completeness | Are necessary data missing? | Customer shipping address includes all data points necessary to complete a shipment (i.e. name, street address, city, state, and zip code)
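As a small illustration of the consistency example in the table, the sketch below flags requested delivery dates that do not follow a DD/MM/YY pattern; the regular expression and sample dates are assumptions, and the check does not validate calendar correctness.

```python
import re

DDMMYY = re.compile(r"^\d{2}/\d{2}/\d{2}$")


def format_violations(dates):
    """Return the dates that do not follow the DD/MM/YY pattern."""
    return [d for d in dates if not DDMMYY.match(d)]


# Hypothetical requested delivery dates: the second entry breaks the format rule.
print(format_violations(["05/03/21", "2021-03-07", "12/04/21"]))  # ['2021-03-07']
```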
When data is of excellent quality, it can be easily processed and analyzed, leading to
insights that help the organization make better decisions. High-quality data is essential to
business intelligence efforts and other types of data analytics, as well as better operational
efficiency.
1.3 Tools for big data cleaning
Drake
OpenRefine
DataWrangler
DataCleaner
Winpure Data Cleaning Tool
Because big data has the 4V characteristics, extracting high-quality, genuine data from massive, variable, and complicated data sets becomes an urgent issue when enterprises use and process big data. At present, big data quality faces the following challenges:
The diversity of data sources brings abundant data types and complex data
structures and increases the difficulty of data integration.
In the past, enterprises only used the data generated from their own business
systems, such as sales and inventory data. But now, data collected and analyzed by
enterprises have surpassed this scope. Big data sources are very wide, including:
1) data sets from the internet and mobile internet (Li & Liu, 2013);
software packages/modules, spreadsheets, and financial reports. The third is
structured data. Unstructured data accounts for more than 80% of the total amount of data in existence.
As for enterprises, obtaining big data with complex structures from different sources and effectively integrating them is a daunting task (McGilvray, 2008). There are conflicts and inconsistent or contradictory phenomena among data from different sources. When the data volume is small, the data can be checked by manual search or programming, or even by ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform). However, these methods are of little use when processing PB-level or even EB-level data volumes.
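For small data volumes, the kind of programmatic cross-source check mentioned above can be sketched as a brute-force comparison. The key field, record layout, and sample values below are hypothetical, and the approach is exactly the sort of small-scale check that, as noted above, does not scale to PB-level data.

```python
def find_conflicts(source_a, source_b, key_field="customer_id"):
    """Flag records present in both sources that disagree on any shared field.

    Each source is a list of dicts. This brute-force check is only practical
    for small data volumes.
    """
    index_b = {rec[key_field]: rec for rec in source_b}
    conflicts = []
    for rec_a in source_a:
        rec_b = index_b.get(rec_a[key_field])
        if rec_b is None:
            continue
        differing = {
            field: (rec_a[field], rec_b[field])
            for field in rec_a.keys() & rec_b.keys()
            if field != key_field and rec_a[field] != rec_b[field]
        }
        if differing:
            conflicts.append((rec_a[key_field], differing))
    return conflicts


# Hypothetical example: the sales system and the CRM disagree on a customer's city.
sales = [{"customer_id": "C001", "city": "Melaka"}]
crm = [{"customer_id": "C001", "city": "Kuala Lumpur"}]
print(find_conflicts(sales, crm))  # [('C001', {'city': ('Melaka', 'Kuala Lumpur')})]
```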
Data change very quickly and the timeliness of data is very short, which imposes higher requirements on processing technology.
Due to the rapid changes in big data, the timeliness of some data is very short. If companies cannot collect the required data in real time, or if processing the data takes too long, they may obtain outdated and invalid information. Processing and analysis based on such data will produce useless or misleading conclusions, eventually leading to decision-making mistakes by governments or enterprises. At present, real-time processing and analysis software for big data is still in the development or improvement phase; truly effective commercial products are few.
No unified and approved data quality standards have been formed in China or abroad, and research on the data quality of big data has only just begun.
Conclusion
References
http://searchdatamanagement.techtarget.com/definition/data-quality
https://www.informatica.com/services-and-training/glossary-of-terms/data-quality-definition.html#fbid=PAAEY1tABpg
http://datascience.codata.org/articles/10.5334/dsj-2015-002/
http://www.sciencedirect.com/science/article/pii/S0925527314001339
Emran, N.A. et al., 2013. Measuring Data Completeness for Microbial Genomics Database.
In ACIIDS 2013 Part 1. Lecture Notes in Computer Science. Springer.
Emran, N.A. et al., 2013. Reference Architectures to Measure Data Completeness across Integrated Databases. In ACIIDS 2013 Part 1. Springer-Verlag Berlin Heidelberg, pp. 216–225.
Emran, N.A., Embury, S. & Missier, P., 2014. Measuring Population-Based Completeness for Single Nucleotide Polymorphism (SNP) Databases. In J. Sobecki, V. Boonjing, & S. Chittayasothorn, eds. Advanced Approaches to Intelligent Information and Database Systems. Cham: Springer International Publishing, pp. 173–182.