
A Guide to Improving Data Integrity and Adoption

A Case Study in Verifying Usage Data

Jessica Roper

Beijing  Boston  Farnham  Sebastopol  Tokyo

A Guide to Improving Data Integrity and Adoption

by Jessica Roper

Copyright © 2017 O'Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-12-12: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97052-2
[LSI]

Table of Contents

A Guide to Improving Data Integrity and Adoption
    Validating Data Integrity as an Integral Part of Business
    Using the Case Study as a Guide
    An Overview of the Usage Data Project
    Getting Started with Data
    Managing Layers of Data
    Performing Additional Transformation and Formatting
    Starting with Smaller Datasets
    Determining Acceptable Error Rates
    Creating Work Groups
    Reassessing the Value of Data Over Time
    Checking the System for Internal Consistency
    Verifying Accuracy of Transformations and Aggregation Reports
    Allowing for Tests to Evolve
    Implementing Automation
    Conclusion
    Further Reading


A Guide to Improving Data Integrity and Adoption

In most companies, quality data is crucial to measuring success and planning for business goals. Unlike the sample datasets used in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know whether the data is accurate or whether we can trust final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?

Validating Data Integrity as an Integral Part of Business

Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and to parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as "bad ass," which is a positive phrase made with negative words. The system creating the data can also make it messy, because different languages have different expectations for design; Ruby on Rails, for example, requires a separate table to represent many-to-many relationships.

Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process it, such as YAML (YAML Ain't Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.

Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential. For example, the dataset in Table 1-1 is self-referential: each row has a parent ID representing the type or category of the row, and the value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a User Profile is stored in the same table, including labels for profile values, resulting in some values representing labels while others represent final values for those labels. The data in Table 1-1 shows that "Mexico" is a "Country," part of the "User Profile," because the parent ID of "Mexico" is 11, the ID for "Country," and so on. I've seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all "profile-like" things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.

Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

ID   Parent ID   Value
16   11          Mexico
11   9           Country
9    NULL        User Profile
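
To make the querying difficulty concrete, here is a minimal Python sketch (not from the original report) that resolves a row's full chain of labels by repeatedly looking up its parent ID; the rows mirror Table 1-1, and everything else about the structure is hypothetical.

    # Rows shaped like Table 1-1: (id, parent_id, value)
    rows = [
        (16, 11, "Mexico"),
        (11, 9, "Country"),
        (9, None, "User Profile"),
    ]

    by_id = {row_id: (parent_id, value) for row_id, parent_id, value in rows}

    def label_chain(row_id):
        """Walk parent IDs up to the root, collecting each level's value."""
        chain = []
        while row_id is not None:
            parent_id, value = by_id[row_id]
            chain.append(value)
            row_id = parent_id
        return chain

    print(label_chain(16))  # ['Mexico', 'Country', 'User Profile']

In SQL, the same walk requires either one self-join per level of depth or a recursive query, which is part of why this layout is hard to work with.
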

Data quality is important for a lot of reasons, chiefly that it's difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it's easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company's success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. Affecting data priority and quality will likely require some work to make the data more usable and reportable, and it will almost certainly require working with others within the organization.

Using the Case Study as a Guide


In this report, I will follow a case study from a large and critical data project at Spiceworks, where I've worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be "everything IT for everyone IT," bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to do research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more.

Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers. My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)

This case study demonstrates Spiceworks' process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I'll provide some quick tips to keep in mind when testing data, and then I'll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.

An Overview of the Usage Data Project


The case study, which I'll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks' products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way; how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.

The impetus for this project was partially due to company growth: Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory: to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as displaying ads or sending emails). When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like "How many total active users do we have across all of our products?" and "How many users are in each product?" It wasn't necessary to understand how each product's data worked. We also needed to be able to do analysis on cross-product adoption and usage.

The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).


When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by the data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created from the data collected. After we made the switch to this new way of collecting and managing data from the start (which was automatic and easy), more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.

Getting Started with Data


Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data needed was available in logs from Spiceworks' advertising service (see Figure 1-1), which is used to identify the target audiences a user qualifies for and, therefore, which set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page, and even a context change such as switching to a new tab, creates a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.

As Figure 1-1 demonstrates, each log entry offered a wealth of data from the web request that we scraped for further analysis, including the uniform resource locator (URL) of the page, the user who viewed it, the referrer of the page, the Internet Protocol (IP) address, and, of course, a time stamp to indicate when the page was viewed. We parsed these logs into structured data tables, appended more information (such as geography and other user profile information), and created aggregate data that could provide insights into product usage and cohort analysis.


Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)

Managing Layers of Data


There are three layers of data that are useful to keep in mind, each used differently and with different expectations (Figure 1-2). The first layer is raw, unprocessed data, often produced by an application or external process; for example, some raw data from the usage data study comes from products such as Spiceworks' cloud help desk, where users can manage IT tickets and requests, and our community, which is where users can interact online socially through discussions, product research, and so on. This data is in a format that makes sense for how the application itself works. Most often, it is not easily consumed, nor does it lend itself well to creating reports. For example, in the community, due to the frameworks used, we break apart different components and ideas of users and relationships so that email, subscriptions, demographics and interests, and so forth are all separated into many different components, but for analysis and reporting it's better to have these different pieces of information all connected. Because this data is in a raw format, it is more likely to be unstructured and/or somewhat random, and sometimes even incomplete.

Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)


The next layer of data is processed and structured following some format, usually created from the raw dataset. At this layer, compression can be used if needed; either way, the final format will be a result of general processing, transformation, and classification. To use and analyze even this structured and processed layer of data still usually requires deep understanding and knowledge, and it can be a bit more difficult to report on accurately. Deeper understanding is required to work with this dataset because it still includes all of the raw data, complete with outliers and invalid data, but in a formatted and consistent representation with classifications and so on.

The final layer is reportable data that excludes outliers, incomplete data, and unqualified data; it includes only the final classifications, without the raw source for the classification, allowing for segmentation and further analysis at the business and product levels without confusion. This layer is also usually built from the previous layer, processed and structured data. If needed, other products and processes using this data can further format and standardize it for their individual needs as well as apply further filtering.

Performing Additional Transformation and Formatting

The most frequent reasons additional transformation and formatting are needed are to improve performance for the analysis or report being created, to work with analysis tools (which can be quite specific as to how data must be formatted to work well), and to blend data sources together.

An example of a use case in which we added more filtering was to analyze changes in how different products were used and determine which changes had positive long-term effects. This analysis required further filtering to create cohort groups and ensure that the users being observed were in the ideal audiences for observation. Removing users unlikely to engage in a product from the analysis helped us to determine which features changed an engaged user's behavior.

In addition, further transformations were required. For example, we used a third-party business intelligence tool to feed in the data to analyze and filter final data results for project managers. One transformation we had to make was to create a summary table that broke out the categorization and summary data needed into columns instead of rows.

For a long time, a lot of the processed and compressed data at Spiceworks was developed and formatted in a way that was highly related to the reporting processes that would be consuming the data. This usually would be the final reporting data, but many of the reports created were fairly standard, so we could create a generic way for consumption. Then, each report applied filters and further aggregations on the fly. Over time, as data became more widely used and dynamically analyzed, as well as combined with different data sources, these generic tables proved to be difficult to use for digging deeper into the data and using it more broadly.

Frequently, the format could not be used at all, forcing analysts to go back to the raw, unprocessed data, which required a higher level of knowledge about the data if it were to be used at all. If the wrong assumptions were made about the data, or if the wrong pieces of data were used (perhaps some that were no longer actively updated), incorrect conclusions might have been drawn. For example, when digging into the structured data parsed from the logs, some of our financial analysts incorrectly assumed that the presence of a user ID (a generic, anonymous user identifier) indicated the user was logged in. However, in some cases we identified the user through other means and included flags to indicate the source of the ID. Because the team did not have a full understanding of these flags or the true meaning of the field they were using, they got wildly different results than other reports tracking only logged-in users, which caused a lot of confusion.

To be able to create new reports from the raw, unprocessed data, we blended additional sources and analyzed the data as a whole. One problem arose from different data sources having different representations of the same entities. Of course, this is not surprising, because each product team needed to have its own idea of users, and usually some sort of profile for those users. Blending the data required creating mappings and relationships among the different datasets, which of course required a deep understanding of those relationships and datasets. Over time, as data consumption and usage grew, we updated, refactored, and reassessed how data is processed and aggregated. Our protocol has evolved over time to fit the needs of our data consumption.


Starting with Smaller Datasets


A few things to keep in mind when you're validating data include becoming deeply familiar with the data, using small datasets, and testing components in isolation. Beginning with smaller datasets when necessary allows for faster iterations of testing before working on the full dataset. The sample data is a great place to begin digging into what the raw data really "looks like" to better understand how it needs to be processed and to identify which patterns are considered valid.

When you're creating smaller datasets to work with, it is important to try to be as random as possible but still ensure that the sample is large enough to be representative of the whole. I usually aim for about 10 percent, but this will vary between datasets. Keep in mind that it's important to include data over time, from varying geographical locations, and to include data that will be used for filtering, such as demographics. This understanding will define the parameters around the data needed to create tests.
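
As a rough illustration of that kind of sampling, here is a small Python sketch; it assumes the raw rows are already in a pandas DataFrame with a hypothetical viewed_at timestamp column, and it samples about 10 percent of each month so the sample stays spread out over time rather than clustered in one period.

    import pandas as pd

    def sample_for_testing(df: pd.DataFrame, frac: float = 0.10) -> pd.DataFrame:
        """Randomly sample roughly `frac` of the rows from each month."""
        months = df["viewed_at"].dt.to_period("M")
        return (
            df.groupby(months, group_keys=False)
              .apply(lambda month_rows: month_rows.sample(frac=frac, random_state=42))
        )

The same grouping idea extends to geography or demographic fields when those will drive filtering later.
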
For example, one of the Spiceworks products identifies computer manufacturer data that is collected in aggregate anonymously and then categorized and grouped for further analysis. This information is originally sourced from devices such as my laptop, which is a "MacBook Pro (Retina, 15-inch, mid-2014)" (Figure 1-3). Categorizing and grouping the data into a set for all MacBook Pros requires spending time understanding what kinds of titles are possible for Apple and MacBook in the dataset by searching through the data for related products. To really understand the data, however, it is important to also understand titles in their raw format to gain some insight into how they are aggregated and changed before being pushed into the dataset that is being categorized and grouped. Therefore, testing a data scrubber requires familiarity with the dataset and the source, if possible, so that you know which patterns and edge-case conditions to check for and how you should format the data.


Figure 1-3. Example of laptop manufacturing information

Determining Acceptable Error Rates


It's important to understand acceptable errors in data. This will vary between datasets but, overall, you want to understand what an acceptable industry standard is and understand the kinds of decisions that are being made with the data in order to determine the acceptable error rate. The rule of thumb I use is that edge-case issues representing less than one percent of the dataset are not worth a lot of time, because they will not affect trends or final analysis. However, you should still investigate all issues at some level to ensure that the set affected is indeed small or at least caused outside the system (e.g., by someone removing code that tracks usage because that person believed it did not do anything).

In some cases, this error rate is not obtainable or not exact; for example, some information we appended assigned sentiment (positive, negative, neutral) to posts viewed by users in the online forums that are part of the Spiceworks community. To determine our acceptable error rate, we researched sentiment analysis as a whole in the industry and found that the average accuracy rate is between 65 and 85 percent. We decided on a goal of a 25 percent error rate for posts with incorrect sentiment assigned because it kept us in the top half of accuracy levels achieved in the industry.


When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data ignored in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.
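
That budgeting is simple arithmetic, but it is worth writing down; the following sketch (with made-up numbers) treats the per-product error rates as worst-case disjoint errors and checks their sum against the overall budget.

    # Hypothetical per-product error rates, as fractions rather than percentages.
    per_product_error = {f"product_{i}": 0.001 for i in range(10)}  # 0.1% each

    overall_budget = 0.01  # 1 percent acceptable error for the combined dataset

    # In the worst case the errors do not overlap, so the combined rate is at
    # most the sum of the individual rates.
    combined_worst_case = sum(per_product_error.values())

    if combined_worst_case > overall_budget + 1e-9:  # small tolerance for float rounding
        raise ValueError(
            f"combined error {combined_worst_case:.4f} exceeds budget {overall_budget:.4f}"
        )
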
After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page's load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be maintained even during backups. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and in better estimating the impact the errors could have.

Examples are key to providing developers, or whoever is doing the data processing, with insight into the cause of the problem and a way to account for the issue or arrive at a solution. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.

Creating Work Groups


Achieving data integrity often requires working with other groups or development teams to understand and improve accuracy. In many cases, these other teams must do work to improve the systems. Data integrity and comparability become more important the higher up in an organization the data is used. Generally, lots of people or groups can benefit from good data, but they are not usually the ones who must create and maintain the data so that it stays good [4]. Therefore, the higher the level of support and usage of data (such as from managers and executives), the more accurate the system will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for the other teams helping to create and maintain that data. This does not mean that collection and accuracy aren't important before higher-level adoption of data, but without it, coordinating between teams and maintaining data can be a much longer and more difficult process.

One effective way to influence this coordination among team members is to consistently show metrics in a view that's relevant to the audience. For example, to show the value of data to a developer, you can compare usage of a product before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together so they can be used in a new way.

As an example, a few years ago at Spiceworks, each product (and even different features within some products) had individual definitions for categories. After weeks of work to consolidate the categories and create a new, more versatile way to manage and maintain them, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply it. The key to getting others to adopt the new system was showing its value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-tos and groups for our online forums.

There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization; very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now, the new system is widely used, and used for the potential we saw when building it initially several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.


Collaboration helped achieve confidence in the accuracy because each different application was fully adopting the new categorization, which then vetted individual product edge cases against the design and category set defined. In a few cases, the categorization system needed to be further refined to address those edge cases, such as to account for some software and hardware that needed to belong to more than one category.

Reassessing the Value of Data Over Time


The focus and value placed on data in a company evolve over time. In the beginning, data collection might be a lower priority than acquiring new clients and getting products out the door. In my experience, data consumption begins with the product managers and developers working on products who want to understand if and how their features and products are being used, or to help in the debugging process. It can also include monitoring system performance, and it grows quickly when specific metrics are set as goals for the performance of products.

After a product manager adopts data for tracking success and failure, the goal is to make that data reportable and sharable so that others can also view the data as a critical method of measuring success. As more product managers and teams adopt data metrics, those metrics can be shared and standardized. Executive-level adoption of data metrics is much easier with the data in a uniform and reportable format that can measure company goals.

If no parties are already interested in data and the value it can bring, this is a good opportunity to begin using data to track the success of products that you are invested in and to share the results with managers, teammates, and so on. If you can show value and success in the products, or prove that opportunity is being wasted, the data is more likely to be seen as valuable and as a metric for success.

Acting only as an individual, you can show others the value you can get out of data, and thereby push them to invest in it and use it for their own needs. The key is to show value and, when possible, make it easy for others to maintain and utilize the data. Sometimes this might require building a small tool or defining relationships between data structures that make the data easy to use and maintain.


An example of this is when we were creating the unified mechanism and process to categorize everything (products, groups, topics, vendors, and how-tos), including category hierarchy (recall the example introduced in the preceding section). The process required working with each team that created and consumed the data around categories to determine a core set that could be used and shared. I wanted to identify which pieces had unique requirements that we could account for. In some products, a category hierarchy was required to be able to identify relationships such as antivirus software, which was grouped under both the security and software categories. Other products, such as online forums for users to have discussions, did not have this hierarchy and therefore did not use that part of the feature. Creating default hierarchies and relationship inheritance, with easy user interfaces for management, made creating and maintaining this data easier for others.

When it became vital to connect different products, improve search engine optimization, and provide more complete visibility between products, the work done years earlier to make categories easy to define and maintain was recognized as incredibly valuable.

Checking the System for Internal Consistency


Using the usage data case study, I will talk through the process we used at Spiceworks for checking each part of the system for internal and external consistency, including examples of why, when, and how this was done (see Figure 1-4). We began with validation of the raw data to ensure that the logging mechanism and applications created the correct data. Next, we validated parsing the data and appending information from other sources into the initial format. After that, we verified that transformations and aggregation reports were accurate. Finally, we vetted the data against other external and internal data sources and used trends to better understand the data and add monitors to the systems to maintain accuracy. Let's look at each of these steps in more detail.

First, we tested the raw data itself. This step is nearly identical to most other software testing, but in this instance the focus is on ensuring that different possible scenarios and products all produce the expected log entries. This step is where we became familiar with the source data, how it was formatted, and what each field meant in a log entry.


Figure 1-4. UDP workflow (source: Jessica Roper and Brian Johnson)

We included tests such as creating page views from several IP addresses, from different locations, and from every product and critical URL path. We also used grep (a Unix command used to search files for occurrences of a string of characters that matches a specified pattern) to search the logs for more URLs, locations, and IP addresses to smoke-test for data completeness and understand how different fields were recorded. During this investigation, we saw that several fields had values that looked like IP addresses. Some of them actually had multiple listings, so we needed to investigate further to learn what each of those fields meant. In this case, it required further research into how event headers are set and what the non-simple results meant.
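
A lightweight way to do this kind of smoke test outside of grep is to count field values directly; the sketch below is illustrative only and assumes a hypothetical tab-delimited layout of timestamp, IP, user ID, URL, and referrer, which is not the actual format of the advertising service logs.

    import re
    from collections import Counter

    ip_pattern = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
    url_counts, ip_counts, odd_ip_values = Counter(), Counter(), Counter()

    # Assumed layout: timestamp \t ip \t user_id \t url \t referrer
    with open("ad_service.log") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # malformed line; worth counting separately in practice
            _, ip, _, url, _ = fields[:5]
            url_counts[url] += 1
            ip_counts[ip] += 1
            if not ip_pattern.match(ip):
                odd_ip_values[ip] += 1  # multiple IPs, hostnames, and so on

    print(url_counts.most_common(20))
    print(odd_ip_values.most_common(20))
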


When you cannot formally test or validate the raw data to understand the flow of the system, try to understand as much about the raw data as possible, including the meaning behind different components and how everything connects together. One way to do this is to observe the data relationships and determine outlier cases by understanding boundaries. For example, if we had been unable to observe how the logs were populated, my search would have included looking at counts for different fields.

Even with source data testing, we manually dug through the logs to investigate what each field might mean and defined a "normal" request. We also looked at the counts of IP addresses to see which might have had overly high counts or any other anomalies. We looked for cases in which fields were null to understand why and when this occurred and to understand how different fields seemed to correlate. This raw data testing was a big part of becoming familiar with the data and defining parameters and expectations.

When digging in, you should try to identify and understand the relationships between components; in other words, determine which relationships are one-to-one versus one-to-many or many-to-many between different tables and models. One relationship that was important to keep in mind during the UDP was the many-to-many relationship between application installations and users. Users can have multiple installations, and multiple users use each installation, so overlap must be accounted for in testing. Edge-case parameters of the datasets are also important to understand, such as determining the valid minimum and maximum values for different fields.

Finally, in this step you want to understand what is common for the dataset, essentially identifying the "normal" cases. When investigating the page views data for products, the normal case had to be defined for each product separately. For example, in most cases users were expected to be logged in, but in a couple of our products, being a visitor was actually more common; so, the normal use case for those products was quite different.

Validating the Initial Parsing Process


After we felt confident that the raw data was reliable, it was time to validate the initial parsing process that converted unstructured logs into data tables. We began by validating that all of the data was present. This included ensuring that all of the log files we expected did in fact exist (there should be several for each day) as well as checking that the number of log entries matched the total rows in the initial table.

For this test, two senior developers each created a complex grep regular expression to search the logs. They then had both searches compared and reviewed by other senior staff members, working together to define the best clause. One key part of the exercise included determining which rows were included in one expression's results but not in the other's, and then using those to determine the best patterns to employ.

We also dug into any rows the grep found that were not in the table, and vice versa. The key goal in this first set of tests was to ensure that the source data and final processing matched before filtering out invalid and outlier data.
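
As a sketch of the completeness check, the count of raw entries per daily log file can be compared against the row count per day in the parsed table; the file layout and the source of the table counts here are hypothetical.

    import glob
    import gzip

    def raw_log_counts(pattern="logs/2016-12-*.log.gz"):
        """Count raw log entries per daily file (one gzipped file per day assumed)."""
        counts = {}
        for path in sorted(glob.glob(pattern)):
            day = path.split("/")[-1].split(".")[0]  # e.g., "2016-12-01"
            with gzip.open(path, "rt") as handle:
                counts[day] = sum(1 for _ in handle)
        return counts

    def report_mismatches(raw_counts, parsed_counts):
        """Compare raw log entries to parsed table rows, day by day.
        `parsed_counts` would come from a query against the parsed table."""
        for day in sorted(set(raw_counts) | set(parsed_counts)):
            raw, parsed = raw_counts.get(day, 0), parsed_counts.get(day, 0)
            if raw != parsed:
                print(f"{day}: {raw} log entries vs {parsed} parsed rows")
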
For this project, tests included comparing totals such as unique user identification counts, validating that the null data found was acceptable, checking for duplicates, and verifying proper formats of fields such as IP addresses and the unique application IDs that are assigned to each product installation. This required an understanding of which data pieces are valid when null, how much data could be tolerated with null fields, and what is considered valid formatting for each field.

In the advertising service logs, some fields should never be null, such as IP address, timestamp, and URL. This set of tests identified cases that excluded IP addresses completely by incorrectly parsing the IP set. This prompted us to rework the IP parsing logic to get better-quality data. Some fields, such as user ID, were valid when null in some cases; only some applications require users to be logged in, so we were able to use that information as a guide. We validated that a user ID was present for all products and URLs that are accessible only by logged-in users. We also ensured that visitor traffic came only from applications for which login was not required.
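
The null rules described above boil down to a handful of assertions; here is a sketch in pandas with hypothetical column names and a set of products where visitors are allowed, rather than the actual schema.

    import pandas as pd

    def check_null_rules(parsed: pd.DataFrame, visitor_products: set) -> list:
        """Return human-readable violations of the null rules."""
        problems = []

        # These fields should never be null in any row.
        for column in ("ip_address", "logged_at", "url"):
            null_count = int(parsed[column].isna().sum())
            if null_count:
                problems.append(f"{null_count} rows have a null {column}")

        # user_id may be null only for products where visitors are allowed.
        login_required = ~parsed["product"].isin(visitor_products)
        bad = int((parsed["user_id"].isna() & login_required).sum())
        if bad:
            problems.append(f"{bad} rows are missing user_id in login-only products")

        return problems
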
You should carry out each test in isolation to ensure that your test results are not corrupted by other values being checked. This helps you to find smaller edge cases, such as small sections of null data. For our purposes at Spiceworks, by individually testing each classification of data and each column of data, we found small edge cases for which most of the data seemed normal but a single field was invalid in a small but significant way. All relationships existed for the field, but when examined at the individual level, we saw that the field was invalid in some cases, which led us to find some of the issues described later, such as invalid IP addresses and a small but significant number of development installations that needed to be excluded because they skewed the data with duplicate and invalid results.

Checking the Validity of Each Field


After the null checks were complete, it was time to check the validity of each field to determine whether the data was structured correctly and all the data was valid. After working with the team that built the logger as well as some product teams, we defined proper formats for fields and verified that the unique user IDs were correct. Another field that was included was the IP address, where we looked for private or invalid (unassigned) IPs. After we identified some IP addresses as invalid, we dug into the raw log results and determined that some of the invalid addresses were a result of how the request was forwarded and recorded in the headers.

We improved parsing to grab only the most valid IP and added a flag to the data to indicate data from those IPs that could be filtered out in reporting. We also checked for URLs and user IDs that might not be valid. For example, our user IDs are integers, so a quick validity test included getting the minimum and maximum values of user IDs found in the logs after parsing to ensure that they fell within the user-base size and that all user IDs were numerical. One test identified invalid unique application IDs.

Upon further digging, we determined that almost all of the invalid application IDs were prerelease versions, usually from the Spiceworks IP address. This led us to decide to exclude any data from our IP and development environments because that data was not valuable for our end goals. By identifying the invalid data and finding the pattern, we were able to determine the cause and ultimately remove the unwanted data so that we could gain better insights into the behavior of actual users.

Next, we checked for duplicates; usually there will be specific sets of fields that should be unique even if other values in the row are not. For example, every timestamp should be unique because parallel logging mechanisms were not used. After the raw logs were parsed into table format, more information was appended and the tables were summarized for easier analysis. One of the goals for these more processed tables, which included any categorizations and final filtering of invalid data such as that from development environments, was to indicate total page views of a product per day by user. We verified that rows were unique across user ID, page categorization, and date.

Usually at this point we had the data in some sort of database, which allowed us to do this check by simply writing a query that selected and counted the columns that should be unique and ensuring that the resulting counts all equal 1. As Figure 1-5 illustrates, you can also run this check in Excel by using the Remove Duplicates action, which will report the number of duplicate rows that are removed. The goal is for zero rows to be removed, showing that the data is unique.

Figure 1-5. Using Excel to test for duplicates (source: Jessica Roper)
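
The query-based version of that duplicate check translates directly into a GROUP BY with a HAVING clause, or into pandas; the sketch below uses the unique key from this project (user ID, page categorization, and date) with hypothetical column names.

    import pandas as pd

    def find_duplicates(summary: pd.DataFrame,
                        key=("user_id", "page_category", "view_date")) -> pd.DataFrame:
        """Group by the columns that should be unique and return any groups with
        more than one row; an empty result means the data is unique."""
        counts = summary.groupby(list(key)).size().reset_index(name="row_count")
        return counts[counts["row_count"] > 1]
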

Verifying Accuracy of Transformations and Aggregation Reports

Because not all of the data needed was already in the logs, we added more to the parsed logs and then translated and aggregated the results so that everything we wanted to filter and report on was included. Appended information, aggregations, and other transformations require more validation: anything converted or categorized needs to be individually checked.


In the usage data project, we converted each timestamp to a date and had to ensure that the timestamps were converted correctly. One way we did this was to manually find the first and last log entries for a day in the logs and compare them to the first and last entries in the parsed data tables. This test revealed an issue with a time zone difference between the logs and the database system, which shifted and excluded results for several hours. To account for this, we processed all logs for a given day as well as the following day and then filtered the results by date after adjusting for the time zone difference.
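
A sketch of the timezone-aware conversion is shown below; it assumes the logs carry UTC timestamps while reporting dates follow the database's local zone, with America/Chicago used purely as an illustration.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    REPORTING_TZ = ZoneInfo("America/Chicago")  # illustrative choice of zone

    def log_timestamp_to_report_date(epoch_seconds: float):
        """Convert a UTC log timestamp to the date used for reporting,
        applying the timezone shift before truncating to a date."""
        utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
        return utc_time.astimezone(REPORTING_TZ).date()

    # Entries shortly after midnight UTC land on the previous local day, which is
    # why logs for the following day were also processed and then filtered by date.
    print(log_timestamp_to_report_date(1481500800))  # 2016-12-12 00:00 UTC -> 2016-12-11
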
We also validated all data appended to the original source. One piece of data that we appended was location information for each page view, based on the IP address. To do this, we used a third-party company that provides an application programming interface (API) to correlate IP addresses with location data, such as country, for further processing and geographical analysis. For each value, we verified that the source of the appended data matched what was found in the final table. For example, we ensured that the country and user information appended was correct by comparing the source location data from the third party and the user data to the final appended results. We did this by joining the source data to the parsed dataset and comparing values.

For the aggregations, we checked that raw row counts from the parsed advertising service log tables matched the sum of the aggregate values. In this case, we wanted to roll up our data by pages viewed per user in each product, requiring validation that the total count of rows parsed matched the summary totals stored in the aggregate table.
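
As an illustration of that reconciliation, the sketch below compares a raw page-view table to its per-user, per-product, per-day rollup; the DataFrames and column names are hypothetical.

    import pandas as pd

    def check_rollup(page_views: pd.DataFrame, rollup: pd.DataFrame) -> None:
        """The rollup's view counts should sum back to the raw row count,
        both overall and within each product."""
        assert len(page_views) == rollup["view_count"].sum(), "overall totals differ"

        raw_by_product = page_views.groupby("product").size()
        rolled_by_product = rollup.groupby("product")["view_count"].sum()
        diff = raw_by_product.subtract(rolled_by_product, fill_value=0)
        mismatched = diff[diff != 0]
        if not mismatched.empty:
            print("Products whose rollup does not match the raw rows:")
            print(mismatched)
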
Part of the UDP required building aggregated data for the reporting layer, generically customized for the needs of the final reports. In most cases, consumers of data (individuals, other applications, or custom tools) will need to transform, filter, and aggregate data for the unique needs of the report or application. We created this transformation for them in a way that allowed the final product to easily filter and further aggregate the data (in this case, we used a business intelligence tool). Those final transformations also required validation for completeness and accuracy, such as ensuring that any total summaries equal the sum of their parts, that nothing was double-counted, and so on.


The goal for this level of testing is to validate that aggregate and appended data has as much integrity as the initial dataset. Here are some questions that you should ask during this process:

• If values are split into columns, do the columns add up to the total?
• Are any values negative that should not be, such as a calculated "other" count?
• Is all the data from the source found in the final results?
• Is there any data that should be filtered out but is still present?
• Do appended counts match totals of the original source?

As an example of the last point, when dealing with the Spiceworks advertising service, there are a handful of sources and services that can make a request for an ad and cause a log entry to be added. Different kinds of requests included new organic page views, ads refreshing automatically, and requests from pages with ad-block software. One test we included checked that the total requests equaled the sum of the different request types. As we built this report and continued to evolve our understanding of the data and the requirements for the final results, the reportable tables and tests also evolved. The test process itself helped us to define some of this when outliers or unexpected patterns were found.

Allowing for Tests to Evolve


It is common for tests to evolve as more data is introduced and consumed, and therefore better understood. As specific edge cases and errors are discovered, you might need to add more automation or processes. One such case I encountered was caused by the fact that every few years there are 53 weeks in the calendar year. This extra week (it is approximately half a week, actually) results in 5 weeks in December and 14 weeks in the last quarter. When this situation occurred for the first time after we built our process, the reporting for the last quarter of the year as well as for the following quarter was incorrect. When the issue and its cause were discovered, special logic for the process and new test cases were added to account for this unexpected edge case.


For scrubbing transformations or clustering of data, your tests should search through all unique possible options, filtering one piece at a time. Look for under folding, whereby data has not been clustered or grouped with everything it should have been, and over folding, whereby things are over-grouped or categorized where they should not be [2]. Part of the aggregations for this project required us to classify URLs based on the different products they were a part of.

To test and scrub these, we first broke apart the required URL components to ensure that all variations were captured. For example, one of the products that required its own category was an app center where users can share and download small applications or plug-ins for our other products. To test this, we began by searching for all URLs that had "app" and "center" in the URL. We did not require "app center," "app%center," or other combined variations, because we wanted to make no assumptions about the format of the URL. By searching in this more generic way, we were able to identify many URLs with formats of "appcenter," "app-center," and "app center."

Next, we looked for URLs that matched only part of the string. In this case, we found the URL /apps by looking for URLs that had the word "app" but not "center." This test identified several URLs that looked similar to other app center URLs but, after further investigation, were found to be part of another product. This allowed us to add automated tests that ensured those URLs were always categorized correctly and separately. Categorizing this data required using the acceptable error rate to decide what should be used to create the logic. In this case, we did not need to focus on getting down and dirty with our long tail, which was usually thousands of pages with only a handful of views each. A few page views account for well below one thousandth of a percent and would provide virtually no value even if scrubbed. Most of the time those URLs are still incorporated by the other logic created.
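
A sketch of that scrub is shown below; the loose pattern keeps any URL containing both words in any form, and the near-miss bucket surfaces URLs such as /apps that contain only one of them. The classification rule is illustrative, not the production logic.

    import re
    from collections import Counter

    HAS_APP = re.compile(r"app", re.IGNORECASE)
    HAS_CENTER = re.compile(r"center", re.IGNORECASE)

    def scrub_app_center(urls):
        """Split URLs into matches (both words, however joined) and near misses
        (only one word), to spot under- and over-folding."""
        matches, near_misses = Counter(), Counter()
        for url in urls:
            has_app, has_center = bool(HAS_APP.search(url)), bool(HAS_CENTER.search(url))
            if has_app and has_center:
                matches[url] += 1        # candidate app center pages
            elif has_app or has_center:
                near_misses[url] += 1    # for example /apps, which belongs elsewhere
        return matches, near_misses

    matches, near = scrub_app_center(["/appcenter/widgets", "/app-center", "/apps", "/pages/1"])
    print(sorted(matches), sorted(near))
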

Checking for External Consistency: Analyzing and Monitoring Trends

The last two components of data validation, vetting data with trend analysis and monitoring, are the most useful for determining data reliability and helping to ensure continued validity. This layer is heavily dependent on the kind of data that is going to be reported on and what data is considered critical for any analysis. It is part of maintaining and verifying the reportable data layer, especially when data comes from external or multiple sources.

First is vetting data by comparing what was collected to other related data sources to comprehensively cover known boundaries. This helps to ensure that the data is complete and that other data sources correlate and agree with the data being tested.

Of course, other data sources will not represent the exact same information, but they can be used to check things such as whether trends over time match, whether total unique values are within the bounds of expectations, and so on. For example, in Spiceworks' advertising service log data, there should not be more active users than the total users registered. Even further, active users should not be higher than the total users that have logged in during the time period. The goal is to verify the data against any reliable external data possible.
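
Bounds like these reduce to simple assertions once the totals are in hand; a small sketch with made-up numbers:

    def check_active_user_bounds(active_users: int,
                                 logged_in_users: int,
                                 registered_users: int) -> list:
        """Active users must not exceed users who logged in during the period,
        which in turn must not exceed total registered users."""
        problems = []
        if active_users > logged_in_users:
            problems.append("more active users than users who logged in")
        if logged_in_users > registered_users:
            problems.append("more logged-in users than registered users")
        return problems

    print(check_active_user_bounds(active_users=52_000,
                                   logged_in_users=60_000,
                                   registered_users=400_000))  # []
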
External data could come from a source such as Google Analytics, which is a reliable source for page views, user counts, and general usage, with some data available for free. We used the external data available to us to compare total active users and page views over time for many products. Even public data such as general market-share comparisons is a good option: compare sales records to product usage, active application counts, associated users to total users, and so on.

Checking against external sources is just a different way to think about the data and other data related to it. It provides boundaries and expectations for what the data should look like and for edge-case conditions that might be encountered. Some things to include are comparing total counts, averages, and categories or classifications. In some cases, the counts from the external source might be summaries or estimates, so it's important to understand those values as well to determine whether inconsistencies among datasets indicate an error.

For the UDP, we were fortunate to have many internal sources of data that we used to verify that our data was within expected bounds and matched trends. One key component is to compare trends. We compared data over time for unique user activity, total page views, and active installations (and users related to those installations) and checked our results against available data sources (a hypothetical example is depicted in Figure 1-6).


Figure 1-6. Hypothetical example of vetting data trends (source: Jessica Roper and Brian Johnson)

We aimed to answer questions such as the following:

• Does the number of active users over time correlate to the unique users seen in our other stats?
• Do the total page views correlate to the usage we see from users in our installation stats?
• Can all the URLs we see in our monitoring tools and request logs be found in the new dataset?

During this comparison, we found that several pages were not being tracked with the system. This led us to work with the appropriate development teams to start tracking those pages, and we determined what the impact would be on the final analysis.

The total number of active users for each product was critical for reporting teams and project managers. During testing, we found that some products only had data available to indicate the number of active installations and the total number of users related to each installation. As users change jobs and add installations, they can be a user in all of those applications, making the user-to-installation relationship many-to-many. Some application users also misunderstood the purpose of adding new users, which is meant to be for all IT pros providing support to the people in their companies. However, in some cases an IT pro might add not only all other support staff but also all end users, who won't actually use the application and therefore are never active on the installation itself. In this circumstance, they were adding the set of end users they support, but those end users are neither expected to interact directly with the application nor considered official users of it. We wanted to define which assumptions were testable in the new dataset from the ad service. For example, at least one user should be active on every installation; otherwise no page views could be generated. Also, there should not be more active users than the total number of users associated with an installation.

After we defined all the expectations for the data, we built several tests. One tested that each installation had fewer active users than the total associated with it. More important, however, we also tested that the total active users trended over time consistently with the trends for total active installations and total users registered to the application (Figure 1-6). We expected the trend to be consistent between the two values and to follow the same patterns of usage for time of day, day of week, and so on. The key to this phase is having as much understanding as possible of the data, its boundaries, how it is represented, and how it is produced so that you know what to expect from the test results. Trends usually will match generally, but in my experience, it's rare for them to match exactly.

Performing Time-Series Analyses


The next step is time-series analysis: understanding how the data behaves over time and what makes sense for the dataset. Time-series analysis provides the insights needed to monitor the system over the long term and to validate data consistency. This sort of analysis also verifies the data's accuracy and reliability. Are there large changes from month to month or week to week? Is change consistent over time?

One way to verify whether a trend makes sense is by looking for expected anomalies such as new product launch dates causing spikes, holidays causing a dip, known outage times, and expected low-usage periods (e.g., 2 AM). A hypothetical example is provided in Figure 1-7. This can also help identify other issues such as missing data or even problems in the system itself. After you understand trends and how they change over time, you might find it helpful to implement alerts that ensure the data fits within expected bounds. You can do this by checking for thresholds being crossed, or by verifying that new updates to the dataset grow or decline at a rate that is similar to the average seen across the previous few datasets.


Figure 1-7. Hypothetical example of page view counts over time vetting (source: Jessica Roper and Brian Johnson)

For example, in the UDP, we looked at how page views by product changed over time by month and compared that to the growth of the most recent month. We verified that the change we saw from month to month as new data arrived was stable over time and that dips or spikes were seen only when expected (e.g., when a new product was launched). We used the average over several months to account for anomalies caused by months with several holidays during the week. We wanted to identify thresholds and data-existence expectations.

During this testing, we found several issues, including a failing log copy process and products that had stopped sending data up to the system. This test verified that each product was present in the final dataset. Using this data, we were able to identify a problem with ad server tracking in one of our products before it caused major problems. This kind of issue was previously difficult to detect without time-series analysis.
We knew the number of active installations for different products and the total users associated with each of those installations, but before the new data source was created we could not determine which users were actually active. To validate the new data and these active user counts, we ensured that the total number of users we saw making page views in each product was at least as high as the total number of installations, but lower than the total number of associated users, because not all users in an installation would be active.

Putting the Right Monitors in Place


The time-series analysis was key to identifying the kinds of monitors needed, such as ones for user and client growth. It also identified the general usage trends to expect, such as average page views per product. Monitors are used to test new data being appended to and created by the system going forward; one-time or single historical reports do not require monitoring. One thing we had to account for when creating monitors was traffic changes throughout the week, such as significant drops on the weekends. A couple of trend complications we had to deal with were weeks that contain holidays and general annual trends, such as drops in traffic in December and during the summer. It is not enough to verify that a month looks similar to the month before it, or that a week has similar data to the week before; we also had to build a list of known holidays so that indicators could be added to those dates when monitors were triggered, and we had to compare averages over a reasonable span of time.
It is important to note that we did not allow holidays to mute errors; instead, we added indicators and high-level data trend summaries to the monitor alerts, which allowed us to easily determine whether an alert could be ignored.
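One way to attach that kind of indicator is sketched below; the holiday list, metric name, and alert format are made up for illustration.

    # Sketch of adding holiday context to a monitor alert instead of muting it.
    # The holiday list and alert format here are illustrative assumptions.
    from datetime import date

    KNOWN_HOLIDAYS = {
        date(2016, 11, 24): "Thanksgiving",
        date(2016, 12, 26): "Christmas (observed)",
    }

    def build_alert(metric_name, day, observed, expected):
        message = (f"{metric_name} on {day}: observed {observed:,}, "
                   f"expected about {expected:,}")
        holiday = KNOWN_HOLIDAYS.get(day)
        if holiday:
            # Keep the alert, but flag the likely explanation so a reviewer
            # can quickly decide whether it is safe to ignore.
            message += f" [holiday: {holiday}]"
        return message

    print(build_alert("total_page_views", date(2016, 11, 24), 410_000, 620_000))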
Some specific monitors we added included looking at total page views over time and ensuring that the total was close to the average over the previous three months. We also added the same monitors for the total page views of each product and category, which verified that all categories collected data consistently. This also ensured that issues in the system creating the data were monitored, and that changes such as the accidental removal of tracking code would not go unnoticed.
Other tests looked at these same trends, in total and by category, for registered users and visitors to ensure that tracking around users remained consistent. We added many tests around users because knowing the active users and their demographics was critical to our reporting. The main function of monitors is to ensure that critical data continues to have the required integrity. A large change in a trend is an indicator that something might not be working as expected somewhere in the system. A good rule of thumb for what counts as a large change is when the data in question falls outside one to two standard deviations from the average. For example, we found one application that collected the expected data for three months while in beta, but when the final product was deployed, the tracking was removed. Our monitors discovered this issue by detecting a drop in total page views for that product category, allowing us to dig in and correct the issue before it had a large impact.
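A bare-bones version of that rule of thumb might look like the sketch below; the two-standard-deviation cutoff follows the guidance above, and the sample numbers are invented.

    # Sketch of the "large change" rule of thumb: flag a new value that falls
    # more than two standard deviations from the historical average.
    from statistics import mean, stdev

    def is_large_change(history, new_value, num_deviations=2):
        average = mean(history)
        spread = stdev(history)
        return abs(new_value - average) > num_deviations * spread

    weekly_views = [98_000, 101_500, 99_200, 102_300, 100_100]
    print(is_large_change(weekly_views, 72_000))  # True: worth investigating
    print(is_large_change(weekly_views, 99_000))  # False: within normal variation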
We also added other monitors that do not focus heavily on trends over time. Rather, they ensured that we would see the expected number of total categories and that the directory containing all the files being processed had the minimum number of expected files, each with the minimum expected size. This was deemed critical because we found one issue in which some log files were not properly copied for parsing, and therefore significant portions of data were missing for a day. Missing even a few hours of data can have large effects on different product results, depending on which part of the day is missing. These monitors helped us ensure that data copy processes and sources were updated correctly, and they provided high-level checks that the system was being maintained.
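A simple version of that file check might look like the following; the directory path, file count, and size threshold are placeholders rather than the real UDP settings.

    # Sketch of a pre-processing check that the log directory contains at
    # least the expected number of files and that none are suspiciously small.
    import os

    def check_log_directory(path, min_files=24, min_bytes=1_000_000):
        problems = []
        files = [os.path.join(path, name) for name in os.listdir(path)]
        if len(files) < min_files:
            problems.append(f"only {len(files)} files found, expected {min_files}")
        for file_path in files:
            size = os.path.getsize(file_path)
            if size < min_bytes:
                problems.append(f"{file_path} is only {size} bytes")
        return problems

    # Hypothetical usage; the path is a placeholder:
    # print(check_log_directory("/data/logs/2016-12-01"))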
As with other testing, the monitors can change over time. In fact, we did not start out with a monitor to ensure that all the files being processed were present and of the correct sizes. That monitor was added when we discovered missing data after running a very long process. When new data or data processes are created, it is important to treat them skeptically until no new issues or questions have been found for a reasonable amount of time. What counts as reasonable usually depends on how the processed data is consumed and used.
Much of the data I work with at Spiceworks is produced and analyzed monthly, so we monitor the system closely, and largely manually, until the process has run fully and successfully for several months. This includes working closely with our analysts as they work with the data to find any potential issues or remaining edge cases. Anytime we found a new issue or an unexpected change, a new monitor was added. Monitors were also updated over time to be more tolerant of acceptable changes. Many of these monitors were less about the system itself (there are different kinds of tests for that) and more about data integrity and reliability.
Finally, another way to monitor the system is to provide end users
with a dead-easy way to raise an issue the moment an inaccuracy is
discovered, and, even better, let them fix it. If you can provide a tool that allows users both to report data issues and to make corrections, the data will be able to mature and be maintained more effectively.
One tool we created at Spiceworks helped maintain how different products are categorized. We provided a user interface with a database backend that allowed interested parties to update the classifications of URLs. This created a way to dynamically update and maintain the data without requiring code changes or manual updates.
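The idea behind that tool can be sketched with a small lookup table; the schema, patterns, and categories below are invented for illustration and are not the actual Spiceworks implementation.

    # Sketch of the idea: keep URL-to-category mappings in a database table
    # that a UI can edit, so reclassifying a URL never requires a code change.
    # The schema and sample values are illustrative assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE url_categories (url_pattern TEXT PRIMARY KEY, category TEXT)")
    conn.execute("INSERT INTO url_categories VALUES ('/community/%', 'forums')")
    conn.execute("INSERT INTO url_categories VALUES ('/tools/%', 'apps')")

    def categorize(url):
        row = conn.execute(
            "SELECT category FROM url_categories WHERE ? LIKE url_pattern", (url,)
        ).fetchone()
        return row[0] if row else "uncategorized"

    print(categorize("/community/how-to"))  # forums
    print(categorize("/pricing"))           # uncategorized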
Yet another way we did this was to incorporate regular communication and meetings with all of the users of our data, including our financial planning teams, business analysts, and product managers. We spent time understanding how the data would be used and what the end goals were for those using it. In every application, we included a way to give feedback on each page, usually through a form that captures all of the page's details. Anytime the reporting tool did not have enough data for a user, we provided an easy way to connect with us directly to help obtain the necessary data.

Implementing Automation
At each layer of testing, automation can help ensure the long-term reliability of the data and quickly identify problems during development and process updates. This can include unit tests, trend alerts, or anything in between. Automated tests are especially valuable for products that change frequently or require heavy monitoring.
In the UDP, we automated almost all of the tests around transformations and aggregations, which allowed for shorter test cycles while iterating on the process and provided long-term stability monitoring of the parsing process in case anything changes in the future or a new system needs to be tested.
Not all tests need to be automated or created as monitors. To determine which tests should be automated, I try to focus on three areas:
• Overall totals that indicate system health and accuracy
• Edge cases that have a large effect on the data
• How much effect code changes can have on the data

There are four general levels of testing, and each level generally describes how the tests are implemented:
Unit
These tests focus on single, complete components in isolation.
Integration
Integration tests focus on two components working together to build a new or combined dataset.
System
Tests at this level verify the infrastructure and the overall process as a whole.
Acceptance
Acceptance tests validate that data is reasonable before it is published or appended to existing datasets.
In the UDP, because having complete sets of logs was critical, a separate system-level test was created to run before the rest of the process to ensure that data for each day and hour could be identified in the log files. This approach further ensured that critical and difficult-to-find errors would not go unnoticed. Other tests focused on the transformations between stages of the data, such as comparing the initially parsed logs as well as aggregate counts of users and total page views. Some tests, such as categorization verification, were done only manually, because most changes to the process should not affect this data and any change in categorization would require more manual testing either way. Different tests require different kinds of automation; for example, we created an automated test to validate the final reporting tables, which included a column for total impressions as well as a breakdown by impression type (whether the impression was caused by a new page view, an ad refresh, and so on). This test was implemented as a unit test to ensure that, at a low level, the total was equal to the sum of the page view types.
Another unit test involved creating samples for the log parsing logic, including edge cases as well as both common and invalid examples. These were fed through the parsing logic after each change to it as we discovered new elements of the data. One integration test included in the automation suite ensured that country data from the third-party geographical dataset was present and valid. The automated tests for data integrity and reliability using monitors and trends were run at the acceptance level, after processing, to ensure that the data was valid and followed the expected patterns before it was published.
Usually when automated tests are needed, there will be some at
every level.
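As a concrete illustration of the lowest level, the total-equals-sum-of-its-parts rule described earlier could be written as a unit test like the one below; the column names are hypothetical stand-ins for the real reporting table.

    # Illustrative unit test for the rule that a total column must equal the
    # sum of its breakdown columns. Column names are hypothetical.
    import unittest

    def total_matches_breakdown(row):
        parts = row["page_view_impressions"] + row["ad_refresh_impressions"]
        return row["total_impressions"] == parts

    class ReportingTableTests(unittest.TestCase):
        def test_total_is_sum_of_impression_types(self):
            row = {
                "total_impressions": 150,
                "page_view_impressions": 90,
                "ad_refresh_impressions": 60,
            }
            self.assertTrue(total_matches_breakdown(row))

    if __name__ == "__main__":
        unittest.main()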
It is helpful to document test suites and coverage, even if they are not automated immediately or at all. This makes it easy to review tests and coverage, and it allows new or inexperienced testers, developers, and others to assist with automation and manual testing. Usually, I simply record tests as they are manually created and executed. This helps document edge cases and other expectations and attributes of the data.
As critical tests were identified, we worked to automate them to allow for faster iteration on the data. Because almost all code changes required some regression testing, covering the critical and high-level tests automatically provided easy smoke testing for the system and gave us confidence in the continued integrity of the data when changes were made.

Conclusion
Establishing confidence in data accuracy and integrity can be a daunting task, but it can be accomplished without a Ph.D. or a background in data analysis. Although you cannot use every one of these strategies in every scenario or project, they should provide a guide for how to think about data verification, analysis, and automation, and give you the tools to build confidence that the data you're using is trustworthy. It is important to become familiar with the data at each layer and to create tests between each transformation to ensure consistency in the data. Becoming familiar with the data will allow you to understand which edge cases to look for as well as which trends and outliers to expect. It will usually be necessary to work with other teams and groups to improve and validate data accuracy (a quick drink never hurts to build rapport). Some ways to make this collaboration easier are to understand the focus of the teams you are collaborating with and to show how the data can be valuable for those teams to use themselves. Finally, you can ensure and monitor reliability through automation of process tests and acceptance tests that verify trends and boundaries, and that also allow the data collection processes to be changed and iterated on easily.

Further Reading
1. Peters, M. (2013). "How Do You Know If Your Data Is Accurate?" Retrieved December 12, 2016, from http://bit.ly/2gJz84p.
2. Polovets, L. (2011). "Data Testing Challenge." Retrieved December 12, 2016, from http://bit.ly/2hfakCF.
3. Chen, W. (2010). "How to Measure Data Accuracy?" Retrieved December 12, 2016, from http://bit.ly/2gj2wxp.
4. Chen, W. (2010). "What's the Root Cause of Bad Data?" Retrieved December 12, 2016, from http://bit.ly/2hnkm7x.
5. Jain, K. (2013). "Being Paranoid About Data Accuracy!" Retrieved December 12, 2016, from http://bit.ly/2hbS0Kh.

About the Author


Since graduating from the University of Texas at Austin with a BS in computer science, Jessica Roper has worked as a software developer focused on data: maintaining, processing, scrubbing, warehousing, testing, reporting on, and building products from it. She is an avid mentor and teacher, taking any opportunity available to share knowledge. Jessica is currently a senior developer in the data analytics division of Spiceworks, Inc., a network used by IT professionals to stay connected and monitor their systems.
Outside of her technical work, she enjoys biking, swimming, cooking, and traveling.
