Getting Started with DQ Analyzer 6

Copyright 2009 Ataccama Corporation

Getting Started with Ataccama DQ Analyzer 6


Contents
1. Introduction
   Other Help Resources
2. User Interface Overview
   The Explorer Panel
   The Editing Area
   The Status Panel
   Configuration Dialogs
3. Creating a Profile
4. Reading a Profile
   Introduction
   Column Analyses
      Basic Analyses
      Interpreting Counts
      Frequency
      Domain Analysis
      Mask Analysis
      Quantiles
      Group Frequency Analysis
   Inputs and Roll Ups
   Primary Keys
   Foreign Keys
   Business Rules
   Dependency Analysis
5. Working with Plan Files
   The Plan Editor
      Building a Plan
      Editing Steps
      Comments
   Working with Data Files
   Tips for Using Steps
      Description of Steps
      Using Functions
      Using Regular Expressions
   Debugging and Running a Plan
6. Connecting to a Database


1. Introduction
DQ Analyzer 6 is a comprehensive tool for Data Profiling. This guide is intended to provide an overview of the basic functionality of the product and describe how to perform common functions. Some of the highlights of this guide include:
- How to Profile a text file in under 30 seconds (Chapter 3)
- How to read and interpret Profiling results (Chapter 4)
- Short descriptions of the included algorithms (Steps) (Chapter 5, Tips for Using Steps)
- How to connect to a database for additional profiling capabilities (Chapter 6)

Other Help Resources


Video Tutorials
Video tutorials are available that demonstrate how to perform common tasks in DQ Analyzer. Links to these tutorials are provided on the Welcome Screen.

Tutorial Files
DQ Analyzer includes a sample Profiling project which contains pre-built, runnable configurations.

Help Files
For help on specific functions or features not covered in this guide or the resources mentioned above, extensive documentation is available in the product Help (available in Help > Help Contents in the toolbar).

2. User Interface Overview


DQ Analyzer is a development tool built on the Eclipse framework, so it is similar in structure and behavior to many Integrated Development Environments (IDEs). The user interface is comprised of four main areas:
1. The explorer panel
2. The main editing area
3. The status panel
4. Configuration dialogs (not shown)

The Explorer Panel
The explorer panel is where project files and database connections are shown. Many action shortcuts are available by right-clicking on objects in this panel, such as creating new files or connecting to a database.

The Editing Area


The editing area is where Profiles are shown. It is also where Plan files are created and edited. These will be described in detail in subsequent sections.

The Status Panel


The Properties panel contains a section that displays any problems that DQ Analyzer detects in Plan files. Clicking on a problem in the list will open the component that contains the problem. This area is also used to show the progress of generating a Profile or running a Plan.

Configuration Dialogs
There are various dialogs used in DQ Analyzer to configure the different components, similar to the one shown below. These dialogs are typically invoked by double-clicking Steps or via the context (right-click) menu.


3. Creating a Profile
To create a Profile, click on the Create a Profile link from the Welcome Screen or select New Profile from the toolbar or context menu (as shown below).

You will first be given the option to create a Profile from a text file or from a database table.


In order to profile a database table, you must have a database configured (see Chapter 6 Connecting to a Database to learn how to do this). In this guide we will demonstrate the process of profiling a text file.

Choose the file you would like to profile and click Next >.

Note: You can skip these steps by right-clicking on a text file or database table and selecting Create Profile. This will take you directly to the next step.

You may need to assign metadata to your file to describe how it is formatted. If there is no metadata associated with the file, the following screen will appear.


If the format shown is not correct, click on the Open Metadata Editor button to customize the format and appearance of your data. Once the appropriate metadata has been assigned, the profile configuration step will appear.


This panel allows you to configure the profile that you will create. You can specify where to create the profile as well as which columns to profile. Drill-through functionality allows you to see the individual records that comprise the statistics that are generated (database connection required). Finally, there is the option to create a Profile or a Plan file. Selecting the Profile option will generate the Profile immediately using the settings specified. Creating a Plan file will create a Plan that can be run to generate a Profile (see Chapter 5 Working with Plan Files for more information on Plans). This option is useful if you wish to modify or filter the data before profiling it or if you want to do some advanced configuration of the profiling algorithm (such as adding business rules or performing primary key analysis, for example).


If you select Profile and click Finish, the Profile will be generated and it will be opened in the Profile Viewer. See Chapter 4 - Reading a Profile to learn how to read the data contained in the Profile. If you select the option to create a Plan file, a file will be created that looks something like this:

There are two steps in the Plan, one for reading the data and another to generate the Profile. You can double-click on either of them to perform additional configuration. To create the Profile, click the Run button in the toolbar.

If you want to modify the Profile that is created, you can double-click on the Profiling step to open the Profiling step editor. Here you can edit the existing configuration or add additional analyses to run.


4. Reading a Profile
Introduction
The Profile Viewer contains several tabs and windows, which are described below. The data can be exported to XML or HTML format by using the Export button above each table. For information on how to create a profile, see Chapter 3 - Creating a Profile.

Column Analyses
The Column analyses tab presents statistical analyses and pattern information about the columns that have been profiled. Each column in the input data is listed as a row in the table, which presents information such as data type, value counts, and minimum/maximum values.

Basic Analyses
The Basic tab provides simple statistics about the data that has been profiled and shows a chart of duplicate and distinct data as a percentage of the whole.


Interpreting Counts
The Counts table lists the following values:
- Null: all data that are empty or have "Null" as their value
- Non-null: all data that are not empty or null (duplicate + distinct)
- Duplicate: the number of values that are the same as other values in the list
- Distinct: the number of non-null values that are different from each other (non-unique + unique)
- Non-unique: the number of values that have at least one duplicate in the list
- Unique: the number of values that have no duplicates

To illustrate the meaning of these values, take the following data as an example.
Record No.   Value
1            John Smith
2            John Smith
3            Rebecca Davis
4            Paul Adams
5            (empty)

The Counts table for this data would be as follows:


Type         Count   Records            Explanation
Null         1       Record 5           The last record is empty.
Non-null     4       Records 1-4        The first 4 records contain data.
Duplicate    1       Record 2           There is one duplicate of the John Smith record.
Distinct     3       Records 1, 3, 4
Non-unique   1       Record 1           John Smith has a duplicate record, therefore it isn't unique.
Unique       2       Records 3 and 4    Rebecca Davis and Paul Adams appear only once in the list; they have no duplicates.
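The arithmetic behind these counts can be reproduced outside the product. The following is a minimal Python sketch, not DQ Analyzer code; the function name profile_counts is hypothetical, and it assumes that None and the empty string both count as null.

```python
from collections import Counter

def profile_counts(values):
    """Compute the six Counts values; None and "" are treated as null (assumption)."""
    non_null = [v for v in values if v not in (None, "")]
    freq = Counter(non_null)                  # occurrences of each non-null value
    distinct = len(freq)                      # number of different values
    non_unique = sum(1 for c in freq.values() if c > 1)
    return {
        "Null": len(values) - len(non_null),
        "Non-null": len(non_null),
        "Duplicate": len(non_null) - distinct,  # extra copies beyond the first
        "Distinct": distinct,
        "Non-unique": non_unique,               # distinct values that repeat
        "Unique": distinct - non_unique,        # distinct values seen once
    }

values = ["John Smith", "John Smith", "Rebecca Davis", "Paul Adams", None]
counts = profile_counts(values)
print(counts)
```

For the five sample records this yields the same figures as the Counts table: Non-null equals Duplicate plus Distinct, and Distinct equals Non-unique plus Unique.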


Frequency
The Frequency Analysis tab shows the number of times each value in the data occurs (shown as both an absolute count and as a percentage of the whole).

Domain Analysis
This is an analysis to determine the likely type of the data in each column (whether the data is text, a number, or a date, for example). The probable types are listed, along with exceptions (such as a text string found in a list of dates, for example).

Mask Analysis
The Mask Analysis tab shows the syntactic patterns of the data, i.e. the structure of the data rather than the content of the data. Codes (masks) are used to describe these patterns. For example, the code W is used by default to represent a word (the number of letters required to make a word can be defined in the Profiling Step properties), while L is used to represent a letter. This type of analysis can be useful when, for example, looking at a column of names, where one or two words are common, but single letters and numbers are not. Finding unexpected patterns in the data can provide information about the overall level of quality of the data.
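The idea of a mask can be illustrated with a short Python sketch. This is not the product's algorithm: the codes W and L come from the description above, while the code D for a digit, the function name mask, and the default minimum word length of 2 are assumptions made for illustration only.

```python
import re

def mask(value, min_word_len=2):
    """Replace letter runs with W (word) or L (single letters), digits with D."""
    def letters(match):
        run = match.group()
        # A run of letters counts as a word only if it is long enough.
        return "W" if len(run) >= min_word_len else "L" * len(run)

    # Letters are handled first so the inserted D is not re-read as a letter.
    masked = re.sub(r"[A-Za-z]+", letters, value)
    return re.sub(r"\d", "D", masked)  # D is a hypothetical code for a digit

print(mask("John Smith"))
print(mask("J Smith"))
print(mask("John 42"))
```

In a column of names, a mask such as "W W" would be expected, while masks containing single letters or digits would stand out as candidates for review.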


Quantiles
The Quantiles tab displays the data values that occur at designated intervals in the ordered data set. The first value in the list is at 0% and the last value is at 100%. The median value is at the 50% marker.
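As a rough illustration, quantiles of this kind can be read off a sorted list by position. The Python sketch below assumes a simple floor-index rule and a hypothetical function name, quantiles; the product may use different intervals or a different interpolation rule.

```python
def quantiles(values, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the value found at each fraction of the ordered data set."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Floor-index rule (an assumption): 0.0 maps to the first value, 1.0 to the last.
    return {f: ordered[int(f * last)] for f in fractions}

data = [70, 10, 90, 30, 50, 20, 80, 40, 60]
result = quantiles(data)
print(result)
```

With these nine values the 0% entry is the minimum (10), the 100% entry is the maximum (90), and the 50% entry is the median (50).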

Group Frequency Analysis


The Groups tab presents a different analysis of the data in the Frequency tab. It shows the number of times that each non-null frequency count is repeated. If all values are unique, the group size will be 1, as there are no duplicate values. Each time a value is repeated, it forms a new group.
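One way to picture this analysis is as a frequency count of the frequency counts. The Python sketch below is an illustration under that reading, with a hypothetical function name, group_frequencies, and None standing in for null values.

```python
from collections import Counter

def group_frequencies(values):
    """Map each group size to the number of distinct values with that size."""
    value_counts = Counter(v for v in values if v is not None)  # frequency per value
    return dict(Counter(value_counts.values()))                 # frequency of frequencies

values = ["John Smith", "John Smith", "Rebecca Davis", "Paul Adams", None]
groups = group_frequencies(values)
print(groups)
```

For the sample names, two values occur once (group size 1) and one value occurs twice (group size 2).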


Inputs and Roll Ups


The Profiling Step may take any number of inputs, which are shown in this panel (if there is more than one input). Additionally, each input may have any number of roll ups assigned to it - ways of grouping the data by specific parameters. For example, roll ups could be used to view data profiles by gender or country.

Primary Keys
When configured in the Profiling Step properties, the Primary Keys tab is shown and analyzes the uniqueness of designated keys.

Foreign Keys
When configured in the Profiling Step properties, the Foreign Keys tab is shown and analyzes whether the key from one input can be considered a foreign key in relation to the other (parent) entity coming from a second input.

Business Rules
When configured in the Profiling Step properties, this tab is shown and displays the results of the evaluation of any number of Boolean expressions relating to the input data. The example below (taken from the Advanced Profiling sample) shows a business rule that checks the length of each SIN number and tests whether it is 9 digits in length. It evaluates to true if the length is 9 digits and false otherwise.
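The logic of such a rule can be sketched in Python. This is not the expression syntax used inside the Profiling Step, which is not shown in this guide; the function name sin_length_ok and the sample values are made up, and a missing value is assumed to fail the rule.

```python
def sin_length_ok(sin):
    """Boolean rule: True when the value is exactly 9 characters long."""
    # A missing value cannot satisfy the rule (assumption).
    return sin is not None and len(sin) == 9

sins = ["123456789", "12345678", None, "987654321"]
results = [sin_length_ok(s) for s in sins]
passed = sum(results)
print(f"{passed} of {len(sins)} records satisfy the rule")
```

The rule only tests length, as described above; a stricter variant could also require that every character is a digit.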


Dependency Analysis
Dependency Analysis discovers whether values of Dependants (selected columns or expressions) depend on the value of a Determinant (one or more columns combined into a single key). Each group of records with the same Determinant value is examined, and if the most frequent Dependant value is present in at least a certain percent of records, the whole group is considered to be dependent. Otherwise the whole group is considered not to be dependent.
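The grouping logic described above can be sketched as follows. This Python example is an illustration only: the function name dependent_groups, the 75% threshold, and the postal code and city data are assumptions, not product defaults.

```python
from collections import Counter, defaultdict

def dependent_groups(records, threshold):
    """Return {determinant: True/False} for (determinant, dependant) pairs."""
    groups = defaultdict(list)
    for determinant, dependant in records:
        groups[determinant].append(dependant)

    result = {}
    for determinant, dependants in groups.items():
        # Count of the most frequent Dependant value within this group.
        top_count = Counter(dependants).most_common(1)[0][1]
        result[determinant] = top_count / len(dependants) >= threshold
    return result

records = [
    ("10001", "New York"), ("10001", "New York"),
    ("10001", "New York"), ("10001", "NYC"),
    ("90210", "Beverly Hills"), ("90210", "Los Angeles"),
]
dependency = dependent_groups(records, threshold=0.75)
print(dependency)
```

Here the group for 10001 reaches the threshold (3 of 4 records share one city) and is considered dependent, while the group for 90210 (1 of 2) is not.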

5. Working with Plan Files


A Plan file defines the logic and rules to be applied to the input data in order to produce the desired output. Plans are created by placing Steps onto a canvas and connecting them together. Steps are data processing algorithms that can be used to read, transform and analyze data, among other actions. To create a new Plan file, select New Plan by right-clicking on a project or folder in the explorer panel (or from the File menu or Toolbar).


The Plan Editor


The Plan editor consists of a canvas where the Plan logic is defined (by connecting Steps together) and a palette where the various Steps are listed.

Building a Plan
To start building a Plan, drag a Step from the palette and drop it on to the canvas.


Connect Steps together by dragging from the "out" endpoint of one Step to the "in" endpoint of another, or by selecting the Connection object from the palette and then selecting the output of one Step followed by the input of another Step.

Editing Steps
Properties for each Step can be edited by double-clicking on the Step, or by right-clicking the Step and selecting Edit Properties....


For more information on the available Steps in DQ Analyzer, refer to the documentation pages for each Step.

Comments
Comments can be used to place notes or other information on the canvas, or can be placed around a series of Steps to visually group them together.


Working with Data Files


Existing files can be added to DQ Analyzer, for example for use as input data for a Plan. Files can be added by dragging and dropping them from the file system onto the desired project in the Navigator panel, or by copying them in the file system into the desired project folder inside the workspace folder. To use an input file in a Plan, it must first be assigned metadata describing the format of the data. When a data file (e.g. .txt or .csv) is opened for the first time, the Metadata Editor is launched.

The metadata editor presents options for how to read the file, such as the type of delimiter used, the data types of each column, and whether the file contains header rows. The result data can be previewed in the lower panel of the editor to examine the results of the metadata settings. Clicking OK in the Metadata Editor will open the data file for viewing. The file metadata can be edited later by right-clicking on the file and selecting Edit Metadata.... To use input files inside a Plan, add a Text File Reader Step and enter the input file name in the File Name property.
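To make the role of file metadata concrete, the Python sketch below reads a small delimited text sample using the same kinds of settings (delimiter, header row, column data types). It does not reflect how DQ Analyzer stores metadata; the function name read_with_metadata and the sample data are hypothetical.

```python
import csv
import io

def read_with_metadata(text, delimiter, has_header, column_types):
    """Parse delimited text and convert each column to its declared type."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if has_header:
        rows = rows[1:]  # skip the header row
    # Apply one converter per column, in column order.
    return [[convert(cell) for convert, cell in zip(column_types, row)]
            for row in rows]

sample = "name;age\nJohn Smith;42\nRebecca Davis;37\n"
records = read_with_metadata(sample, delimiter=";", has_header=True,
                             column_types=[str, int])
print(records)
```

If the delimiter or the header setting were wrong, the preview would show misaligned or mistyped columns, which is the kind of problem the preview panel of the editor helps to spot.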


Alternatively, text files can be dragged from the explorer panel directly on to the canvas, where a Text File Reader will be generated after the metadata is created.


Tips for Using Steps


DQ Analyzer offers several steps and functions for constructing plan files. The algorithms and logic used to create a plan file will vary from project to project; an introduction to steps and functions is provided below.

Description of Steps
Steps can perform many types of functions, such as transforming data, filtering and categorizing data and reading data. Below is an overview of the steps included with DQ Analyzer.
Step Name          Step Description
Column Assigner    Assigns the result of an expression to a column.
Condition          Directs data flow (true -> right, false -> left).
JDBC Reader        Reads data from a JDBC (database) data source.
Profiling          Comprehensive analysis written to a file (.profile).
Regex Matching     Parses the input string based upon regular expression capturing groups.
Splitter           Creates a new record for each word in a defined expression.
Text File Reader   Reads data from a text file.
Trash              Discards data flow.
Union same         Like an SQL table union, but applies only if the flows are exactly the same.


A complete description of the Steps and their usage can be found in the product Help (Help > Help Contents in the toolbar menu).

Using Functions
There are many functions available in DQ Analyzer that can be used inside Steps. Some of the common functions are listed below.
Function   Description                                          Return Value(s)
matches    Full match of input data with a regular expression   True/false
find       Partial match of a regular expression in input string   True/false
substr     Get substring of input string (positions start at zero)   String
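The behavior of these three functions can be approximated in Python for experimentation. The sketch below is an analogy only: the argument order and the optional length parameter are assumptions, since this guide does not give the signatures of the DQ Analyzer functions.

```python
import re

def matches(pattern, text):
    """True only if the whole string matches the regular expression."""
    return re.fullmatch(pattern, text) is not None

def find(pattern, text):
    """True if the regular expression matches anywhere in the string."""
    return re.search(pattern, text) is not None

def substr(text, start, length=None):
    """Substring beginning at a zero-based position, optionally length-limited."""
    return text[start:] if length is None else text[start:start + length]

print(matches(r"\d{9}", "123456789"))
print(find(r"\d{3}", "abc123def"))
print(substr("profile", 0, 3))
```

The difference between a full and a partial match is the main point: a three-digit pattern is found inside "abc123def" but does not fully match it.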

Using Regular Expressions


DQ Analyzer supports the use of regular expressions for pattern matching. Some of the basic regular expressions are listed below.
Regular Expression   Matches
\d                   Digit
[A-Z]                Uppercase letter
[a-z]                Lowercase letter
\s                   Whitespace
. (dot)              Any character
?                    Once or none
+                    Once or more
*                    Zero or more times
{2,6}                At least 2 times, maximum 6 times
^                    Beginning of string
$                    End of string

For example, two regular expressions and their uses are shown below.
Regular Expression String                        Sample Usage
[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]                Canadian postal code (e.g., A3A 9S9)
(\d{3} \d{2} \d{4}|\d{9}|\d{3}\-\d{2}\-\d{4})    US Social Security Number (123 45 6789 or 123456789 or 123-45-6789)
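These two patterns can be tried out in Python, whose re module accepts the same syntax for the constructs used here. In this sketch the postal code pattern is written as one unbroken expression, the variable names are hypothetical, and a full match is assumed to be the intended test.

```python
import re

# Canadian postal code: letter, digit, letter, optional space, digit, letter, digit.
POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]")

# US Social Security Number: separated by spaces, unseparated, or hyphenated.
SSN = re.compile(r"(\d{3} \d{2} \d{4}|\d{9}|\d{3}\-\d{2}\-\d{4})")

def full_match(pattern, text):
    """True only if the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None

for candidate in ["A3A 9S9", "A3A9S9", "a3a 9s9"]:
    print(candidate, full_match(POSTAL_CODE, candidate))

for candidate in ["123 45 6789", "123456789", "123-45-6789", "123 45 789"]:
    print(candidate, full_match(SSN, candidate))
```

Lowercase input fails the postal code pattern because only [A-Z] is allowed, and a value with too few digits fails every alternative of the Social Security Number pattern.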

Debugging and Running a Plan


Errors in the Plan will be shown in the Properties panel as the Plan is constructed.


Selecting an individual Step will show only the warnings and errors for that Step. Double-clicking on an error in the Properties panel will open the Step properties dialog at the field that contains the error. Individual Steps can also be debugged along the way by clicking the Debug button in the toolbar when a Step is selected or by right-clicking on a Step and selecting Debug.

To run a Plan, click on the Run button on the toolbar or right-click on the canvas and select Run.

While a Plan is running, its progress can be monitored in the Console window below the Plan editor.


6. Connecting to a Database
The following JDBC database drivers are included with DQ Analyzer (additional drivers can be added in the DB Drivers preferences):

- Apache Derby
- HSQLDB
- Oracle
- Microsoft SQL
- PostgreSQL

To connect to one of these database types, right-click on the Databases node in the DQ Explorer and select New Database Connection.


Selecting a driver name will populate the URL string field with a template for connecting to the specified database type. After the database connection has been made, the database will be shown in the Databases node in the explorer panel. Clicking on the table names will show metadata for each table in the Properties panel. To view the results of an SQL query on a table, right-click on a table and select Open in SQL editor.


A default query will be shown, listing all table entries (grouped in batches if the number of rows is large). To change the query, edit the query text and click the Execute button. To retrieve more results from the query, click Next batch or Read rest (to show all results). Refer to the documentation for the JDBC Reader Step to learn how to use data from a database inside a Plan file.
