Contents
1. Introduction
   Other Help Resources
2. User Interface Overview
   The Explorer Panel
   The Editing Area
   The Status Panel
   Configuration Dialogs
3. Creating a Profile
4. Reading a Profile
   Introduction
   Column Analyses
      Basic Analyses
      Interpreting Counts
      Frequency
      Domain Analysis
      Mask Analysis
      Quantiles
      Group Frequency Analysis
   Inputs and Roll Ups
   Primary Keys
   Foreign Keys
   Business Rules
   Dependency Analysis
5. Working with Plan Files
   The Plan Editor
      Building a Plan
      Editing Steps
      Comments
1. Introduction
DQ Analyzer 6 is a comprehensive tool for Data Profiling. This guide provides an overview of the basic functionality of the product and describes how to perform common tasks. Highlights of this guide include:
- How to profile a text file in under 30 seconds (Chapter 3 - Creating a Profile)
- How to read and interpret Profiling results (Chapter 4 - Reading a Profile)
- Short descriptions of the included algorithms (Steps) (see Description of Steps)
- How to connect to a database for additional profiling capabilities (Chapter 6 - Connecting to a Database)
The Explorer Panel
The explorer panel is where project files and database connections are shown. Many action shortcuts are available by right-clicking on objects in this panel, such as creating new files or connecting to a database.
Configuration Dialogs
There are various dialogs used in DQ Analyzer to configure the different components, similar to the one shown below. These dialogs are typically invoked by double-clicking Steps or via the context (right-click) menu.
3. Creating a Profile
To create a Profile, click on the Create a Profile link from the Welcome Screen or select New Profile from the toolbar or context menu (as shown below).
You will first be given the option to create a Profile from a text file or from a database table.
In order to profile a database table, you must have a database configured (see Chapter 6 Connecting to a Database to learn how to do this). In this guide we will demonstrate the process of profiling a text file.
Choose the file you would like to profile and click Next >.

Note: You can skip these steps by right-clicking on a text file or database table and selecting Create Profile. This will take you directly to the next step.

You may need to assign metadata to your file to describe how it is formatted. If there is no metadata associated with the file, the following screen will appear.
If the format shown is not correct, click on the Open Metadata Editor button to customize the format and appearance of your data. Once the appropriate metadata has been assigned, the profile configuration step will appear.
This panel allows you to configure the profile that you will create. You can specify where to create the profile as well as which columns to profile. Drill-through functionality allows you to see the individual records behind the generated statistics (database connection required). Finally, there is the option to create a Profile or a Plan file. Selecting the Profile option will generate the Profile immediately using the settings specified. Selecting the Plan option will create a Plan that can be run to generate a Profile (see Chapter 5 Working with Plan Files for more information on Plans). This option is useful if you wish to modify or filter the data before profiling it, or if you want to do some advanced configuration of the profiling algorithm (such as adding business rules or performing primary key analysis).
If you select Profile and click Finish, the Profile will be generated and it will be opened in the Profile Viewer. See Chapter 4 - Reading a Profile to learn how to read the data contained in the Profile. If you select the option to create a Plan file, a file will be created that looks something like this:
There are two steps in the Plan, one for reading the data and another to generate the Profile. You can double-click on either of them to perform additional configuration. To create the Profile, click the Run button in the toolbar.
If you want to modify the Profile that is created, you can double-click on the Profiling step to open the Profiling step editor. Here you can edit the existing configuration or add additional analyses to run.
4. Reading a Profile
Introduction
The Profile Viewer contains several tabs and windows, which are described below. The data can be exported to XML or HTML format by using the Export button above each table. For information on how to create a profile, see Chapter 3 - Creating a Profile.
Column Analyses
The Column analyses tab presents statistical analyses and pattern information about the columns that have been profiled. Each column in the input data is listed as a row in the table, which presents information such as data type, value counts, and minimum/maximum values.
Basic Analyses
The Basic tab provides simple statistics about the data that has been profiled and shows a chart of duplicate and distinct data as a percentage of the whole.
Interpreting Counts
The Counts table lists the following values:
- Null: all data that are empty or have "Null" as their value
- Non-null: all data that are not empty or null (duplicate + distinct)
- Duplicate: the number of values that are the same as other values in the list
- Distinct: the number of non-null values that are different from each other (non-unique + unique)
- Non-unique: the number of values that have at least one duplicate in the list
- Unique: the number of values that have no duplicates
To illustrate the meaning of these values, take the following data as an example.
Record No. 1 2 3 4 5 Value John Smith John Smith Rebecca Davis Paul Adams
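The relationships between these counts can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the product's implementation. It assumes that the record without a value is record 4 (the position of the empty record is an assumption), and it derives Duplicate as Non-null minus Distinct, following the identity "duplicate + distinct" given in the list.

```python
from collections import Counter

def profile_counts(values):
    """Compute the six counts for a list of values (illustrative sketch)."""
    # Assumption: None, the empty string, and the text "Null" all count as null.
    non_null = [v for v in values if v not in (None, "", "Null")]
    occurrences = Counter(non_null)
    distinct = len(occurrences)
    # Unique values occur exactly once; non-unique values occur more than once.
    unique = sum(1 for n in occurrences.values() if n == 1)
    return {
        "null": len(values) - len(non_null),
        "non_null": len(non_null),
        "duplicate": len(non_null) - distinct,
        "distinct": distinct,
        "non_unique": distinct - unique,
        "unique": unique,
    }

# Hypothetical data: the empty record is placed at position 4 for illustration.
sample = ["John Smith", "John Smith", "Rebecca Davis", None, "Paul Adams"]
print(profile_counts(sample))
```

With this sample the sketch reports 1 null, 4 non-null, 1 duplicate, 3 distinct, 1 non-unique and 2 unique values, so both identities hold: 1 + 3 = 4 and 1 + 2 = 3.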
Frequency
The Frequency Analysis tab shows the number of times each value in the data occurs (shown as both an absolute count and as a percentage of the whole).
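As a rough illustration of what such a table contains, the following Python sketch counts each value and expresses the count as a percentage of all values. The function name and the sort order (most frequent first) are assumptions made for the example.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percentage) rows, most frequent value first."""
    total = len(values)
    counts = Counter(values)
    return [(value, n, 100.0 * n / total) for value, n in counts.most_common()]

for value, count, percent in frequency_table(["a", "b", "a", "a"]):
    print(f"{value}: {count} ({percent:.1f}%)")
```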
Domain Analysis
This analysis determines the likely type of the data in each column (for example, whether the data is text, a number, or a date). The probable types are listed, along with exceptions (such as a text string found in a list of dates).
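The idea can be sketched in Python as follows. This is a simplified stand-in, not the algorithm used by the product: the three type names, the single date format, and the rule "the most common type is the probable type" are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

def guess_type(value, date_format="%Y-%m-%d"):
    """Classify one string as 'number', 'date', or 'text' (assumed categories)."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, date_format)
        return "date"
    except ValueError:
        return "text"

def domain_analysis(values):
    """Return the most common type and the values that do not fit it."""
    types = [guess_type(v) for v in values]
    probable, _ = Counter(types).most_common(1)[0]
    exceptions = [v for v, t in zip(values, types) if t != probable]
    return probable, exceptions

print(domain_analysis(["2020-01-05", "2021-12-31", "soon"]))
```

Here the column is reported as a date column, with the text string "soon" listed as the exception.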
Mask Analysis
The Mask Analysis tab shows the syntactic patterns of the data, i.e. the structure of the data rather than the content of the data. Codes (masks) are used to describe these patterns. For example, the code W is used by default to represent a word (the number of letters required to make a word can be defined in the Profiling Step properties), while L is used to represent a letter. This type of analysis can be useful when, for example, looking at a column of names, where one or two words are common, but single letters and numbers are not. Finding unexpected patterns in the data can provide information about the overall level of quality of the data.
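A minimal Python sketch of mask generation is shown below. Only the codes W (word) and L (letter) come from this guide. The code D for a digit, the default minimum word length of 2, and the choice to copy all other characters unchanged are assumptions made for illustration.

```python
import re

def mask(value, min_word_length=2):
    """Replace content with pattern codes: W = word, L = letter, D = digit (assumed)."""
    parts = []
    # Split into runs of letters, single digits, and single other characters.
    for token in re.findall(r"[A-Za-z]+|\d|.", value):
        if token.isalpha():
            if len(token) >= min_word_length:
                parts.append("W")
            else:
                parts.append("L" * len(token))
        elif token.isdigit():
            parts.append("D")
        else:
            parts.append(token)  # spaces and punctuation are kept as they are
    return "".join(parts)

for name in ["John Smith", "J Smith", "John 3"]:
    print(name, "->", mask(name))
```

In a column of names, masks such as "W W" would be expected to dominate, while "L W" or "W D" would stand out as patterns worth checking.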
Quantiles
The Quantiles tab displays the data values that occur at designated intervals in the ordered data set. The first value in the list is at 0% and the last value is at 100%. The median value is at the 50% marker.
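The following Python sketch shows one way to pick such values. It assumes a non-empty list, evenly spaced intervals, and a simple "nearest position" rule; the product may use a different interpolation method.

```python
def quantiles(values, steps=4):
    """Return the value found at each of steps + 1 evenly spaced positions."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return {
        f"{100 * i // steps}%": ordered[round(i * last / steps)]
        for i in range(steps + 1)
    }

print(quantiles([5, 1, 3, 2, 4]))
```

For the five values above, the 0% marker is the minimum (1), the 100% marker is the maximum (5), and the 50% marker is the median (3).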
Primary Keys
When configured in the Profiling Step properties, the Primary Keys tab is shown and analyzes the uniqueness of designated keys.
Foreign Keys
When configured in the Profiling Step properties, the Foreign Keys tab is shown and analyzes whether the key from one input can be considered a foreign key in relation to the other (parent) entity coming from a second input.
Business Rules
When configured in the Profiling Step properties, this tab is shown and displays the results of the evaluation of any number of Boolean expressions relating to the input data. The example below (taken from the Advanced Profiling sample) shows a business rule that checks the length of each SIN number and tests whether it is 9 digits in length. It evaluates to true if the length is 9 digits and false otherwise.
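The logic of such a rule can be sketched in Python. This is not DQ Analyzer expression syntax; the function names are hypothetical, and the sketch checks only the length, as described above.

```python
from collections import Counter

def sin_length_rule(sin):
    """Business rule: true if the value is 9 characters long, false otherwise."""
    return len(sin) == 9

def evaluate_rule(rule, values):
    """Count how many values satisfy the rule and how many do not."""
    return Counter(rule(v) for v in values)

results = evaluate_rule(sin_length_rule, ["123456789", "12345", "987654321"])
print("true:", results[True], "false:", results[False])
```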
Dependency Analysis
Dependency Analysis discovers whether values of Dependants (selected columns or expressions) depend on the value of a Determinant (one or more columns combined into a single key). Each group of records with the same Determinant value is examined, and if the most frequent Dependant value is present in at least a certain percent of records, the whole group is considered to be dependent. Otherwise the whole group is considered not to be dependent.
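The grouping logic described above can be sketched in Python as follows. The function name, the record layout (a list of dictionaries), the column names, and the default threshold of 90% are assumptions for the example only.

```python
from collections import Counter, defaultdict

def dependency_analysis(records, determinant, dependant, threshold=0.9):
    """Map each Determinant value to True if its group is considered dependent."""
    groups = defaultdict(list)
    for record in records:
        # One or more columns are combined into a single key.
        key = tuple(record[column] for column in determinant)
        groups[key].append(record[dependant])
    result = {}
    for key, dependants in groups.items():
        _, top_count = Counter(dependants).most_common(1)[0]
        # The group is dependent if the most frequent Dependant value
        # covers at least the given share of its records.
        result[key] = top_count / len(dependants) >= threshold
    return result

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
print(dependency_analysis(rows, ["zip"], "city", threshold=0.75))
```

With a 75% threshold, the group for "10001" is not dependent (its most frequent city covers only two of three records), while the single-record group for "94105" is.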
Building a Plan
To start building a Plan, drag a Step from the palette and drop it on to the canvas.
Connect Steps together by dragging from the "out" endpoint of one Step to the "in" endpoint of another, or by selecting the Connection object from the palette and then selecting the output of one Step followed by the input of another Step.
Editing Steps
Properties for each Step can be edited by double-clicking on the Step, or by right-clicking the Step and selecting Edit Properties....
For more information on the available Steps in DQ Analyzer, refer to the documentation pages for each Step.
Comments
Comments can be used to place notes or other information on the canvas or to place around a series of Steps to visually group them together.
The metadata editor presents options for how to read the file, such as the type of delimiter used, the data types of each column, and whether the file contains header rows. The result data can be previewed in the lower panel of the editor to examine the results of the metadata settings. Clicking OK in the Metadata Editor will open the data file for viewing. The file metadata can be edited later by right-clicking on the file and selecting Edit Metadata.... To use input files inside a Plan, add a Text File Reader Step and enter the input file name in the File Name property.
Alternatively, text files can be dragged from the explorer panel directly on to the canvas, where a Text File Reader will be generated after the metadata is created.
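The kind of information captured by file metadata (delimiter, header row, column names) can be illustrated with a short Python sketch. It uses Python's csv module rather than the Text File Reader Step, assumes non-empty input, and the generated column names are hypothetical.

```python
import csv
import io

def read_delimited(text, delimiter=",", has_header=True):
    """Parse delimited text into one dictionary per record."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # Without a header row, generate placeholder column names.
        header = [f"column_{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]

print(read_delimited("name;age\nJohn;42\n", delimiter=";"))
```

Reading the same text with the wrong delimiter would produce a single column per record, which is the kind of problem a data preview helps to catch.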
Description of Steps
Steps can perform many types of functions, such as reading, transforming, filtering, and categorizing data. Below is an overview of the Steps included with DQ Analyzer.
Step Name         Step Description
Column Assigner   Assigns the result of an expression to a column.
Condition         Directs data flow (true -> right, false -> left).
JDBC Reader       Reads data from a JDBC (database) data source.
Profiling         Comprehensive analysis written to a file (.profile).
Regex Matching    Parses the input string based upon regular expression capturing groups.
Splitter          Creates a new record for each word in a defined expression.
Text File Reader  Reads data from a text file.
Trash             Discards data flow.
Union Same        Like SQL table union, but applies only if the flows are exactly the same.
A complete description of the Steps and their usage can be found in the product Help (Help > Help Contents in the toolbar menu).
Using Functions
There are many functions available in DQ Analyzer that can be used inside Steps. Some of the common functions are listed below.
Function  Description                                        Return Value(s)
matches   Full match input data with regular expression      True/false
find      Partial match regular expression in input string   True/false
substr    Get substring of input string. Starting with zero. String
For example, two regular expressions and their uses are shown below.
Regular expression: [A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]
Sample usage: Canadian postal code (e.g., A3A 9S9)

Regular expression: (\d{3} \d{2} \d{4}|\d{9}|\d{3}\-\d{2}\-\d{4})
Sample usage: US Social Security Number (123 45 6789 or 123456789 or 123-45-6789)
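The difference between a full match and a partial match can be tried out with Python's re module. The helper functions below are stand-ins named after the functions in the table; their argument order and behavior are assumptions and do not reproduce DQ Analyzer expression syntax. Python slicing is used to show a zero-based substring.

```python
import re

POSTAL_CODE = r"[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]"
SSN = r"(\d{3} \d{2} \d{4}|\d{9}|\d{3}\-\d{2}\-\d{4})"

def matches(text, pattern):
    """Full match: the whole text must fit the pattern."""
    return re.fullmatch(pattern, text) is not None

def find(text, pattern):
    """Partial match: the pattern may occur anywhere in the text."""
    return re.search(pattern, text) is not None

print(matches("A3A 9S9", POSTAL_CODE))          # whole string is a postal code
print(matches("Ship to A3A 9S9", POSTAL_CODE))  # extra text, so no full match
print(find("Ship to A3A 9S9", POSTAL_CODE))     # pattern found inside the text
print(matches("123-45-6789", SSN))
print("A3A 9S9"[0:3])                           # zero-based substring
```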
Selecting an individual Step will show only the warnings and errors for that Step. Double-clicking on an error in the Properties panel will open the Step properties dialog at the field that contains the error. Individual Steps can also be debugged along the way by clicking the Debug button in the toolbar when a Step is selected, or by right-clicking on a Step and selecting Debug.
To run a Plan, click on the Run button on the toolbar or right-click on the canvas and select Run.
When a Plan is being run, its progress can be monitored in the Console window below the Plan editor.
6. Connecting to a Database
The following JDBC database drivers are included with DQ Analyzer (additional drivers can be added in the DB Drivers preferences):
To connect to one of these database types, right-click on the Databases node in the DQ Explorer and select New Database Connection.
Selecting a driver name will populate the URL string field with a template for connecting to the specified database type. After the database connection has been made, the database will be shown in the Databases node in the explorer panel. Clicking on the table names will show metadata for each table in the Properties panel. To view the results of an SQL query on a table, right-click on a table and select Open in SQL editor.
A default query will be shown, listing all table entries (grouped in batches if the number of rows is large). To change the query, edit the query text and click the Execute button. To retrieve more results from the query, click Next batch or Read rest (to show all results). Refer to the documentation for the JDBC Reader Step to learn how to use data from a database inside a Plan file.
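Reading query results in batches can be illustrated with Python's built-in sqlite3 module. This sketch does not use DQ Analyzer or JDBC; the in-memory database, the table name customers, and the batch size are assumptions made for the example.

```python
import sqlite3

def read_in_batches(connection, query, batch_size=100):
    """Yield the query results one batch at a time."""
    cursor = connection.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# Hypothetical in-memory table used only for this example.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
connection.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"name {i}") for i in range(5)],
)

for batch in read_in_batches(connection, "SELECT * FROM customers", batch_size=2):
    print(batch)
```

Fetching one batch at a time corresponds loosely to Next batch, while consuming all remaining batches corresponds to Read rest.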