
knowerce

Slovak Public Procurement Announcements
Extraction, Transformation and Loading Process
July 2010

info@knowerce.sk
www.knowerce.sk

Document information
Creator Knowerce, s.r.o.
Vavilovova 16
851 01 Bratislava

info@knowerce.sk
www.knowerce.sk

Author Štefan Urbánek, stefan@knowerce.sk

Date of creation 20.7.2010

Document revision 2

1. Document Restrictions
Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".


Contents

Introduction
Overview
    The Process
Jobs
    Download
    Parse
    Load Source
    Cleanse
    Create Cube
    Create search index
    Regis Download
    Geography Loading
    CPV Loading
Data
    Source Mirror
    Staging Data
    Datamart Datastore
Search Index
Installation
    Software Requirements
    Preparation
    ETL Database initialisation
Running ETL Jobs
    Launching


2. Introduction
This document describes the extraction, transformation and loading (ETL) process for public procurement
documents in Slovakia. The objective of the VVO project was to transform unstructured public
procurement announcement documents into structured form.

[Diagram: unstructured HTML documents → raw open data]

Source code: http://github.com/Stiivi/vvo-etl


Data source URL: http://www.e-vestnik.sk/
Application using the data: http://vestnik.transparency.sk


3. Overview
3.1. The Process
Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are:

[Process chain: Download → Parse → Load source → Cleanse → Create cube → Create search index, spanning source extraction, transformation and analytical transformation]

The reasons for creating several jobs instead of a single monolithic processing script are mainly: better
maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in
the future.
If a part of the chain fails, it is not necessary to run the whole chain again, just the chain from the
failed part onwards. This lowers the processing load and the network load on the source servers. For example, if cleansing
fails, it is not necessary to download the files again.
In addition to the processing jobs, there are three required but independent jobs:

Regis Extraction, Geography Loading, CPV Loading

Job                   Type      Description

Download              core      Download HTML documents from the source
Parse                 core      Parse HTML documents into structured form
Load source           core      Load structured form into database table
Cleanse               core      Cleanse data, fix values, map corrections
Create cube           core      Create analytical structure: fact table and dimensions
Create search index   core      Create search index for full-text searching with support for Slovak/ASCII searching
Regis Extraction      support   Extract list of all Slovak organisations
Geography loading     support   Load data from Slovak post office about regional break-down
CPV loading           support   Load CPV (common procurement vocabulary) data


4. Jobs
4.1. Download

[raw source documents → Download → local HTML files]

Inputs: HTML documents stored on the public procurement website
Outputs: HTML files stored locally
Configuration: public procurement website root, path to bulletin index, document encoding
Options: incremental mode (default), full mode (download all announcements)

At the site root one can find a paginated list of bulletins:

http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania

Following a bulletin link leads to a list of announcement types:

http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania


By clicking on a link with the desired public procurement type (“procurement results”), the list is expanded and
we get a list of all announcements within the bulletin:

http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07

Situation:
■ no data API provided by the website
■ no single list of all public procurements, only paginated browsing of bulletins
■ no proper HTML id attributes, nor unambiguous class attributes
■ table-based layout

Process
1. Download and parse the document index at the specified site root, get the number of pages
2. Download and parse all “bulletin list” pages; the output is the name and URL of each bulletin
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate the list of
bulletins to be downloaded (all of them if a full download is requested)
4. Download all announcements found on each bulletin page and save them into the download directory
5. Store the list of downloaded bulletins (a simplified sketch of this selection and download follows)
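
The following Ruby fragment is only a minimal sketch of the incremental selection and download step; the helpers list_bulletins and announcement_links, the directory paths and the state file are hypothetical illustrations, not the actual implementation (which runs inside the Brewery ETL framework).

require 'open-uri'
require 'yaml'

DOWNLOAD_DIR = '/var/lib/vvo-files/html'            # working directory (assumed)
STATE_FILE   = File.join(DOWNLOAD_DIR, 'downloaded_bulletins.yml')

# Bulletins already stored from previous runs (incremental mode)
downloaded = File.exist?(STATE_FILE) ? YAML.load_file(STATE_FILE) : []

# list_bulletins is a hypothetical helper that walks the paginated bulletin
# index at the site root and returns [{ :name => ..., :url => ... }, ...]
available = list_bulletins('http://www.e-vestnik.sk/')

# Full mode would skip this filter and take all available bulletins
to_download = available.reject { |b| downloaded.include?(b[:name]) }

to_download.each do |bulletin|
  # announcement_links is a hypothetical helper returning announcement URLs
  # found on one bulletin page
  announcement_links(bulletin[:url]).each do |url|
    html = open(url).read
    File.open(File.join(DOWNLOAD_DIR, File.basename(url) + '.html'), 'w') { |f| f.write(html) }
  end
  downloaded << bulletin[:name]
end

File.open(STATE_FILE, 'w') { |f| f.write(downloaded.to_yaml) }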

4.2. Parse

[local HTML files → Parse → YAML files]


Inputs: HTML documents with announcements, stored locally
Outputs: YAML structured files with parsed fields, one YAML file per announcement
Configuration: none
Options: none
Situation:
■ very messy HTML structure
■ ambiguous class attributes, misuse of class attributes
■ no usable id attributes
■ heavy table layout with nested tables; level 3 nesting is common (table in table in table)
■ sometimes broken layout, causing many parsing exceptions
■ values cannot be reliably indexed by referencing row numbers
■ inconsistent table layout – may or may not contain tbody
Document example:

http://www.e-vestnik.sk/EVestnik/Detail/16563

Example of the layout with emphasised-contrast CSS for better visibility (image not reproduced here):


Example of a broken layout, where the cyan values in the left column were supposed to be in the right
column (image not reproduced here):

Example of element nesting within a document, 24 levels deep:


html > body > #page > #container > #main > #innerMain > div >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td > span.hodnota

A situation like the one described above makes parsing of the public procurement documents tricky.
Rough document structure (as seen by a human reader):
■ document title
■ basic announcement information
■ parts of the announcement
■ each part of the announcement contains sections
■ each section contains a list of information pieces (I would not call them key-value pairs, as they are not)


Process
The whole document is parsed as an HTML document tree.
Strategies used:

■ Unicode regular expression matching


■ element references by element index (unstable, but sufficient for most cases) – instead of using
proper id/class attributes (which were missing), we used the index of the element that we wanted to
parse (see the sketch below)
■ because the structure was not consistent, sometimes searching for elements was necessary instead of
referencing them directly by path, which made processing a little slower
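
A small Nokogiri sketch of the index-based referencing strategy; the selectors and indexes below are illustrative only, not the ones actually used in the job.

require 'nokogiri'

doc = Nokogiri::HTML(File.read('announcement.html'))

# There are no usable id/class attributes on the inner tables, so elements
# are picked by their position. Because a <tbody> may or may not be present,
# rows are searched for rather than addressed by a fixed path.
main = doc.css('#innerMain table').first         # first layout table (illustrative)
rows = main ? main.css('tr') : []

# Basic announcement information is expected in the first rows; the exact
# indexes are unstable but sufficient for most documents.
header_cells = rows[0] ? rows[0].css('td') : []
title        = header_cells[1] ? header_cells[1].text.strip : nil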

1. read basic announcement information: date, announcement number, type
2. find the table with document parts and split the HTML document into subtrees, one per part
3. parse each part

Part parsing:
The main body of the document is a table containing cells which hold an optional part title and a part
body in the form of a table. The part body table contains anonymous rows with section contents in
two columns. The left column is used mostly for padding and might contain a section number. The right
column contains the information to be extracted. How parts and sections look is depicted in the
following picture:

[Figure: each <td> cell of the main table holds an optional <span> with the part title and a <table> with the part body; in the part body, a row whose left cell holds a number begins a section (section number and title), while rows with an empty left cell carry the section content in the right cell; the pattern repeats for further parts.]

It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken
into a list of table rows and the rows were parsed sequentially, as on a “tape” (a sketch follows the steps):
1. prepare a section structure
2. get the next row
3. if the left column contains a value, then it is the beginning of the next section
3.1. process the previous section, if there is any
3.2. prepare a new section structure
3.3. save the next section name into the section structure
4. if the left column is empty, then:
4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
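
A minimal Ruby sketch of this “tape” processing; part_rows is assumed to be the list of <tr> elements of one part body, and the result structure is simplified for illustration.

sections = []
current  = nil

part_rows.each do |row|
  cells = row.css('td')
  left  = cells[0] ? cells[0].text.strip : ''
  right = cells[1]

  if !left.empty?
    # A value in the left column marks the beginning of the next section
    sections << current if current        # store the previous section, if any
    current = { :number => left, :title => right ? right.text.strip : '', :rows => [] }
  elsif current && right
    # Empty left column: the right cell belongs to the current section
    current[:rows] << right
  end
end

sections << current if current            # do not forget the last section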

Section parsing:
After parsing the parts, each section structure contains the section title, the section number and a list of rows (cells
from the right column of the part table). The rows are processed sequentially as well.
Each set of section rows is parsed into field/value pairs using Unicode regexp matching (see the sketch below). Because
the naming of values was not consistent, multiple values/matches or more complex regular expressions had to be used.
The value keys had different wordings or used different words to describe the same value.
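
A sketch of the field/value matching with Unicode regular expressions; the field names and patterns are illustrative examples of the approach, not the actual set used in the job.

# Several alternative wordings may map to one target field, so each field is
# described by one (possibly complex) pattern with alternatives.
FIELD_PATTERNS = {
  :supplier_ico => /I[ČC]O\s*:?\s*(\d[\d ]*)/u,
  :final_value  => /(?:Konečná hodnota|Hodnota|Cena)[^\d]*([\d ]+(?:[,.]\d+)?)/u
}

def parse_section_rows(rows, record = {})
  rows.each do |row|
    text = row.text.strip
    FIELD_PATTERNS.each do |field, pattern|
      if (match = text.match(pattern))
        record[field] ||= match[1].strip
      end
    end
  end
  record
end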
Examples of section rows (rendered view vs. document HTML) are not reproduced here.



Part V contained a list of contracts and required separate parsing.

No heavy data cleansing is performed at this stage; only numerical values are fixed and text strings are trimmed.

Issues
■ fields with currency amounts came in many forms:
  ■ one amount (expected) or two amounts (expected and final)
  ■ a single amount or a from-to range
  ■ with or without currency
  ■ with or without a “VAT included” flag
  ■ with or without a VAT rate
■ there were no field name prefixes (such as “name:”, “phone:”) in all contacts; field order was used
in that case (not 100% reliable)
■ empty/bogus HTML nodes, sometimes preventing proper parsing

4.3. Load Source

[YAML files → Load source → contracts table (staging)]

Inputs: YAML structured files with parsed fields, one YAML per announcement
Outputs: populated staging database table with contracts
Configuration: none
Options: default mode (just load data), create mode (create DB structures)

Process
Simple mapping of the structured files into a database table (a sketch follows):

■ load each structured file and, for each contract in it:
  ■ insert the contract record into the table
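
A sketch of this loading step using the sequel gem; the connection string, directory and table name are assumptions for illustration.

require 'sequel'
require 'yaml'

DB        = Sequel.connect('postgres://localhost/vvo')   # connection assumed
contracts = DB[:sta_contracts]                            # staging contracts table (name assumed)

Dir.glob('/var/lib/vvo-files/yaml/*.yml').each do |path|
  announcement = YAML.load_file(path)
  (announcement['contracts'] || []).each do |contract|
    # Values are stored mostly as raw text; only currency amounts are numeric
    row = {}
    contract.each { |key, value| row[key.to_sym] = value }
    contracts.insert(row)
  end
end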


The table contains mostly unprocessed raw text values; numeric types are used only for currency amounts. The content
of the table mostly matches the information from the source documents.

4.4. Cleanse

[staging contracts table + REGIS (SK organisations) → Cleanse → clean data (fields with appropriate type and format) and an “unknown suppliers” map]

Inputs: populated staging database table with contracts
Outputs: cleaned staging data with consolidated suppliers
Configuration: none
Options: default mode (just load data), create mode (create DB structures)

Process
The goal of this job is to cleanse the data taken from the source and consolidate them. More specifically:

■ cleanse the organisation number (ICO) format (without validity checking)
■ coalesce values of short enumerations
■ consolidate date formats
■ add procurer additions into the procurers table
■ consolidate suppliers and add additions into the suppliers table

Suppliers Consolidation
Requirements:

■ a table with suppliers that might contain more information than is present in the REGIS database
■ the possibility to automatically correct errors in source documents, such as invalid IDs
■ collect all unknown IDs in a separate table for further correction

The presence and validity of the organisation identification number (ICO) in the source does not meet the quality
requirements. There are cases when the ICO does not match any organisation in the organisation
database. For those cases a mapping table is created where one can specify a mapping of invalid
company identifications to valid ones. There are two ways of corrective mapping:

■ map an organisation directly within a specific announcement:

[announcement №, organisation ID] → [correct organisation ID]


■ map unknown organisations:


[country, organisation ID, organisation name] → [correct organisation ID]

The process is depicted in the following image:

[Figure: records from sta_vvo_vysledky are matched against sta_regis; unknown suppliers are resolved through map_suppliers, coalesced Slovak suppliers are collected in tmp_coalesced_suppliers_sk, and newly found suppliers are appended to sta_suppliers.]

1. Try to find unknown suppliers
2. Coalesce the supplier: use the organisation ID from the suppliers table if found, otherwise use the one
obtained from the mapping table (sketched below).
3. Append newly found suppliers
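
The sketch below expresses this resolution logic with the sequel gem; the table and column names are simplified assumptions based on the diagram above.

regis   = DB[:sta_regis]                    # SK organisations (REGIS)
mapping = DB[:map_suppliers]                # corrective mapping table
unknown = DB[:map_unknown_organisations]    # collected unknown suppliers (name assumed)

# Returns a valid organisation ID for a supplier, or nil when it stays unknown.
def resolve_supplier_ico(ico, name, regis, mapping, unknown)
  return ico if regis.where(:ico => ico).count > 0             # known in REGIS

  corrected = mapping.where(:org_id => ico).get(:correct_org_id)
  return corrected if corrected                                 # fixed via mapping

  # Collect the unknown ID/name pair for later manual correction
  unknown.insert(:org_id => ico, :name => name) unless unknown.where(:org_id => ico).count > 0
  nil
end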

The reason for having a separate suppliers table is that it might be extended with more information
than is provided by the REGIS organisations database.


4.5. Create Cube

[cleansed staging data → Create cube → fact table, dimension tables, analytical model description]

Inputs: cleaned staging data
Outputs: fact table, dimension tables, analytical model description
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
This step creates and loads all structures for analytical processing:

■ fact table – one fact is one contract
■ dimensions:
  ■ supplier
  ■ procurer
  ■ process type
  ■ contract type
  ■ evaluation type
  ■ account sector
  ■ supplier geography

Process
1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values – if there are values in the source data that are not found in the
dimensions, mark them as “unknown” and add them into the dimension tables as new values
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown
fields

Create Fact Table

The fact table is created simply by transforming the cleansed data and joining it with the prepared dimension
tables, roughly as sketched below.
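
A sketch of the join, expressed as SQL run through the sequel gem; the table and column names are illustrative assumptions, not the actual schema.

require 'sequel'

DB = Sequel.connect('postgres://localhost/vvo')   # connection assumed, as in the loading sketch

# Populate the fact table from cleansed contracts, replacing raw codes and
# identifiers with dimension keys (all names are illustrative).
DB.run <<-SQL
  INSERT INTO ft_contracts (procurer_key, supplier_key, contract_type_key, contract_amount)
  SELECT dp.id, ds.id, dct.id, c.amount
  FROM sta_contracts c
  LEFT JOIN dm_procurer      dp  ON dp.ico   = c.procurer_ico
  LEFT JOIN dm_supplier      ds  ON ds.ico   = c.supplier_ico
  LEFT JOIN dm_contract_type dct ON dct.code = c.contract_type
SQL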


4.6. Create search index

[dimension tables → Create search index → dimension index and Sphinx search index]

Inputs: dimension tables
Outputs: Sphinx search index
Configuration: none
Options: none
This step creates an index of dimension values at searchable levels and indexes them with the Sphinx full-text
search indexer. The index is created using a Slovak character mapping, to allow search queries in plain
ASCII (without carons and accents).
The analytical model is a multidimensional cube in a star schema¹ with hierarchical dimensions that have
multiple levels. It would not be sufficient to create a full-text search index for each table, as we need to
know at what level the searched field was found. For this purpose a dimension index table is created.
The dimension index contains fields:

■ dimension
■ dimension key (reference to dimension row - whole dimension point)
■ level (for example: ‘county’, ‘region’ or ‘country’ in geography)
■ level key – value of level key attribute (for example: county code)
■ indexed field name
■ indexed field value
Sphinx indexes the dimension index table.
Example of a search query: ‘Bystri*’. There are several cities called “Bystrica”, such as “Banska
Bystrica”; however, there is also a region called “Banskobystricky” that will match the same query, and
we want to get both results – the higher level (region) and the detailed level (city).
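
A sketch of how such dimension index rows might be generated for a couple of dimensions; the dimension tables, levels and column names are assumptions for illustration.

require 'sequel'

DB        = Sequel.connect('postgres://localhost/vvo')
dim_index = DB[:dimension_index]    # table later indexed by Sphinx (name assumed)

# Searchable levels per dimension (illustrative)
SEARCHABLE_LEVELS = {
  :dm_geography => [:county, :region],
  :dm_cpv       => [:division, :category]
}

SEARCHABLE_LEVELS.each do |dimension, levels|
  DB[dimension].each do |row|
    levels.each do |level|
      # One index row per dimension row and level, so a match tells us both
      # where (dimension, level) and what (level key) was found
      dim_index.insert(
        :dimension   => dimension.to_s,
        :dim_key     => row[:id],
        :level       => level.to_s,
        :level_key   => row[:"#{level}_code"],
        :field_name  => "#{level}_name",
        :field_value => row[:"#{level}_name"]
      )
    end
  end
end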

4.7. Regis Download


Inputs: documents on the website of the Statistical Office of the Slovak Republic
Outputs: table with list of organisations in Slovakia
Configuration: source URL, document ID range, number of concurrent processing threads
Options: incremental download (default), full reload

¹ A fact table joined with dimension tables with no deeper references. All tables are joined to the fact table
directly; there are no chained joins such as FT - T1 - T2.


Process
Documents are downloaded sequentially by document ID from the source URL. The downloading is
done in batches of 50k documents (configurable) and in 20 parallel threads (configurable), roughly as sketched below.
Although the documents are labeled as HTML, they contain no valid HTML code and can be
considered text documents with HTML tags. The downloaded documents are stripped of HTML
tags and then parsed with regular expressions as plain-text documents.
The process of downloading and processing all documents takes about 2 hours on average, therefore it is
advised to run the process on a weekly basis.
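
A simplified Ruby sketch of the batched, multi-threaded download; the URL building, output paths and helper structure are assumptions, and the real batch size, thread count and source URL come from the job configuration.

require 'open-uri'
require 'thread'

THREAD_COUNT = 20        # configurable
BATCH_SIZE   = 50_000    # configurable

def download_batch(base_url, first_id)
  queue = Queue.new
  (first_id...(first_id + BATCH_SIZE)).each { |id| queue << id }

  workers = (1..THREAD_COUNT).map do
    Thread.new do
      while (id = (queue.pop(true) rescue nil))
        text = open("#{base_url}#{id}").read
        # The documents are not valid HTML: strip the tags and keep the rest
        # as plain text for later regular-expression parsing
        plain = text.gsub(/<[^>]*>/, ' ')
        File.open("regis/#{id}.txt", 'w') { |f| f.write(plain) }
      end
    end
  end
  workers.each(&:join)
end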

4.8. Geography Loading


Inputs: list of municipalities and counties from Slovak Post Office
Outputs: single de-normalised table with hierarchical geography information about Slovakia
Configuration: none
Options: none

Process
Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and
region names, into a single de-normalised table.

4.9. CPV Loading


Inputs: Multilingual wide CPV code table
Outputs: single de-normalised table with hierarchical CPV structure
Configuration: none
Options: none

Process
The Common Procurement Vocabulary (CPV) code table provided by the EU institutions is a linear structure
with tree-structure properties. This table is transformed into a de-normalised table with the tree
hierarchy levels in multiple columns, as sketched below.
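
One possible way to derive the hierarchy columns is sketched below; it assumes the usual CPV digit-prefix structure (division = first 2 digits, group = 3, class = 4, category = 5 digits of the 8-digit code), which is an illustration rather than a statement about the exact rule used in the job.

# Derive tree levels from an 8-digit CPV code such as "03110000".
def cpv_levels(code)
  digits = code[0, 8]
  {
    :division => digits[0, 2] + '000000',
    :group    => digits[0, 3] + '00000',
    :class    => digits[0, 4] + '0000',
    :category => digits[0, 5] + '000'
  }
end

cpv_levels('03110000')
# => {:division=>"03000000", :group=>"03100000", :class=>"03110000", :category=>"03110000"}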


5. Data
Overview:

[source documents → Source Mirror → Staging Data → Datamart]

There are three data stores:

■ source mirror on a file system
■ staging data database schema
■ datamart database schema
More detailed view:

[Figure: in the Source Mirror, source documents are downloaded as HTML files and parsed into YAML files; in Staging Data, the YAML files are loaded into staging contract data together with source lists, mappings and temporary tables, and then cleansed; in the Datamart, the cube creation step produces the fact table, dimensions and the logical model (metadata) that form the contracts cube.]


5.1. Source Mirror


The source mirror contains the downloaded original documents and the parsed, structured version of the
documents in YAML format. If the source becomes unavailable and it is desired to parse the files again
(more attributes gathered, different parsing method, bug fix), it can be done on the locally stored files.
Documents are not parsed directly into the database. Reasons:

■ YAML text file storage was a requirement
■ structured documents can be processed with other tools without any database server connection

5.2. Staging Data


Structured files are loaded into the database, into the staging data datastore (preferably a separate schema). The
files are loaded without any, or with only very minor, transformations. The table should be a 1:1 copy of the
structured files.
The staging data store contains:

■ lists/enumerations, for example ISO country region subdivisions
■ copies of various sources or preprocessed datasets, such as geography from the SK post office and the
registry of organisations (REGIS)
■ staging data for procurers and suppliers – might contain more information than provided by the
registry of organisations (REGIS)
■ maps – for mapping source values to desired values, coalescing and unifying:
  ■ map of unknown organisations – maps unknown org. names and org. codes to existing
  organisations
  ■ map of region names – region naming in REGIS differs from the official post office region
  registry
  ■ map of reference codes – maps full-text values, such as names of procurement types, to
  short codes (identifiers) that will be used as keys; also unifies similar names into the same code
■ temporary tables – tables used during the transformation process that are created only for the
purpose of a single transformation run (for example: coalesced suppliers according to REGIS,
mapped unknown organisations and existing registered organisations)

Some tables are appended with new data during the transformation process. New data are
added into:

■ the map of unknown organisations – for further fixing
■ new known organisations – for further update with additional information

5.3. Datamart Datastore


The Datamart Datastore, separate database schema, contains final data ready for analysis and
reporting. Structures in the schema are:

■ logical model metadata – description of the OLAP cube for contracts (Brewery framework
objects)
■ dimension tables – tables with dimension values (hierarchical)
■ fact table – cleansed table with procurement contracts, joinable with dimensions


The dimension tables together with the fact table in this schema form a snowflake schema.²
Brewery OLAP uses the structures in the datamart datastore to denormalize the snowflake schema
into a wide fact table suitable for analysis, aggregation and reporting. That means that the end-user –
the analyst – does not have to know about the physical structures behind the procurement contracts. He has
only one logical fact table where one row is one fact, that is, one contract. The logical metadata enable
the analyst to perform analysis on a multidimensional hierarchical structure.

² http://en.wikipedia.org/wiki/Snowflake_schema


6. Search Index
One of the requirements for the public procurements portal was the ability to search through the
data by many different fields. The nature of the final data is:

■ many fields, described by metadata – we should not rely on a fixed data structure,
■ hierarchical structure – we need to know at what level the value that we are searching for can be
found

Example of a search query: “chemical”. The word “chemical” might be contained in the subject type,
however at different levels: division, category, subcategory… We have to know the exact level where the
word appeared. If the word “chemical” is found at the division level, we want to report at the division level; if the
word is found at the category level, we want to aggregate at the category level, etc.

The Sphinx search engine can create one index for a table for a known set of fields. While searching,
we do not know in which field the value was found, only the document number (row). To make searching in
multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The
final table that is indexed contains:

■ string value of the indexed searchable field
■ dimension of the field (cpv, organisation, region, …)
■ dimension level of the field (division/category/subcategory, region/county, …)
■ level key of the indexed field
■ an index “document” id that will be returned by Sphinx


7. Installation
7.1. Software Requirements
• PostgreSQL database server
• ruby 1.9 (does not work with version 1.8)
• gems: sequel, data-mapper, nokogiri
• Sphinx
• Brewery from http://github.com/Stiivi/brewery/

7.2. Preparation

I. create a directory where working files, such as dumps and ETL files, will be stored, for example:
/var/lib/vvo-files
II. initialize and configure Brewery (see Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data

7.3. ETL Database initialisation


To initialize ETL database schema run the Brewery ETL tool:

etl initialize

This will create all necessary system tables. If you try to initialise a schema which already contains ETL
system tables, you will get an error message. This prevents you from overwriting existing data. To recreate
the schema and start with empty tables, execute the initialize command with the --force flag:

etl --force initialize


8. Running ETL Jobs


8.1. Launching

Manual Launching
Jobs are run by simply launching the etl tool:

etl run job_name

To manually run all daily jobs, you might use the following script:

#!/bin/bash
#

DEBUG='--debug'

etl $DEBUG run vvo_download
etl $DEBUG run vvo_parse
etl $DEBUG run vvo_load_source
etl $DEBUG run vvo_cleanse
etl $DEBUG run vvo_create_cube
etl $DEBUG run vvo_search_index

If a job fails, you need to re-run only the failed job and the jobs after it.
To do a full download instead of an incremental one, run:

etl run vvo_download all
