
knowerce

Slovak Public Procurement Announcements
Extraction, Transformation and Loading Process
July 2010

info@knowerce.sk
www.knowerce.sk

Document information
Creator Knowerce, s.r.o.
Vavilovova 16
851 01 Bratislava

info@knowerce.sk
www.knowerce.sk

Author Štefan Urbánek, stefan@knowerce.sk

Date of creation 20.7.2010

Document revision 2

1. Document Restrictions
Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".


Contents

Introduction
Overview
    The Process
Jobs
    Download
    Parse
    Load Source
    Cleanse
    Create Cube
    Create search index
    Regis Download
    Geography Loading
    CPV Loading
Data
    Source Mirror
    Staging Data
    Datamart Datastore
Search Index
Installation
    Software Requirements
    Preparation
    ETL Database initialisation
Running ETL Jobs
    Launching


2. Introduction
This document describes the extraction, transformation and loading (ETL) process for public procurement
documents in Slovakia. The objective of the VVO project was to transform unstructured public
procurement announcement documents into structured form.

[Diagram: unstructured HTML documents → raw open data]

Source code: http://github.com/Stiivi/vvo-etl


Data source URL: http://www.e-vestnik.sk/
Application using the data: http://vestnik.transparency.sk


3. Overview
3.1. The Process
Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are:

[Process chain: Download → Parse → Load source → Cleanse → Create cube → Create search index, spanning source extraction, transformation and analytical transformation]

The reasons for creating several jobs instead of a single monolithic processing script are mainly: better
maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in
the future.
If a part of the chain fails, it is not necessary to run the whole chain again, just the chain from the
failed part onwards. This lowers the processing load and the network load on the source servers. For example, if cleansing
fails, it is not necessary to download the files again.
In addition to the processing jobs, there are three required but independent jobs:

Regis Extraction, Geography Loading, CPV Loading

Job                   Type      Description

Download              core      Download HTML documents from the source
Parse                 core      Parse HTML documents into structured form
Load source           core      Load structured form into database table
Cleanse               core      Cleanse data, fix values, map corrections
Create cube           core      Create analytical structure: fact table and dimensions
Create search index   core      Create search index for full-text searching with support for Slovak/ASCII searching
Regis Extraction      support   Extract list of all Slovak organisations
Geography loading     support   Load data from Slovak post office about regional break-down
CPV loading           support   Load CPV (common procurement vocabulary) data


4. Jobs
4.1. Download

[raw source documents → Download → local HTML files]

Inputs: HTML documents stored on the public procurement website
Outputs: HTML files stored locally
Configuration: public procurement website root, path to bulletin index, document encoding
Options: incremental mode (default), full mode (download all announcements)

At the site root one can find a paginated list of bulletins:

http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania

Following a bulletin link leads to a list of announcement types:

http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania


By clicking on a link with the desired public procurement type (“procurement results”), the list is expanded and
we get a list of all announcements within the bulletin:

http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07

Situation:
■ no data API provided by the website
■ no single list of all public procurements, only paginated browsing of bulletins
■ no proper HTML id attributes, nor unambiguous class attributes
■ table-based layout

Process
1. Download and parse the document index at the specified site root, get the number of pages
2. Download and parse all “bulletin list” pages; the output is the name and URL of each bulletin
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate the list of
bulletins to be downloaded (all of them if a full download is requested)
4. Download all announcements found on each bulletin page and save them into the download directory
5. Store the list of downloaded bulletins (a simplified sketch of this selection and download follows)
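
The following Ruby fragment is only a minimal sketch of the incremental selection and download step; the helpers list_bulletins and announcement_links, the directory paths and the state file are hypothetical illustrations, not the actual implementation (which runs inside the Brewery ETL framework).

require 'open-uri'
require 'yaml'

DOWNLOAD_DIR = '/var/lib/vvo-files/html'            # working directory (assumed)
STATE_FILE   = File.join(DOWNLOAD_DIR, 'downloaded_bulletins.yml')

# Bulletins already stored from previous runs (incremental mode)
downloaded = File.exist?(STATE_FILE) ? YAML.load_file(STATE_FILE) : []

# list_bulletins is a hypothetical helper that walks the paginated bulletin
# index at the site root and returns [{ :name => ..., :url => ... }, ...]
available = list_bulletins('http://www.e-vestnik.sk/')

# Full mode would skip this filter and take all available bulletins
to_download = available.reject { |b| downloaded.include?(b[:name]) }

to_download.each do |bulletin|
  # announcement_links is a hypothetical helper returning announcement URLs
  # found on one bulletin page
  announcement_links(bulletin[:url]).each do |url|
    html = open(url).read
    File.open(File.join(DOWNLOAD_DIR, File.basename(url) + '.html'), 'w') { |f| f.write(html) }
  end
  downloaded << bulletin[:name]
end

File.open(STATE_FILE, 'w') { |f| f.write(downloaded.to_yaml) }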

4.2. Parse

[local HTML files → Parse → YAML files]


Inputs: HTML documents with announcements, stored locally
Outputs: YAML structured files with parsed fields, one YAML file per announcement
Configuration: none
Options: none
Situation:
■ very messy HTML structure
■ ambiguous class attributes, misuse of class attributes
■ no usable id attributes
■ heavy table layout with nested tables; level 3 nesting is common (table in table in table)
■ sometimes broken layout, causing many parsing exceptions
■ values cannot be reliably indexed by referencing row numbers
■ inconsistent table layout – may or may not contain tbody
Document example:

http://www.e-vestnik.sk/EVestnik/Detail/16563

Example of the layout with emphasised-contrast CSS for better visibility (image not reproduced here):


Example of a broken layout, where the cyan values in the left column were supposed to be in the right
column (image not reproduced here):

Example of element nesting within a document, 24 levels deep:


html > body > #page > #container > #main > #innerMain > div >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td > span.hodnota

A situation like the one described above makes parsing of the public procurement documents tricky.
Rough document structure (as seen by a human reader):
■ document title
■ basic announcement information
■ parts of the announcement
■ each part of the announcement contains sections
■ each section contains a list of information pieces (I would not call them key-value pairs, as they are not)


Process
The whole document is parsed as an HTML document tree.
Strategies used:

■ Unicode regular expression matching


■ element references by element index (unstable, but sufficient for most cases) – instead of using
proper id/class attributes (which were missing), we used the index of the element that we wanted to
parse (see the sketch below)
■ because the structure was not consistent, sometimes searching for elements was necessary instead of
referencing them directly by path, which made processing a little slower
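
A small Nokogiri sketch of the index-based referencing strategy; the selectors and indexes below are illustrative only, not the ones actually used in the job.

require 'nokogiri'

doc = Nokogiri::HTML(File.read('announcement.html'))

# There are no usable id/class attributes on the inner tables, so elements
# are picked by their position. Because a <tbody> may or may not be present,
# rows are searched for rather than addressed by a fixed path.
main = doc.css('#innerMain table').first         # first layout table (illustrative)
rows = main ? main.css('tr') : []

# Basic announcement information is expected in the first rows; the exact
# indexes are unstable but sufficient for most documents.
header_cells = rows[0] ? rows[0].css('td') : []
title        = header_cells[1] ? header_cells[1].text.strip : nil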

1. read basic announcement information: date, announcement number, type
2. find the table with document parts and split the HTML document into subtrees, one per part
3. parse each part

Part parsing:
The main body of the document is a table containing cells which hold an optional part title and a part
body in the form of a table. The part body table contains anonymous rows with section contents in
two columns. The left column is used mostly for padding and might contain a section number. The right
column contains the information to be extracted. How parts and sections look is depicted in the
following picture:

[Figure: each <td> cell of the main table holds an optional <span> with the part title and a <table> with the part body; in the part body, a row whose left cell holds a number begins a section (section number and title), while rows with an empty left cell carry the section content in the right cell; the pattern repeats for further parts.]

It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken
into a list of table rows and the rows were parsed sequentially, as on a “tape” (a sketch follows the steps):
1. prepare a section structure
2. get the next row
3. if the left column contains a value, then it is the beginning of the next section
3.1. process the previous section, if there is any
3.2. prepare a new section structure
3.3. save the next section name into the section structure
4. if the left column is empty, then:
4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
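
A minimal Ruby sketch of this “tape” processing; part_rows is assumed to be the list of <tr> elements of one part body, and the result structure is simplified for illustration.

sections = []
current  = nil

part_rows.each do |row|
  cells = row.css('td')
  left  = cells[0] ? cells[0].text.strip : ''
  right = cells[1]

  if !left.empty?
    # A value in the left column marks the beginning of the next section
    sections << current if current        # store the previous section, if any
    current = { :number => left, :title => right ? right.text.strip : '', :rows => [] }
  elsif current && right
    # Empty left column: the right cell belongs to the current section
    current[:rows] << right
  end
end

sections << current if current            # do not forget the last section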

Section parsing:
After parsing the parts, each section structure contains the section title, the section number and a list of rows (cells
from the right column of the part table). The rows are processed sequentially as well.
Each set of section rows is parsed into field/value pairs using Unicode regexp matching (see the sketch below). Because
the naming of values was not consistent, multiple values/matches or more complex regular expressions had to be used.
The value keys had different wordings or used different words to describe the same value.
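
A sketch of the field/value matching with Unicode regular expressions; the field names and patterns are illustrative examples of the approach, not the actual set used in the job.

# Several alternative wordings may map to one target field, so each field is
# described by one (possibly complex) pattern with alternatives.
FIELD_PATTERNS = {
  :supplier_ico => /I[ČC]O\s*:?\s*(\d[\d ]*)/u,
  :final_value  => /(?:Konečná hodnota|Hodnota|Cena)[^\d]*([\d ]+(?:[,.]\d+)?)/u
}

def parse_section_rows(rows, record = {})
  rows.each do |row|
    text = row.text.strip
    FIELD_PATTERNS.each do |field, pattern|
      if (match = text.match(pattern))
        record[field] ||= match[1].strip
      end
    end
  end
  record
end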
Examples of section rows (rendered view vs. document HTML) are not reproduced here.



Part V contained a list of contracts and required separate parsing.

No heavy data cleansing is performed at this stage; only numerical values are fixed and text strings are trimmed.

Issues
■ fields with currency amounts came in many forms:
  ■ one amount (expected) or two amounts (expected and final)
  ■ a single amount or a from-to range
  ■ with or without currency
  ■ with or without a “VAT included” flag
  ■ with or without a VAT rate
■ there were no field name prefixes (such as “name:”, “phone:”) in all contacts; field order was used
in that case (not 100% reliable)
■ empty/bogus HTML nodes, sometimes preventing proper parsing

4.3. Load Source

[YAML files → Load source → contracts table (staging)]

Inputs: YAML structured files with parsed fields, one YAML per announcement
Outputs: populated staging database table with contracts
Configuration: none
Options: default mode (just load data), create mode (create DB structures)

Process
Simple mapping of the structured files into a database table (a sketch follows):

■ load each structured file and, for each contract in it:
  ■ insert the contract record into the table
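
A sketch of this loading step using the sequel gem; the connection string, directory and table name are assumptions for illustration.

require 'sequel'
require 'yaml'

DB        = Sequel.connect('postgres://localhost/vvo')   # connection assumed
contracts = DB[:sta_contracts]                            # staging contracts table (name assumed)

Dir.glob('/var/lib/vvo-files/yaml/*.yml').each do |path|
  announcement = YAML.load_file(path)
  (announcement['contracts'] || []).each do |contract|
    # Values are stored mostly as raw text; only currency amounts are numeric
    row = {}
    contract.each { |key, value| row[key.to_sym] = value }
    contracts.insert(row)
  end
end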


The table contains mostly unprocessed raw text values; numeric types are used only for currency amounts. The content
of the table mostly matches the information from the source documents.

4.4. Cleanse

[staging contracts table + REGIS (SK organisations) → Cleanse → clean data (fields with appropriate type and format) and an “unknown suppliers” map]

Inputs: populated staging database table with contracts
Outputs: cleaned staging data with consolidated suppliers
Configuration: none
Options: default mode (just load data), create mode (create DB structures)

Process
The goal of this job is to cleanse the data taken from the source and consolidate them. More specifically:

■ cleanse the organisation number (ICO) format (without validity checking)
■ coalesce values of short enumerations
■ consolidate date formats
■ add procurer additions into the procurers table
■ consolidate suppliers and add additions into the suppliers table

Suppliers Consolidation
Requirements:

■ a table with suppliers that might contain more information than is present in the REGIS database
■ the possibility to automatically correct errors in source documents, such as invalid IDs
■ collect all unknown IDs in a separate table for further correction

The presence and validity of the organisation identification number (ICO) in the source does not meet the quality
requirements. There are cases when the ICO does not match any organisation in the organisation
database. For those cases a mapping table is created where one can specify a mapping of invalid
company identifications to valid ones. There are two ways of corrective mapping:

■ map an organisation directly within a specific announcement:

[announcement №, organisation ID] → [correct organisation ID]


■ map unknown organisations:


[country, organisation ID, organisation name] → [correct organisation ID]

The process is depicted in the following image:

[Figure: records from sta_vvo_vysledky are matched against sta_regis; unknown suppliers are resolved through map_suppliers, coalesced Slovak suppliers are collected in tmp_coalesced_suppliers_sk, and newly found suppliers are appended to sta_suppliers.]

1. Try to find unknown suppliers
2. Coalesce the supplier: use the organisation ID from the suppliers table if found, otherwise use the one
obtained from the mapping table (sketched below).
3. Append newly found suppliers
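
The sketch below expresses this resolution logic with the sequel gem; the table and column names are simplified assumptions based on the diagram above.

regis   = DB[:sta_regis]                    # SK organisations (REGIS)
mapping = DB[:map_suppliers]                # corrective mapping table
unknown = DB[:map_unknown_organisations]    # collected unknown suppliers (name assumed)

# Returns a valid organisation ID for a supplier, or nil when it stays unknown.
def resolve_supplier_ico(ico, name, regis, mapping, unknown)
  return ico if regis.where(:ico => ico).count > 0             # known in REGIS

  corrected = mapping.where(:org_id => ico).get(:correct_org_id)
  return corrected if corrected                                 # fixed via mapping

  # Collect the unknown ID/name pair for later manual correction
  unknown.insert(:org_id => ico, :name => name) unless unknown.where(:org_id => ico).count > 0
  nil
end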

The reason for having a separate suppliers table is that it might be extended with more information
than is provided by the REGIS organisations database.


4.5. Create Cube

[cleansed staging data → Create cube → fact table, dimension tables, analytical model description]

Inputs: cleaned staging data
Outputs: fact table, dimension tables, analytical model description
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
This step creates and loads all structures for analytical processing:

■ fact table – one fact is one contract
■ dimensions:
  ■ supplier
  ■ procurer
  ■ process type
  ■ contract type
  ■ evaluation type
  ■ account sector
  ■ supplier geography

Process
1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values – if there are values in the source data that are not found in the
dimensions, mark them as “unknown” and add them into the dimension tables as new values
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown
fields

Create Fact Table

The fact table is created simply by transforming the cleansed data and joining it with the prepared dimension
tables, roughly as sketched below.
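
A sketch of the join, expressed as SQL run through the sequel gem; the table and column names are illustrative assumptions, not the actual schema.

require 'sequel'

DB = Sequel.connect('postgres://localhost/vvo')   # connection assumed, as in the loading sketch

# Populate the fact table from cleansed contracts, replacing raw codes and
# identifiers with dimension keys (all names are illustrative).
DB.run <<-SQL
  INSERT INTO ft_contracts (procurer_key, supplier_key, contract_type_key, contract_amount)
  SELECT dp.id, ds.id, dct.id, c.amount
  FROM sta_contracts c
  LEFT JOIN dm_procurer      dp  ON dp.ico   = c.procurer_ico
  LEFT JOIN dm_supplier      ds  ON ds.ico   = c.supplier_ico
  LEFT JOIN dm_contract_type dct ON dct.code = c.contract_type
SQL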


4.6. Create search index

[dimension tables → Create search index → dimension index and Sphinx search index]

Inputs: dimension tables
Outputs: Sphinx search index
Configuration: none
Options: none
This step creates an index of dimension values at searchable levels and indexes them with the Sphinx full-text
search indexer. The index is created using a Slovak character mapping, to allow search queries in plain
ASCII (without carons and accents).
The analytical model is a multidimensional cube in a star schema¹ with hierarchical dimensions that have
multiple levels. It would not be sufficient to create a full-text search index for each table, as we need to
know at what level the searched field was found. For this purpose a dimension index table is created.
The dimension index contains fields:

■ dimension
■ dimension key (reference to dimension row - whole dimension point)
■ level (for example: ‘county’, ‘region’ or ‘country’ in geography)
■ level key – value of level key attribute (for example: county code)
■ indexed field name
■ indexed field value
Sphinx indexes the dimension index table.
Example of a search query: ‘Bystri*’. There are several cities called “Bystrica”, such as “Banska
Bystrica”; however, there is also a region called “Banskobystricky” that will match the same query, and
we want to get both results – the higher level (region) and the detailed level (city).
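
A sketch of how such dimension index rows might be generated for a couple of dimensions; the dimension tables, levels and column names are assumptions for illustration.

require 'sequel'

DB        = Sequel.connect('postgres://localhost/vvo')
dim_index = DB[:dimension_index]    # table later indexed by Sphinx (name assumed)

# Searchable levels per dimension (illustrative)
SEARCHABLE_LEVELS = {
  :dm_geography => [:county, :region],
  :dm_cpv       => [:division, :category]
}

SEARCHABLE_LEVELS.each do |dimension, levels|
  DB[dimension].each do |row|
    levels.each do |level|
      # One index row per dimension row and level, so a match tells us both
      # where (dimension, level) and what (level key) was found
      dim_index.insert(
        :dimension   => dimension.to_s,
        :dim_key     => row[:id],
        :level       => level.to_s,
        :level_key   => row[:"#{level}_code"],
        :field_name  => "#{level}_name",
        :field_value => row[:"#{level}_name"]
      )
    end
  end
end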

4.7. Regis Download


Inputs: documents on the website of the Statistical Office of the Slovak Republic
Outputs: table with list of organisations in Slovakia
Configuration: source URL, document ID range, number of concurrent processing threads
Options: incremental download (default), full reload

¹ A fact table joined with dimension tables with no deeper references. All tables are joined to the fact table
directly; there are no chained joins such as FT - T1 - T2.


Process
Documents are downloaded sequentially by document ID from the source URL. The downloading is
done in batches of 50k documents (configurable) and in 20 parallel threads (configurable), roughly as sketched below.
Although the documents are labeled as HTML, they contain no valid HTML code and can be
considered text documents with HTML tags. The downloaded documents are stripped of HTML
tags and then parsed with regular expressions as plain-text documents.
The process of downloading and processing all documents takes about 2 hours on average, therefore it is
advised to run the process on a weekly basis.
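
A simplified Ruby sketch of the batched, multi-threaded download; the URL building, output paths and helper structure are assumptions, and the real batch size, thread count and source URL come from the job configuration.

require 'open-uri'
require 'thread'

THREAD_COUNT = 20        # configurable
BATCH_SIZE   = 50_000    # configurable

def download_batch(base_url, first_id)
  queue = Queue.new
  (first_id...(first_id + BATCH_SIZE)).each { |id| queue << id }

  workers = (1..THREAD_COUNT).map do
    Thread.new do
      while (id = (queue.pop(true) rescue nil))
        text = open("#{base_url}#{id}").read
        # The documents are not valid HTML: strip the tags and keep the rest
        # as plain text for later regular-expression parsing
        plain = text.gsub(/<[^>]*>/, ' ')
        File.open("regis/#{id}.txt", 'w') { |f| f.write(plain) }
      end
    end
  end
  workers.each(&:join)
end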

4.8. Geography Loading


Inputs: list of municipalities and counties from Slovak Post Office
Outputs: single de-normalised table with hierarchical geography information about Slovakia
Configuration: none
Options: none

Process
Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and
region names, into a single de-normalised table.

4.9. CPV Loading


Inputs: Multilingual wide CPV code table
Outputs: single de-normalised table with hierarchical CPV structure
Configuration: none
Options: none

Process
The Common Procurement Vocabulary (CPV) code table provided by the EU institutions is a linear structure
with tree-structure properties. This table is transformed into a de-normalised table with the tree
hierarchy levels in multiple columns, as sketched below.
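
One possible way to derive the hierarchy columns is sketched below; it assumes the usual CPV digit-prefix structure (division = first 2 digits, group = 3, class = 4, category = 5 digits of the 8-digit code), which is an illustration rather than a statement about the exact rule used in the job.

# Derive tree levels from an 8-digit CPV code such as "03110000".
def cpv_levels(code)
  digits = code[0, 8]
  {
    :division => digits[0, 2] + '000000',
    :group    => digits[0, 3] + '00000',
    :class    => digits[0, 4] + '0000',
    :category => digits[0, 5] + '000'
  }
end

cpv_levels('03110000')
# => {:division=>"03000000", :group=>"03100000", :class=>"03110000", :category=>"03110000"}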


5. Data
Overview:

[source documents → Source Mirror → Staging Data → Datamart]

There are three data stores:

■ source mirror on a file system
■ staging data database schema
■ datamart database schema
More detailed view:

[Figure: in the Source Mirror, source documents are downloaded as HTML files and parsed into YAML files; in Staging Data, the YAML files are loaded into staging contract data together with source lists, mappings and temporary tables, and then cleansed; in the Datamart, the cube creation step produces the fact table, dimensions and the logical model (metadata) that form the contracts cube.]


5.1. Source Mirror


The source mirror contains the downloaded original documents and the parsed, structured version of the
documents in YAML format. If the source becomes unavailable and it is desired to parse the files again
(more attributes gathered, different parsing method, bug fix), it can be done on the locally stored files.
Documents are not parsed directly into the database. Reasons:

■ YAML text file storage was a requirement
■ structured documents can be processed with other tools without any database server connection

5.2. Staging Data


Structured files are loaded into the database, into the staging data datastore (preferably a separate schema). The
files are loaded without any, or with only very minor, transformations. The table should be a 1:1 copy of the
structured files.
The staging data store contains:

■ lists/enumerations, for example ISO country region subdivisions
■ copies of various sources or preprocessed datasets, such as geography from the SK post office and the
registry of organisations (REGIS)
■ staging data for procurers and suppliers – might contain more information than provided by the
registry of organisations (REGIS)
■ maps – for mapping source values to desired values, coalescing and unifying:
  ■ map of unknown organisations – maps unknown org. names and org. codes to existing
  organisations
  ■ map of region names – region naming in REGIS differs from the official post office region
  registry
  ■ map of reference codes – maps full-text values, such as names of procurement types, to
  short codes (identifiers) that will be used as keys; also unifies similar names into the same code
■ temporary tables – tables used during the transformation process that are created only for the
purpose of a single transformation run (for example: coalesced suppliers according to REGIS,
mapped unknown organisations and existing registered organisations)

Some tables are appended with new data during the transformation process. New data are
added into:

■ the map of unknown organisations – for further fixing
■ new known organisations – for further update with additional information

5.3. Datamart Datastore


The Datamart Datastore, separate database schema, contains final data ready for analysis and
reporting. Structures in the schema are:

■ logical model metadata – description of the OLAP cube for contracts (Brewery framework
objects)
■ dimension tables – tables with dimension values (hierarchical)
■ fact table – cleansed table with procurement contracts, joinable with dimensions


The dimension tables together with the fact table in this schema form a snowflake schema.²
Brewery OLAP uses the structures in the datamart datastore to denormalize the snowflake schema
into a wide fact table suitable for analysis, aggregation and reporting. That means that the end-user –
the analyst – does not have to know about the physical structures behind the procurement contracts. He has
only one logical fact table where one row is one fact, that is, one contract. The logical metadata enable
the analyst to perform analysis on a multidimensional hierarchical structure.

² http://en.wikipedia.org/wiki/Snowflake_schema


6. Search Index
One of the requirements for the public procurements portal was the ability to search through the
data by many different fields. The nature of the final data is:

■ many fields, described by metadata – we should not rely on a fixed data structure,
■ hierarchical structure – we need to know at what level the value that we are searching for can be
found

Example of a search query: “chemical”. The word “chemical” might be contained in the subject type,
however at different levels: division, category, subcategory… We have to know the exact level where the
word appeared. If the word “chemical” is found at the division level, we want to report at the division level; if the
word is found at the category level, we want to aggregate at the category level, etc.

The Sphinx search engine can create one index for a table for a known set of fields. While searching,
we do not know in which field the value was found, only the document number (row). To make searching in
multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The
final table that is indexed contains:

■ string value of the indexed searchable field
■ dimension of the field (cpv, organisation, region, …)
■ dimension level of the field (division/category/subcategory, region/county, …)
■ level key of the indexed field
■ an index “document” id that will be returned by Sphinx


7. Installation
7.1. Software Requirements
• PostgreSQL database server
• ruby 1.9 (does not work with version 1.8)
• gems: sequel, data-mapper, nokogiri
• Sphinx
• Brewery from http://github.com/Stiivi/brewery/

7.2. Preparation

I. create a directory where working files, such as dumps and ETL files, will be stored, for example:
/var/lib/vvo-files
II. initialize and configure Brewery (see Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data

7.3. ETL Database initialisation


To initialize ETL database schema run the Brewery ETL tool:

etl initialize

This will create all necessary system tables. If you try to initialise a schema which already contains ETL
system tables, you will get an error message. This prevents you from overwriting existing data. To recreate
the schema and start with empty tables, execute the initialize command with the --force flag:

etl --force initialize


8. Running ETL Jobs


8.1. Launching

Manual Launching
Jobs are run by simply launching the etl tool:

etl run job_name

To manually run all daily jobs, you might use the following script:

#!/bin/bash
#

DEBUG='--debug'

etl $DEBUG run vvo_download
etl $DEBUG run vvo_parse
etl $DEBUG run vvo_load_source
etl $DEBUG run vvo_cleanse
etl $DEBUG run vvo_create_cube
etl $DEBUG run vvo_search_index

If a job fails, you need to re-run only the failed job and the jobs after it.
To do a full download instead of an incremental one, run:

etl run vvo_download all
