Slovak Public Procurement Announcements ETL
knowerce
Document information
Creator Knowerce, s.r.o.
Vavilovova 16
851 01 Bratislava
info@knowerce.sk
www.knowerce.sk
Document revision 2
1. Document Restrictions
Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".
Contents
Introduction
Overview
    The Process
Jobs
    Download
    Parse
    Load Source
    Cleanse
    Create Cube
    Create search index
    Regis Download
    Geography Loading
    CPV Loading
Data
    Source Mirror
    Staging Data
    Datamart Datastore
Search Index
Installation
    Software Requirements
    Preparation
    ETL Database initialisation
Running ETL Jobs
    Launching
2. Introduction
This document describes the extraction, transformation and loading (ETL) process for public procurement documents in Slovakia. The objective of the VVO project was to transform unstructured public procurement announcement documents into a structured form.
[Diagram: unstructured HTML documents transformed into structured raw open data]
3. Overview
3.1. The Process
Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are:

Download → Parse → Load source → Cleanse → Create cube → Create search index

The reasons for creating several jobs instead of a single monolithic processing script are mainly: better maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in the future.
If a part of the chain fails, it is not necessary to run the whole chain again, just the chain from the failed part onwards. This lowers the processing load and the network load on the source servers. For example, if cleansing fails, it is not necessary to download the files again.
In addition to the processing jobs, there are three required, however independent, jobs: Regis Extraction, Geography Loading and CPV Loading.

Create cube (core) – creates the analytical structure: fact table and dimensions
Create search index (core) – creates the search index for full-text searching with support for Slovak/ASCII searching
Geography loading (support) – loads data from the Slovak post office about the regional break-down
4. Jobs
4.1. Download
By browsing the source site we get the list of all bulletin issues:
http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania
A single bulletin issue is reached through a date-specific URL, for example:
http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania
By clicking on a link with the desired public procurement type (“procurement results”), the list is expanded and we get a list of all announcements within the bulletin:
http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07
Situation:
■ no data API provided by the website
■ no single list of all public procurements, only paginated browsing of bulletins
■ no proper HTML id attributes, nor unambiguous class attributes
■ layout done with tables
Process
1. Download and parse the document index at the specified site root, get the number of pages
2. Download and parse all “bulletin list” pages; the output is the name and URL of each bulletin
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate the list of bulletins to be downloaded (all of them if a full download is requested)
4. Download all announcements found on each bulletin page and save them into the download directory
5. Store the list of downloaded bulletins
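Step 3 of the process above (comparing available bulletins against the ones already downloaded) can be sketched in Ruby as follows. This is a minimal illustration; names are assumptions, and the real job also records bulletin URLs and the download directory:

```ruby
require 'set'

# available:  array of { name:, url: } hashes parsed from the bulletin list pages
# downloaded: array of bulletin names stored by a previous run
# Returns the bulletins that still need to be fetched.
def bulletins_to_download(available, downloaded, full_download = false)
  return available if full_download          # a full download ignores history
  done = Set.new(downloaded)
  available.reject { |bulletin| done.include?(bulletin[:name]) }
end
```

Keeping this comparison as a separate step is what allows incremental runs to skip bulletins that were already mirrored.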
4.2. Parse
An example announcement document:
http://www.e-vestnik.sk/EVestnik/Detail/16563
[Screenshot: example of the layout with emphasised-contrast CSS for better layout visibility]
[Screenshot: example of a broken layout, where the cyan values in the left column were supposed to be in the right column]
A situation like the one described above makes parsing of the public procurement documents tricky.
The rough document structure (as seen by a human user):
■ document title
■ basic announcement information
■ parts of the announcement
■ each part of the announcement contains sections
■ each section contains a list of information pieces (I would not call them key-value pairs, as they are not)
Process
The whole document was parsed as an HTML document tree.
Strategies used:
Part parsing:
The main body of the document is a table containing cells which hold an optional part title and a part body in the form of a table. The part body table contains anonymous rows with section contents in two columns. The left column is used mostly for padding and might contain a section number. The right column contains the information to be extracted. How the parts and sections look is depicted in the following picture:
<td>              part
  <span>          part title
  <table>         part body
    <tr>
      <td>        (empty)
      <td>        cell with content
It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken into a list of table rows and the rows were parsed sequentially, as on a “tape”:
1. prepare a section structure
2. get the next row
3. if the left column contains a value, then it is the beginning of the next section
3.1. process the previous section, if there is any
3.2. prepare a new section structure
3.3. save the next section name into the section structure
4. if the left column is empty, then:
4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
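The tape algorithm above can be sketched in Ruby roughly as follows. This is a simplified sketch operating on plain strings; the real job works on parsed HTML nodes:

```ruby
# rows: array of [left_cell, right_cell] string pairs taken from a part body table.
# Returns a list of section structures: { name:, rows: }.
def parse_sections(rows)
  sections = []
  current  = nil
  rows.each do |left, right|
    if left && !left.strip.empty?
      sections << current if current            # 3.1 process the previous section
      current = { :name => left.strip, :rows => [] }  # 3.2/3.3 start a new section
    elsif current
      current[:rows] << right                   # 4.1 collect right-column content
    end
  end
  sections << current if current                # flush the last open section
  sections
end
```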
Section parsing:
After parsing the parts, the section structure contains the section title, the section number and a list of rows (cells from the right column of a part table). The rows are processed sequentially as well.
Each set of section rows was parsed into field/value pairs using Unicode regexp matching. Because the naming of values was not consistent, multiple values/matches or more complex regular expressions had to be used. The value keys had different wordings or used different words to describe the same value.
[Screenshot: examples of section rows]
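A minimal sketch of the field/value matching, assuming hypothetical field names and far simpler patterns than the production expressions:

```ruby
# Canonical field names mapped to (illustrative) patterns covering the
# different wordings observed in the source documents.
FIELD_PATTERNS = {
  :procurer_name => /\A(?:Názov|Úradný názov)\s*:?\s*(.+)\z/u,
  :ico           => /\A(?:IČO|Identifikačné číslo)\s*:?\s*(\d+)/u
}

# Turns a list of section rows into a hash of field => value.
def parse_section_rows(rows)
  fields = {}
  rows.each do |row|
    FIELD_PATTERNS.each do |field, pattern|
      match = pattern.match(row.strip)
      fields[field] = match[1].strip if match
    end
  end
  fields
end
```

Alternative wordings of the same value are handled by listing them as alternations in one pattern, which keeps the canonical field name in a single place.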
No heavy data cleansing is performed; only numerical values are fixed and text strings trimmed.
Issues
■ fields with currency amounts came in many forms:
  ■ one amount (expected) or two amounts (expected and final)
  ■ a single amount or a from-to range
  ■ with or without currency
  ■ with or without a “VAT included” flag
  ■ with or without a VAT rate
■ not all contacts had field-name prefixes (such as “name:”, “phone:”); field order was used in that case (not 100% reliable)
■ empty/bogus HTML nodes, sometimes preventing proper parsing
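The amount-field variants listed above are why a single pattern is not enough. A hedged sketch of normalising one amount field; the pattern, the field names and the Slovak “s DPH” (“VAT included”) marker are illustrative:

```ruby
# Matches a number with optional thousands spaces and a decimal part.
NUMBER_RE = /\d[\d\s]*(?:[.,]\d+)?/

# Extracts one or two amounts, an optional currency and the VAT flag
# from a raw amount field; missing pieces come back as nil.
def parse_amount(text)
  amounts = text.scan(NUMBER_RE).map { |a| a.delete(' ').tr(',', '.').to_f }
  {
    :amount_from  => amounts[0],
    :amount_to    => amounts[1],            # nil for a single amount
    :currency     => text[/EUR|SKK/],
    :vat_included => !text[/s DPH/].nil?    # "s DPH" means "VAT included"
  }
end
```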
4.3. Load Source
Inputs: YAML structured files with parsed fields, one YAML per announcement
Outputs: populated staging database table with contracts
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
Process
A simple mapping of the structured files into a DB table:
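The mapping can be sketched as follows. The YAML keys and column names are illustrative (the real job maps many more fields); the Sequel insert shown in the comment assumes the staging table name from the cleansing diagram:

```ruby
require 'yaml'

# Illustrative mapping from parsed YAML keys to staging table columns.
FIELD_MAP = {
  'title'    => :contract_title,
  'procurer' => :procurer_name,
  'amount'   => :amount
}

# Turns one parsed announcement (a YAML document) into a row hash
# for the staging contracts table.
def staging_row(yaml_text)
  doc = YAML.load(yaml_text)
  FIELD_MAP.each_with_object({}) do |(key, column), row|
    row[column] = doc[key]
  end
end

# With the Sequel gem the row would then be inserted, for example:
#   DB[:sta_vvo_vysledky].insert(staging_row(File.read(path)))
```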
The table contains mostly unprocessed raw text values; only currency amounts are stored as numerics. The content of the table mostly matches the information from the source documents.
4.4. Cleanse
Process
The goal of this job is to cleanse the data taken from the source and consolidate them. More specifically:
Suppliers Consolidation
Requirements:
■ a table with suppliers that might contain more information than is present in the REGIS database
■ the possibility to automatically correct errors in source documents, such as invalid IDs
■ collect all unknown IDs in a separate table for further correction
The presence and validity of the organisation identification number (ICO) in the source does not match the quality requirements. There are cases where the ICO does not match any organisation in the organisation database. For those cases a mapping table is created where one can specify a mapping of invalid company identifications to valid ones. There are two ways of corrective mapping:
[Diagram: suppliers consolidation flow between the tables sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk and sta_suppliers; unknown suppliers are collected separately and new suppliers are added]
The reason for having a separate suppliers table is that it might be extended with more information than is provided by the REGIS organisations database.
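The corrective ICO mapping described above can be sketched like this. Method and variable names are illustrative; only the mapping-table name map_suppliers comes from the diagram:

```ruby
# An ICO from a source document is first corrected through the mapping
# table; IDs still unknown to the organisations database are collected
# for later manual correction.
def resolve_ico(ico, known_icos, map_suppliers, unknown_icos)
  corrected = map_suppliers.fetch(ico, ico)   # apply a manual correction, if any
  return corrected if known_icos.include?(corrected)
  unknown_icos << ico                         # collect for the correction table
  nil
end
```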
4.5. Create Cube
[Diagram: create cube – produces the fact table and the analytical model description]
Process
1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values – if there are values in the source data that are not found in the dimensions, mark them as “unknown” and add them into the dimension tables as new values
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown fields
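Step 4 above can be sketched as follows. The row structure is illustrative; the real job works on database tables:

```ruby
# Values present in the source data but missing from a dimension are
# appended to the dimension as new "unknown" entries so that every
# fact row can still be joined. Returns the added rows.
def fix_unknown_values(source_values, dimension_rows)
  known = dimension_rows.map { |row| row[:value] }
  added = (source_values.uniq - known).map do |value|
    { :value => value, :status => 'unknown' }
  end
  dimension_rows.concat(added)
  added
end
```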
4.6. Create Search Index
A dimension index table is created with the following fields:
■ dimension
■ dimension key (reference to dimension row - whole dimension point)
■ level (for example: ‘county’, ‘region’ or ‘country’ in geography)
■ level key – value of level key attribute (for example: county code)
■ indexed field name
■ indexed field value
Sphinx indexes the dimension index table.
A usage example for the search query ‘Bystri*’: there are several cities called “Bystrica”, such as “Banska Bystrica”; however, there is also a region called “Banskobystricky” that will match the same query, and we want to get both results – the higher level (region) and the detailed level (city).
1 Fact table joined with dimension tables with no deeper references. All tables are joined to the fact table directly; there are no chained joins FT – T1 – T2.
4.7. Regis Download
Process
Documents are downloaded sequentially by document ID from the source URL. The downloading is done in batches of 50k documents (configurable) and in 20 parallel threads (configurable).
In spite of the documents being labelled as HTML, they contain no valid HTML code and can be considered text documents with HTML tags. The downloaded documents are stripped of HTML tags and then parsed with regular expressions as plain-text documents.
The process of downloading and processing all documents takes 2 hours on average; therefore it is advised to run the process on a weekly basis.
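The batched, parallel downloading can be sketched with a thread-safe work queue. The batch size and thread count mirror the configurable defaults mentioned above; `fetcher` is a stand-in for the real HTTP download of one document:

```ruby
BATCH_SIZE   = 50_000
THREAD_COUNT = 20

# Downloads one batch of documents by ID in parallel threads.
def download_batch(first_id, fetcher)
  queue = Queue.new
  (first_id...(first_id + BATCH_SIZE)).each { |id| queue << id }
  threads = Array.new(THREAD_COUNT) do
    Thread.new do
      loop do
        id = queue.pop(true) rescue break   # non-blocking pop; stop when empty
        fetcher.call(id)
      end
    end
  end
  threads.each(&:join)                      # wait until the batch is finished
end
```

A shared queue keeps the threads busy regardless of how long individual documents take, which matters when some downloads stall.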
4.8. Geography Loading
Process
Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and region names, into a single de-normalised table.
4.9. CPV Loading
Process
The Common Procurement Vocabulary (CPV) code table provided by the EU institutions is a linear structure with tree-structure properties. This table is transformed into a de-normalised table with the tree hierarchy levels in multiple columns.
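The transformation can be sketched for a single code. A CPV code such as 03110000-5 encodes its tree position in its leading digits: two digits identify the division, three the group, four the class and five the category, while the digit after the dash is a check digit. The column names below are illustrative:

```ruby
# Splits one CPV code into de-normalised hierarchy-level columns.
def cpv_levels(code)
  digits = code[0, 8]                     # drop the "-N" check digit suffix
  {
    :division  => digits[0, 2] + '000000',
    :group     => digits[0, 3] + '00000',
    :cpv_class => digits[0, 4] + '0000',  # `class` is a Ruby keyword
    :category  => digits[0, 5] + '000'
  }
end
```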
5. Data
Overview:
[Diagram: data flow – source documents are downloaded and parsed into the Source Mirror (YAML files); load source populates the Staging Data (source contract data, staging contract data, mappings, temporary tables); create cube populates the Datamart (logical model metadata)]
Some tables are appended with new data during the transformation process. New data are added into:
■ logical model metadata – the description of the OLAP cube for contracts (Brewery framework objects)
■ dimension tables – tables with dimension values (hierarchical)
■ fact table – the cleansed table with procurement contracts, joinable with the dimensions
The dimension tables together with the fact table in this schema form a snowflake schema.2
Brewery OLAP uses the structures in the datamart datastore to denormalise the snowflake schema into a wide fact table suitable for analysis, aggregation and reporting. This means that the end user – the analyst – does not have to know about the physical structures behind the procurement contracts. He has only one logical fact table where one row is one fact, that is, one contract. The logical metadata enable the analyst to perform analysis on a multidimensional hierarchical structure.
2 http://en.wikipedia.org/wiki/Snowflake_schema
6. Search Index
One of the requirements for the public procurements portal was to be able to search through the data by many different fields. The nature of the final data is:
■ many fields, described by metadata – we should not rely on a fixed data structure,
■ hierarchical structure – we need to know at what level the value that we are searching for can be found
An example search query: “chemical”. The word “chemical” might be contained in the subject type, but at different levels: division, category, subcategory… We have to know the exact level where the word appeared. If the word “chemical” is found at the division level, we want to report at the division level; if the word is found at the category level, we want to aggregate at the category level, etc.
The Sphinx search engine can create one index per table for a known set of fields. While searching, we do not know in which field the value was found, only the document number (row). To make searching in multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The final table that is indexed contains the fields listed in the Create search index job description above.
7. Installation
7.1. Software Requirements
• PostgreSQL database server
• ruby 1.9 (does not work with version 1.8)
• gems: sequel, data-mapper, nokogiri
• Sphinx
• Brewery from http://github.com/Stiivi/brewery/
7.2. Preparation
I. create a directory where working files, such as dumps and ETL files, will be stored, for example:
/var/lib/vvo-files
II. initialize and configure Brewery (see Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data
7.3. ETL Database initialisation
To initialise the ETL database structures, run:
etl initialize
This will create all necessary system tables. If you try to initialise a schema which already contains the ETL system tables, you will get an error message. This prevents you from overwriting existing data. To recreate the schema and start with empty tables, execute the initialize command with the --force flag:
etl initialize --force
8. Running ETL Jobs
8.1. Launching
Manual Launching
Jobs are run by simply launching the etl tool.
To manually run all the daily jobs, you might use the following script:
#!/bin/bash
#
DEBUG='--debug'
If a job fails, you only have to run the jobs from the failed job onwards.
To do a full download instead of an incremental one, do: