
Data Cleaning

Tools and Methodologies

Arthur D. Chapman
Australia / Brazil

Centro de Referência em Informação Ambiental


TDWG- Lisbon Oct 2003
Background

• ERIN/CRIA
• speciesLink
• FAPESP/Biota



Species Data

• Museum/Herbarium
• Observation
• Survey



Data Error

• Names
• Geocode
• Altitude
• Collectors
• Dates



Data quality - fitness for use



Methods for geocode validation

• Internal Database Checks


• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers



Internal Database Checks

• Internal inconsistencies
• Checking one field against another
– Text location vs geocode
• Checking one database against another
– Gazetteers
– DEM
– Collectors
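The checks listed above can be sketched in a few lines. This is a minimal illustration, assuming each record carries a locality name plus decimal latitude and longitude. The field names, the single gazetteer entry (approximate coordinates) and the 25 km tolerance are hypothetical choices made for the example, not values from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_record(record, gazetteer, tolerance_km=25.0):
    """Return a list of problems found in one record (empty list = no problem found)."""
    problems = []
    lat, lon = record["lat"], record["lon"]
    # Internal inconsistency: coordinates that cannot exist.
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if problems:
        return problems
    # One field against another: text location vs geocode, via a gazetteer.
    place = gazetteer.get(record["locality"])
    if place is None:
        problems.append("locality not in gazetteer")
    elif haversine_km(lat, lon, place[0], place[1]) > tolerance_km:
        problems.append("geocode far from named locality")
    return problems


# Hypothetical one-entry gazetteer: name -> (lat, lon), approximate values.
GAZETTEER = {"Campinas": (-22.9, -47.06)}

good = {"locality": "Campinas", "lat": -22.9, "lon": -47.06}
sign_dropped = {"locality": "Campinas", "lat": 22.9, "lon": -47.06}
```

A dropped minus sign, as in `sign_dropped`, is the kind of error this check is meant to surface. The same pattern extends to altitude against a DEM by swapping the distance test for an elevation difference.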



Geographic outliers - GIS
• Country, State, named district, etc.
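A check of a point against a named administrative unit reduces to a point-in-polygon test. The sketch below uses the standard ray-casting method in plain Python. The two rectangular "states" and the record fields are hypothetical, and a real workflow would read boundaries from a GIS layer instead.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: polygon is a list of (lon, lat) vertices, not closed."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def check_admin_unit(record, boundaries):
    """Compare the state named on the record with the polygon the point falls in."""
    for name, polygon in boundaries.items():
        if point_in_polygon(record["lon"], record["lat"], polygon):
            if name == record["state"]:
                return "ok"
            return "mismatch: falls in " + name
    return "outside all polygons"


# Hypothetical boundaries: two adjacent rectangles.
BOUNDARIES = {
    "StateA": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "StateB": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
```

The three return values correspond to the cases a curator would want to see separately: consistent records, records whose coordinates put them in a different unit from the one named on the label, and records that fall outside every polygon (for example, in the sea).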





Geographic Outliers - GIS
• Collectors – location vs date
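One way to compare a collector's locations against dates is to order that collector's records by date and ask whether the implied travel between consecutive records is plausible. The following is a sketch under stated assumptions: the tuple layout and the 100 km per day limit are hypothetical, and a sensible limit depends on the era and mode of travel.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_moves(records, max_km_per_day=100.0):
    """records: (collection_date, lat, lon) tuples for ONE collector.

    Returns (earlier_record, later_record, km) for each consecutive pair
    whose implied daily travel exceeds max_km_per_day.
    """
    ordered = sorted(records, key=lambda r: r[0])
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        # Same-day records are treated as one day apart to avoid dividing by zero.
        days = max((cur[0] - prev[0]).days, 1)
        km = haversine_km(prev[1], prev[2], cur[1], cur[2])
        if km / days > max_km_per_day:
            flagged.append((prev, cur, km))
    return flagged


# Hypothetical itinerary: the third record is almost 10 degrees further south
# only one day after the second.
trip = [
    (date(1890, 3, 1), -10.0, -50.0),
    (date(1890, 3, 2), -10.2, -50.0),
    (date(1890, 3, 3), -20.0, -50.0),
]
suspect = implausible_moves(trip)
```

A flagged pair says only that one of the two records is suspect (date or geocode); deciding which one still needs the label or field notes.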



Environmental Outliers
• Cumulative Frequency Curves
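A cumulative frequency curve for one climate parameter is simply the sorted values paired with the fraction of records at or below each value. The sketch below builds the curve and lists records that sit in the extreme tails. The plotting position `(rank + 0.5) / n` and the default tail cut-offs are assumptions made for this example, not the exact procedure of any particular package.

```python
def cumulative_frequency(values):
    """Return (value, cumulative fraction) pairs, lowest value first."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def tail_records(values, lower=0.025, upper=0.975):
    """Indices of records whose position on the curve lies in either tail."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    flagged = []
    for rank, idx in enumerate(order):
        position = (rank + 0.5) / n  # assumed plotting position
        if position < lower or position > upper:
            flagged.append(idx)
    return sorted(flagged)


# Hypothetical annual mean temperatures: 38 clustered values plus two extremes.
temps = [15 + 0.1 * i for i in range(38)] + [2.0, 31.0]
curve = cumulative_frequency(temps)
candidates = tail_records(temps)
```

With enough records a percentile rule always flags something, so the output is a list of candidates for checking, not a list of errors. A long flat step at either end of the plotted curve is what marks a genuine outlier.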



[Figure: Acacia orites - 19 records - 9 temperature parameters. Y-axis: Temperature (C), 0 to 35; x-axis: the nine temperature parameters.]

Reverse Jack-knife



Outliers in climate space
(T=0.95(√n)+0.2)
where ‘n’ is the number of records
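The critical value T can be computed directly; for 19 records it is about 4.34. The slide gives only this threshold, so the gap statistic in the sketch below is an assumption: each gap between neighbouring sorted values is weighted by its distance from the mean, scaled by the spread of those weighted gaps, and compared with T. Treat it as an illustration of the idea, not as the author's exact procedure.

```python
import math


def reverse_jackknife_threshold(n):
    """Critical value T = 0.95 * sqrt(n) + 0.2 for n records."""
    return 0.95 * math.sqrt(n) + 0.2


def reverse_jackknife_outliers(values):
    """Return the distinct values flagged as outliers (assumed formulation)."""
    x = sorted(set(values))
    n = len(x)
    if n < 3:
        return []
    mx = sum(x) / n
    # Weighted gap between each pair of neighbouring sorted values.
    y = []
    for lo, hi in zip(x, x[1:]):
        if lo < mx:
            y.append((hi - lo) * (mx - lo))
        else:
            y.append((hi - lo) * (hi - mx))
    my = sum(y) / len(y)
    sd = math.sqrt(sum((v - my) ** 2 for v in y) / len(y))
    if sd == 0:
        return []
    t = reverse_jackknife_threshold(n)
    flagged = set()
    for i, gap in enumerate(y):
        if gap / sd > t:
            if x[i] < mx:
                # Large gap below the mean: everything at or below it is suspect.
                flagged.update(v for v in x if v <= x[i])
            else:
                # Large gap above the mean: everything beyond it is suspect.
                flagged.update(v for v in x if v >= x[i + 1])
    return sorted(flagged)


# Hypothetical parameter values: 49 evenly spaced values and one extreme.
sample = list(range(100, 149)) + [1000]
```

Because T grows with the square root of n, larger data sets need a proportionally larger gap before a record is flagged.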



FloraMap

• CIAT (Colombia)
• PCA
• Cluster Analysis
• USD 100
• Modelling
• 10-minute grids



Principal Components Analysis - FloraMap

Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data.

A. Principal Components Analysis.
B. Specimen record.
C. Mapped specimen.
D. Climate profile.



Cluster Analysis - FloraMap

Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data.

A. Cluster Analysis.
B. Principal Components Analysis.
C. Mapped specimen.
D. Climate profile.
E. Specimen record.



Diva-GIS
• Free
• Simple GIS
• Modelling (BIOCLIM/Domain)
• Data Cleaning Tools



Diva-GIS – Coordinate Check

Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. The input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file (Hijmans et al. 2003).



Points outside Polygon – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the linking between the results dialogue box and the mapped record.



Mismatched Provinces – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records that do not match the set relationships between the specimen point file and the polygon of Bolivian provinces. The highlighted record is one where the geocoding on the specimen record causes it to fall in the wrong province.



Assign Coordinates – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records with geocodes automatically assigned. A. Unambiguous geocodes found by the program and assigned. B. Ambiguous geocodes identified. C. Appropriate geocodes not found.
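The three outcomes described here (assigned, ambiguous, not found) follow naturally from a gazetteer lookup that can return zero, one or several candidates. The sketch below is not Diva-GIS code; the gazetteer contents are illustrative, with approximate coordinates, and the function name is hypothetical.

```python
def assign_geocode(locality, gazetteer):
    """Look up a locality name and classify the result.

    gazetteer maps a lower-case place name to a list of (lat, lon) candidates.
    """
    candidates = gazetteer.get(locality.strip().lower(), [])
    if len(candidates) == 1:
        return ("assigned", candidates[0])
    if len(candidates) > 1:
        # Several places share the name: leave the choice to a person.
        return ("ambiguous", candidates)
    return ("not found", None)


# Illustrative gazetteer; coordinates are approximate and not authoritative.
GAZETTEER = {
    "rurrenabaque": [(-14.44, -67.53)],
    "santa rosa": [(-10.6, -67.3), (-17.1, -63.6)],
}

results = {name: assign_geocode(name, GAZETTEER)
           for name in ["Rurrenabaque", "Santa Rosa", "Unknown Camp"]}
```

Returning every candidate for an ambiguous name, instead of silently picking the first, keeps the decision with someone who can consult the rest of the label.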



Multiple possibilities – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing alternate geocodes for a record where use of the Gazetteer has produced a number of credible alternatives.



Cumulative Frequency Curves – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5th percentile.



Bioclimatic Envelope – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut-off is set at 95. Red points on the envelope correspond with red points on the map; green points in the envelope correspond with yellow points on the map.
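A rectilinear climate envelope can be sketched as a pair of bounds per parameter, with a record counted as outside if any one parameter falls beyond its bounds. The example assumes that a cut-off of 95 means the central 95 percent of records for each parameter, which may differ from how Diva-GIS defines it. The parameter names `tann` and `rain` and the data are hypothetical.

```python
def envelope_bounds(records, parameters, cutoff_percent=95):
    """Per-parameter (lower, upper) bounds after trimming both tails equally."""
    n = len(records)
    # Number of records trimmed from each end, in integer arithmetic.
    trim = n * (100 - cutoff_percent) // 200
    bounds = {}
    for p in parameters:
        ordered = sorted(r[p] for r in records)
        bounds[p] = (ordered[trim], ordered[n - 1 - trim])
    return bounds


def outside_envelope(records, parameters, cutoff_percent=95):
    """Indices of records with at least one parameter outside the envelope."""
    bounds = envelope_bounds(records, parameters, cutoff_percent)
    flagged = []
    for i, r in enumerate(records):
        if any(not (bounds[p][0] <= r[p] <= bounds[p][1]) for p in parameters):
            flagged.append(i)
    return flagged


# Hypothetical data: 39 records along a gradient plus one cold record.
records = [{"tann": 20 + 0.1 * i, "rain": 1000 + 10 * i} for i in range(39)]
records.append({"tann": 5.0, "rain": 1200})
flagged = outside_envelope(records, ["tann", "rain"])
```

Here the cold record (index 39) is flagged together with the two records at the ends of the gradient (indices 0 and 38). Trimming by percentile always marks the edge records, so the envelope identifies records worth inspecting on the map; it does not prove an error.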



ANUCLIM
• AUD 1000 (with data files)
• Modelling (BIOCLIM / ESOCLIM)
• Cumulative Frequency Curves
• Parameter Extremes



Cumulative Frequency - ANUCLIM

Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).



Parameter extremes - ANUCLIM

Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).



Statistical Tests
• Outliers in Latitude
• Outliers in Altitude
• Outliers in collector's range per day or week
– Especially 17th, 18th and 19th century collections
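The slide does not name a specific statistical test. As one possible choice, the sketch below flags outliers in a single numeric field such as altitude or latitude using the modified z-score based on the median absolute deviation (MAD), with the conventional constant 0.6745 and cut-off 3.5. Both the choice of test and the sample altitudes are assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical: the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical altitudes in metres for one species; the last looks like a
# feet/metres mix-up or a misplaced digit.
altitudes = [200, 210, 220, 230, 240, 250, 2600]
suspect = mad_outliers(altitudes)
```

A median-based measure is used because a single extreme record inflates the mean and standard deviation enough to hide itself in small samples.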



Thank You…

Questions?
