Metadata-Driven Analysis Tools

John Porter, University of Virginia & M. Gastil-Buhl, University of California, Santa Barbara
Abstract:
One of the onerous but necessary tasks in performing quality control or analyzing data is creation of statistical programs that ingest data in a
wide variety of formats. Depending upon the number of different variables/attributes and the complexity of their format within data tables,
this task can take from a few minutes to an hour or more for each data table. When dealing with many different data tables, the cumulative
effort can be substantial. However, metadata-driven tools can make creation of basic statistical programs a
matter of seconds, not hours.
Adequate metadata should include all the information needed to read and understand the associated data. If structured, as in Ecological
Metadata Language (EML) metadata, the metadata elements needed to read data tables can be extracted automatically and used to write
simple code for use with statistical programs. These simple programs can input data, perform basic quality assurance checks, such as type
and range checks, and provide basic statistical summaries.
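As a rough illustration of how such elements might be pulled out of structured metadata, the sketch below reads attribute names, measurement scales, numeric bounds and date formats from a small XML fragment. The fragment is a simplified, hypothetical stand-in for EML (real EML documents are namespaced and carry many more elements), and the function name `read_attributes` is an assumption made for this example, not part of the tools described here.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical EML-like fragment used only for illustration.
EML_FRAGMENT = """
<dataTable>
  <entityName>weather.csv</entityName>
  <attributeList>
    <attribute>
      <attributeName>air temp</attributeName>
      <measurementScale>
        <interval>
          <numericDomain>
            <numberType>real</numberType>
            <bounds><minimum>-40</minimum><maximum>50</maximum></bounds>
          </numericDomain>
        </interval>
      </measurementScale>
    </attribute>
    <attribute>
      <attributeName>station</attributeName>
      <measurementScale><nominal/></measurementScale>
    </attribute>
    <attribute>
      <attributeName>obs_date</attributeName>
      <measurementScale>
        <dateTime><formatString>YYYY-MM-DD</formatString></dateTime>
      </measurementScale>
    </attribute>
  </attributeList>
</dataTable>
"""


def read_attributes(xml_text):
    """Return one dict per attribute with the fields needed to read the table."""
    root = ET.fromstring(xml_text)
    attributes = []
    for attr in root.iter("attribute"):
        scale = attr.find("measurementScale")
        # The first child of measurementScale names the scale (nominal, interval, ...).
        kind = scale[0].tag if scale is not None and len(scale) else None
        minimum = attr.findtext(".//bounds/minimum")
        maximum = attr.findtext(".//bounds/maximum")
        attributes.append({
            "name": attr.findtext("attributeName"),
            "scale": kind,
            "minimum": float(minimum) if minimum is not None else None,
            "maximum": float(maximum) if maximum is not None else None,
            "date_format": attr.findtext(".//dateTime/formatString"),
        })
    return attributes


if __name__ == "__main__":
    for attribute in read_attributes(EML_FRAGMENT):
        print(attribute)
```

Each resulting dictionary holds what a generated program would need in order to name a column, choose its type and apply a range check.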
With our colleagues, we have developed a variety of tools for transforming EML metadata into useful statistical programs for R, Matlab, SAS
or SPSS. Each program downloads the data (if available online), ingests all the data tables associated with a particular dataset, performs
basic QA/QC and creates simple statistical summaries. The programs leave the investigator with a set of pre-ingested, analysis-ready data.
The programs can then be modified by investigators to add new analysis steps.
EML metadata can also be used to drive web-based analysis environments to perform quality assurance analyses, map dataset locations and
even to create online graphics (see http://ngis.tfri.gov.tw/modules/modules_en/). Additionally, the Kepler and DataONE environments
support EML-based tools for automating analyses.
Goals:
Use metadata-driven tools to:
Reduce the time needed to ingest new data
Eliminate avoidable errors in reading data
Automatically identify errors in data
Resources:
Basic Web Service Help: http://www.vcrlter.virginia.edu/data/eml2/PASTAprogHelp.html
Detailed Web Service Help: http://www.vcrlter.virginia.edu/data/eml2/PASTAprogWebService.pdf
Web-portal for statistical code generation: http://www.vcrlter.virginia.edu/data/eml2/eml2stat.html
Stylesheets and other developer resources:
https://svn.lternet.edu/websvn/listing.php?repname=VCR&path=%2Ftrunk%2Feml_statistical_tools%2F
Other EML-driven tools: http://ngis.tfri.gov.tw/modules/modules_en
The file extension sets the type of program created (R, SAS, SPSS or Matlab)
Matlab stylesheet by Wade Sheldon. This material is based upon work supported by the National Science Foundation under Grant Nos. 1237733 & 1026851. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
System:
REST-based web services are implemented via PHP and Perl scripts
Ecological Metadata Language (EML) documents are fetched and transformed using XSLT stylesheets to generate statistical code in R, Matlab, SAS or SPSS
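The transformation step can be pictured with a small sketch. It uses Python string templating in place of the XSLT stylesheets the system actually relies on, and it emits only R. The function names `safe_name` and `generate_r_program`, the attribute dictionaries and the example URL are all assumptions made for illustration, and the generated R text is not the output of the real stylesheets.

```python
import re


def safe_name(name):
    """Replace characters that are awkward in variable names with underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name.strip())
    # Prefix names that are empty or start with a digit so they stay valid.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "v" + cleaned
    return cleaned


def generate_r_program(data_url, attributes):
    """Build a short R script as text from a list of attribute descriptions."""
    names = [safe_name(a["name"]) for a in attributes]
    col_names = ", ".join(f'"{n}"' for n in names)
    lines = [
        "# Generated from metadata (illustrative output only)",
        f'dt1 <- read.csv("{data_url}", header = TRUE, col.names = c({col_names}))',
    ]
    for attr, name in zip(attributes, names):
        if attr["scale"] in ("interval", "ratio"):
            lines.append(f"dt1${name} <- as.numeric(dt1${name})")
        elif attr["scale"] == "nominal":
            lines.append(f"dt1${name} <- as.factor(dt1${name})")
    lines.append("summary(dt1)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical attribute descriptions and data location.
    example_attributes = [
        {"name": "air temp", "scale": "interval"},
        {"name": "station", "scale": "nominal"},
    ]
    print(generate_r_program("https://example.org/weather.csv", example_attributes))
```

The point of the sketch is that the program text follows mechanically from the metadata, which is why a new data table needs no hand-written ingestion code.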
Results:
Statistical code includes, depending on the statistical package:
Variable name adjustment to meet package requirements (e.g., removal of spaces)
Data retrieval
Data ingestion
Variable type checking
Date conversions
Range checks
Statistical summaries
Code can then be extended to meet user analysis needs
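A minimal sketch of the type and range checks listed above, written in Python rather than in one of the generated languages. The function names `check_column` and `summarize` and the sample values are hypothetical. In a generated program the bounds would come from the metadata; here they are passed in by hand.

```python
from statistics import mean


def check_column(values, minimum=None, maximum=None):
    """Return (row_index, value, problem) tuples for one numeric column."""
    problems = []
    for i, raw in enumerate(values):
        try:
            number = float(raw)
        except (TypeError, ValueError):
            # Type check: the value cannot be read as a number.
            problems.append((i, raw, "not numeric"))
            continue
        # Range check against the bounds, when they are given.
        if minimum is not None and number < minimum:
            problems.append((i, raw, "below minimum"))
        elif maximum is not None and number > maximum:
            problems.append((i, raw, "above maximum"))
    return problems


def summarize(values):
    """Basic summary of the values that can be read as numbers."""
    numbers = []
    for raw in values:
        try:
            numbers.append(float(raw))
        except (TypeError, ValueError):
            continue
    if not numbers:
        return {"n": 0}
    return {"n": len(numbers), "min": min(numbers),
            "max": max(numbers), "mean": mean(numbers)}


if __name__ == "__main__":
    air_temp = ["12.5", "abc", "-60", "75"]  # hypothetical raw column values
    for problem in check_column(air_temp, minimum=-40, maximum=50):
        print(problem)
    print(summarize(air_temp))
```

Note that this sketch flags empty strings as "not numeric"; a real program would first need to recognize the missing-value codes declared in the metadata.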
Opportunities:
Embed web service into existing catalog interfaces
Create stylesheets for new packages (e.g., IDL, Octave)
Improve existing stylesheets to perform more sophisticated quality checks and conversions
Enhance web services to allow advanced data integration and plotting services, drawing on supplementary information not currently in metadata
Contact: jporter@LTERnet.edu
[Figure: workflow diagram. Labels: EML Metadata Document; Stylesheet or Program; Editor; Researcher; Human-readable Display; Traditional Way; Statistical Program; Analysis.]
Automatic generation of statistical code (red arrow) makes traditional, human-intensive ways of generating programs (grey box) unnecessary.