Capturing, persisting, and querying the provenance of scientific data

Date

2012-06-04

Abstract

Scientists rely on technology throughout data collection and processing. They often use software to handle massive datasets and produce scientific results, which they post on the web, making them readily available to the public. Flaws and differences in the way data is collected and processed can impair its usefulness for interpretation. To ensure the authenticity and reproducibility of a result, and to improve it by incorporating corrections into its processing, it is essential to be able to trace the provenance, or history, of that result. Data provenance is the information describing all entities, both procedures and data, that were involved in producing a result. We aim to create a software tool that provides provenance for scientific data analyses. The major issues in this research are collecting, persisting, querying, and visualizing the provenance. Provenance data is usually voluminous and therefore challenging to present in a meaningful way. The focus of my work is on persisting provenance and on developing an interface through which a non-programmer can pose interesting queries. Our example is a hydrological study at the Harvard Forest that measures stream discharge as a function of other quantities. From a definition of the process written in the graphical programming language Little-JIL, we generate a graph, the Data Derivation Graph (DDG), that documents the provenance of the data for each process execution. We store the DDG in an RDF (Resource Description Framework) database, making it available for querying. We provide a GUI that allows the scientist to query the provenance data without becoming an expert in database technology.
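
As an illustration of the idea of storing a DDG as RDF-style statements and querying it, the following is a minimal sketch in plain Python rather than the tool described above. Everything in it is an assumption made for the example: the node names, the `match` and `provenance` helpers, and the predicate names `used` and `wasGeneratedBy`, which are borrowed from the W3C PROV vocabulary and are not necessarily the ones the DDG uses. Each statement is a (subject, predicate, object) triple, and a provenance query walks the graph upstream from a result.

```python
from collections import deque

# Hypothetical DDG fragment as (subject, predicate, object) triples.
# Node and predicate names are illustrative, not taken from the thesis.
TRIPLES = [
    ("raw_stage_height", "type", "Data"),
    ("calibrate_sensor", "type", "Procedure"),
    ("calibrated_height", "type", "Data"),
    ("compute_discharge", "type", "Procedure"),
    ("stream_discharge", "type", "Data"),
    ("calibrate_sensor", "used", "raw_stage_height"),
    ("calibrated_height", "wasGeneratedBy", "calibrate_sensor"),
    ("compute_discharge", "used", "calibrated_height"),
    ("stream_discharge", "wasGeneratedBy", "compute_discharge"),
]

# Predicates that point from a node to something it was derived from.
UPSTREAM = ("used", "wasGeneratedBy")


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def provenance(triples, node):
    """List every upstream procedure and data node, nearest first."""
    found = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for predicate in UPSTREAM:
            for _, _, parent in match(triples, s=current, p=predicate):
                if parent not in found:
                    found.append(parent)
                    queue.append(parent)
    return found


if __name__ == "__main__":
    # Which procedures and data contributed to the final discharge value?
    for ancestor in provenance(TRIPLES, "stream_discharge"):
        print(ancestor)
```

A GUI for non-programmers could build patterns like those passed to `match` from form fields, so the scientist never writes a query by hand. A real RDF store would expose the same pattern matching through a query language such as SPARQL.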

Keywords

data provenance, workflow, Little-JIL, Data Derivation Graph
