Capturing, persisting, and querying the provenance of scientific data

dc.contributor: Boose, Emery
dc.contributor: Andrews, Christopher
dc.contributor: Dobosh, Paul
dc.contributor: St. John, Audrey
dc.contributor.advisor: Lerner, Barbara
dc.contributor.author: Taskova, Sofiya
dc.date.accessioned: 2012-06-04T12:14:36Z
dc.date.available: 2012-06-04T12:14:36Z
dc.date.gradyear: 2012 (en_US)
dc.date.issued: 2012-06-04
dc.description.abstract: Scientists rely on technology at every stage of collecting and processing data. They often use software to handle massive datasets and produce scientific results, which they post on the web, making them readily available to the public. Flaws and differences in the way data is collected and processed can impair its usefulness for interpretation. To ensure the authenticity and reproducibility of a result, and to improve it by incorporating corrections to its processing, it is essential to be able to trace the provenance, or history, of that result. Data provenance is defined as the information describing all entities, both procedures and data, that were involved in producing a result. We aim to create a software tool that provides provenance for scientific data analyses. The major issues in this research are collecting, persisting, querying, and visualizing the provenance. The volume of provenance data is usually massive, which makes it challenging to present in a meaningful way. The focus of my work is on persisting provenance and developing an interface through which a non-programmer can pose interesting queries. We are using the example of a hydrological study at the Harvard Forest that measures stream discharge as a function of other quantities. We use a definition of the process written in the graphical programming language Little-JIL to generate a graph (Data Derivation Graph, or DDG) documenting the provenance of the data for each process execution. We store the DDG in an RDF (Resource Description Framework) database, making it available for querying. We provide a GUI that allows the scientist to query the provenance data without becoming an expert in database technology. (en_US)
dc.description.sponsorship: Computer Science (en_US)
dc.identifier.uri: http://hdl.handle.net/10166/1048
dc.language.iso: en_US (en_US)
dc.rights.restricted: public
dc.subject: data provenance (en_US)
dc.subject: workflow (en_US)
dc.subject: Little-JIL (en_US)
dc.subject: Data Derivation Graph (en_US)
dc.title: Capturing, persisting, and querying the provenance of scientific data (en_US)
dc.type: Thesis
mhc.degree: Undergraduate (en_US)
mhc.institution: Mount Holyoke College

Files

Original bundle
Name: finaldraft.pdf
Size: 805.49 KB
Format: Adobe Portable Document Format
Description: Main article
License bundle
Name: license.txt
Size: 1.9 KB
Format: Item-specific license agreed upon to submission