Using Data Provenance to Support Reproducibility in R

Fabrega, Sean

Files

Thesis_Draft_Archive_Version.pdf (4.4 MB)

Date

2023-06-09

Authors

Fabrega, Sean

Abstract

The use of computers for data processing and analysis has dramatically transformed the approaches and capabilities of scientific research. Today, researchers are able to process and draw conclusions from large volumes of data in relatively little time, expanding the breadth and efficiency of their work. Despite this shift, verifying results through multiple studies and experiments will always remain important. A 2019 National Academies report recommended more research and development to ensure published scientific results are computationally reproducible, meaning the same results can be derived from the original data and analysis methods. Often, computational reproducibility requires information about the computing environment – such as the operating system, language, and package versions where the results were produced – as well as the data and script. This is because software can behave differently when components of the computing environment change. Therefore, an approach to reproducible research involves collecting all of the information about the scripts, data, and computing environment, also known as data provenance. In the R language, the rdtLite package facilitates the collection of data provenance for a given script execution. This thesis will focus on developing methods that use data provenance as a blueprint for reconstructing a computing environment and conducting experiments that apply this tool to identify situations in which changes to the environment resulted in changes in script behavior.

Keywords

data provenance, data science, R programming language, reproducibility

URI

http://hdl.handle.net/10166/6427

Collections

Student Theses and Honors Collection
