Using Data Provenance to Support Reproducibility in R

Date

2023-06-09

Abstract

The use of computers for data processing and analysis has transformed the approaches and capabilities of scientific research. Researchers can now process and draw conclusions from large volumes of data in relatively little time, expanding the breadth and efficiency of their work. Despite this shift, verifying results through multiple studies and experiments remains important. A 2019 National Academies report recommended more research and development to ensure that published scientific results are computationally reproducible, meaning that the same results can be derived from the original data and analysis methods. Computational reproducibility often requires not only the data and script but also information about the computing environment in which the results were produced, such as the operating system, language, and package versions, because software can behave differently when components of that environment change. One approach to reproducible research is therefore to collect this information about the scripts, data, and computing environment, which is known as data provenance. In the R language, the rdtLite package facilitates the collection of data provenance for a given script execution. This thesis focuses on developing methods that use data provenance as a blueprint for reconstructing a computing environment, and on conducting experiments that apply them to identify situations in which changes to the environment resulted in changes in script behavior.

Keywords

data provenance, data science, R programming language, reproducibility
