FAIR data pipeline: provenance-driven data management for traceable scientific workflows

Sonia Natalie Mitchell, Andrew Lahiff, Nathan Cummings, Jonathan Hollocombe, Bram Boskamp, Ryan Field, Dennis Reddyhoff, Kristian Zarebski, Antony Wilson, Bruno Viola, Martin Burke, Blair Archibald, Paul Bessell, Richard Blackwell, Lisa A. Boden, Alys Brett, Sam Brett, Ruth Dundas, Jessica Enright, Alejandra N. Gonzalez-BeltranClaire Harris, Ian Hinder, Christopher David Hughes, Martin Knight, Vino Mano, Ciaran McMonagle, Dominic Mellor, Sibylle Mohr, Glenn Marion, Louise Matthews, Iain J. McKendrick, Christopher Mark Pooley, Thibaud Porphyre, Aaron Reeves, Edward Townsend, Robert Turner, Jeremy Walton, Richard Reeve*

*Corresponding author for this work

    Research output: Contribution to journalArticlepeer-review

    4 Citations (Scopus)
    9 Downloads (Pure)

    Abstract

    Modern epidemiological analyses to understand and combat the spread of disease depend critically on access to, and use of, data. Rapidly evolving data, such as data streams changing during a disease outbreak, are particularly challenging. Data management is further complicated by data being imprecisely identified when used. Public trust in policy decisions resulting from such analyses is easily damaged and is often low, with cynicism arising where claims of ‘following the science’ are made without accompanying evidence. Tracing the provenance of such decisions back through open software to primary data would clarify this evidence, enhancing the transparency of the decision-making process. Here, we demonstrate a Findable, Accessible, Interoperable and Reusable (FAIR) data pipeline. Although developed during the COVID-19 pandemic, it allows easy annotation of any data as they are consumed by analyses, or conversely traces the provenance of scientific outputs back through the analytical or modelling source code to primary data. Such a tool provides a mechanism for the public, and fellow scientists, to better assess scientific evidence by inspecting its provenance, while allowing scientists to support policymakers in openly justifying their decisions. We believe that such tools should be promoted for use across all areas of policy-facing research. This article is part of the theme issue ‘Technical challenges of modelling real-life epidemics and examples of overcoming these’.
    Original languageEnglish
    Article number20210300
    Number of pages23
    JournalPhilosophical transactions. Series A, Mathematical, physical, and engineering sciences
    Volume380
    Issue number2233
    Early online date15 Aug 2022
    DOIs
    Publication statusPrint publication - 3 Oct 2022

    Keywords

    • FAIR
    • provenance
    • data management
    • epidemiology
    • modelling
    • COVID-19

    Fingerprint

    Dive into the research topics of 'FAIR data pipeline: provenance-driven data management for traceable scientific workflows'. Together they form a unique fingerprint.

    Cite this