A survey of data quality requirements that matter in ML development pipelines

Maria Priestley; Fionntán O’Donnell; Elena Simperl

doi:10.1145/3592616

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

248 Downloads (Pure)

Abstract

The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users - typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical "fitness-for-use" view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.

Original language	English
Article number	3592616
Journal	Journal of Data and Information Quality
Volume	15
Issue number	2
Early online date	19 Apr 2023
DOIs	https://doi.org/10.1145/3592616
Publication status	Published - 22 Jun 2023

Access to Document

10.1145/3592616

A survey of data quality_PRIESTLEY_2023_GREEN AAM (Accepted author manuscript, 4.41 MB)

Cite this

@article{cd7e33155a9d443395812a0c85aac387,

title = "A survey of data quality requirements that matter in ML development pipelines",

abstract = "The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users - typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical {"}fitness-for-use{"} view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.",

author = "Maria Priestley and Fionnt{\'a}n O{\textquoteright}Donnell and Elena Simperl",

note = "Funding Information: This work was partly funded by the European Union{\textquoteright}s Horizon 2020 research and innovation programme under the projects EUHubs4Data (grant 951771) and MediaFutures (grant 951962). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: {\textcopyright} 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.",

year = "2023",

month = jun,

day = "22",

doi = "10.1145/3592616",

language = "English",

volume = "15",

journal = "Journal of Data and Information Quality",

issn = "1936-1955",

publisher = "Association for Computing Machinery (ACM)",

number = "2",

}

TY - JOUR

T1 - A survey of data quality requirements that matter in ML development pipelines

AU - Priestley, Maria

AU - O’Donnell, Fionntán

AU - Simperl, Elena

N1 - Funding Information: This work was partly funded by the European Union’s Horizon 2020 research and innovation programme under the projects EUHubs4Data (grant 951771) and MediaFutures (grant 951962). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.

PY - 2023/6/22

Y1 - 2023/6/22

N2 - The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users - typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical "fitness-for-use" view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.

AB - The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users - typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical "fitness-for-use" view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.

UR - http://www.scopus.com/inward/record.url?scp=85161538719&partnerID=8YFLogxK

U2 - 10.1145/3592616

DO - 10.1145/3592616

M3 - Article

SN - 1936-1955

VL - 15

JO - Journal of Data and Information Quality

JF - Journal of Data and Information Quality

IS - 2

M1 - 3592616

ER -
