Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar; Omar Benjelloun; Costanza Conforti; Pieter Gijsbers; Joan Giner-Miguelez; Nitisha Jain; Michael Kuchnik; Quentin Lhoest; Pierre Marcenac; Manil Maskey; Peter Mattson; Luis Oala; Pierre Ruyssen; Rajat Shinde; Elena Simperl; Goeffry Thomas; Slava Tykhonov; Joaquin Vanschoren; Jos Van Der Velde; Steffen Vogler; Carole Jean Wu

doi:10.1145/3650203.3663326

Research output: Contribution to conference types › Paper › peer-review

1 Citation (Scopus)

Abstract

Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

Original language	English
Pages	1-6
Number of pages	6
DOIs	https://doi.org/10.1145/3650203.3663326
Publication status	Published - 9 Jun 2024
Event	8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - Santiago, Chile Duration: 9 Jun 2024 → …

Conference

Conference	8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
Country/Territory	Chile
City	Santiago
Period	9/06/2024 → …

Keywords

discoverability
ML datasets
reproducibility
responsible AI

Access to Document

10.1145/3650203.3663326

Cite this

Akhtar, M., Benjelloun, O., Conforti, C., Gijsbers, P., Giner-Miguelez, J., Jain, N., Kuchnik, M., Lhoest, Q., Marcenac, P., Maskey, M., Mattson, P., Oala, L., Ruyssen, P., Shinde, R., Simperl, E., Thomas, G., Tykhonov, S., Vanschoren, J., Van Der Velde, J., ... Wu, C. J. (2024). Croissant: A Metadata Format for ML-Ready Datasets. 1-6. Paper presented at 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024, Santiago, Chile. https://doi.org/10.1145/3650203.3663326

@conference{796cca2949624cc4811b79dba5a94c2f,
title = "Croissant: A Metadata Format for ML-Ready Datasets",
abstract = "Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.",
keywords = "discoverability, ML datasets, reproducibility, responsible AI",
author = "Mubashara Akhtar and Omar Benjelloun and Costanza Conforti and Pieter Gijsbers and Joan Giner-Miguelez and Nitisha Jain and Michael Kuchnik and Quentin Lhoest and Pierre Marcenac and Manil Maskey and Peter Mattson and Luis Oala and Pierre Ruyssen and Rajat Shinde and Elena Simperl and Goeffry Thomas and Slava Tykhonov and Joaquin Vanschoren and {Van Der Velde}, Jos and Steffen Vogler and Wu, {Carole Jean}",
note = "Publisher Copyright: {\textcopyright} 2024 Owner/Author.; 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 ; Conference date: 09-06-2024",
year = "2024",
month = jun,
day = "9",
doi = "10.1145/3650203.3663326",
language = "English",
pages = "1--6",
}

Akhtar, M, Benjelloun, O, Conforti, C, Gijsbers, P, Giner-Miguelez, J, Jain, N, Kuchnik, M, Lhoest, Q, Marcenac, P, Maskey, M, Mattson, P, Oala, L, Ruyssen, P, Shinde, R, Simperl, E, Thomas, G, Tykhonov, S, Vanschoren, J, Van Der Velde, J, Vogler, S & Wu, CJ 2024, 'Croissant: A Metadata Format for ML-Ready Datasets', Paper presented at 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024, Santiago, Chile, 9/06/2024 pp. 1-6. https://doi.org/10.1145/3650203.3663326

TY - CONF
T1 - Croissant
T2 - 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
AU - Akhtar, Mubashara
AU - Benjelloun, Omar
AU - Conforti, Costanza
AU - Gijsbers, Pieter
AU - Giner-Miguelez, Joan
AU - Jain, Nitisha
AU - Kuchnik, Michael
AU - Lhoest, Quentin
AU - Marcenac, Pierre
AU - Maskey, Manil
AU - Mattson, Peter
AU - Oala, Luis
AU - Ruyssen, Pierre
AU - Shinde, Rajat
AU - Simperl, Elena
AU - Thomas, Goeffry
AU - Tykhonov, Slava
AU - Vanschoren, Joaquin
AU - Van Der Velde, Jos
AU - Vogler, Steffen
AU - Wu, Carole Jean
PY - 2024/6/9
Y1 - 2024/6/9
N2 - Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.
AB - Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.
KW - discoverability
KW - ML datasets
KW - reproducibility
KW - responsible AI
UR - http://www.scopus.com/inward/record.url?scp=85196652122&partnerID=8YFLogxK
U2 - 10.1145/3650203.3663326
DO - 10.1145/3650203.3663326
M3 - Paper
AN - SCOPUS:85196652122
SP - 1
EP - 6
Y2 - 9 June 2024
ER -
