TY - GEN
T1 - Simulated domain-specific provenance
AU - Alper, Pinar
AU - Fairweather, Elliot
AU - Curcin, Vasa
PY - 2018/1/1
Y1 - 2018/1/1
N2 - The main driver for provenance adoption is the need to collect and understand knowledge about the processes and data that occur in some environment. Before analytical and storage tools can be designed to address this challenge, exemplar data is required both to prototype the analytical techniques and to design infrastructure solutions. Previous attempts to address this requirement have tried to use existing applications as a source; either by collecting data from provenance-enabled applications or by building tools that can extract provenance from the logs of other applications. However, provenance sourced this way can be one-sided, exhibiting only certain patterns, or exhibit correlations or trends present only at the time of collection, and so may be of limited use in other contexts. A better approach is to use a simulator that conforms to explicitly specified domain constraints, and generate provenance data synthetically, replicating the patterns, rules and trends present within the target domain; we describe such a constraint-based simulator here. At the heart of our approach are templates - abstract, reusable provenance patterns within a domain that may be instantiated by concrete substitutions. Domain constraints are configurable and solved using a Constraint Satisfaction Problem solver to produce viable substitutions. Workflows are represented by sequences of templates using probabilistic automata. The simulator is fully integrated within our template-based provenance server architecture, and we illustrate its use in the context of a clinical trials software infrastructure.
AB - The main driver for provenance adoption is the need to collect and understand knowledge about the processes and data that occur in some environment. Before analytical and storage tools can be designed to address this challenge, exemplar data is required both to prototype the analytical techniques and to design infrastructure solutions. Previous attempts to address this requirement have tried to use existing applications as a source; either by collecting data from provenance-enabled applications or by building tools that can extract provenance from the logs of other applications. However, provenance sourced this way can be one-sided, exhibiting only certain patterns, or exhibit correlations or trends present only at the time of collection, and so may be of limited use in other contexts. A better approach is to use a simulator that conforms to explicitly specified domain constraints, and generate provenance data synthetically, replicating the patterns, rules and trends present within the target domain; we describe such a constraint-based simulator here. At the heart of our approach are templates - abstract, reusable provenance patterns within a domain that may be instantiated by concrete substitutions. Domain constraints are configurable and solved using a Constraint Satisfaction Problem solver to produce viable substitutions. Workflows are represented by sequences of templates using probabilistic automata. The simulator is fully integrated within our template-based provenance server architecture, and we illustrate its use in the context of a clinical trials software infrastructure.
UR - http://www.scopus.com/inward/record.url?scp=85053858459&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-98379-0_6
DO - 10.1007/978-3-319-98379-0_6
M3 - Conference contribution
AN - SCOPUS:85053858459
SN - 9783319983783
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 71
EP - 83
BT - Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, Proceedings
A2 - Belhajjame, Khalid
A2 - Gehani, Ashish
A2 - Alper, Pinar
PB - Springer Verlag
T2 - 7th International Provenance and Annotation Workshop, IPAW 2018
Y2 - 9 July 2018 through 10 July 2018
ER -