Yet Another Paradigm Shift…
A Blog post by Wim Hugo (WDS Scientific Committee member)
At the recently completed European Geosciences Union General Assembly 2016, I was one of the participants in a double session called "20 years of persistent identifiers – where do we go next?". Apart from reviewing the obvious elements, issues, and benefits of persistent identification—and agreeing on the success of the Research Data Alliance (RDA) Working Group on Data Citation and their excellent set of 14 guidelines for implementation—we also had a number of robust discussions; not least because Vienna was an airport too far for some of the presenters, leaving us with free time.
Firstly, most of us agreed that being able to reproduce the result of queries (and potentially other transformations or processes) applied to data or subsets of the data was the hardest of the guidelines to implement.
One can deal with this by keeping archived copies of all such query and transformation results (painless to implement, but potentially devastating from a storage provisioning perspective), or one could opt to store the query and transformation instructions themselves, with a view to reproducing the query or transformation result at some point in the future.
This second option equates to always starting with base ingredients (egg yolks, lemon juice, butter, and maybe mustard or cayenne) and storing them together with a recipe (in this case for Hollandaise Sauce). This option is also painless to implement until there is a change in the underlying database schema, code, or both. In that case one has to maintain backward compatibility (potentially almost ad infinitum) so that historical operations continue to work, or maintain working copies of all historical releases so that a query or transformation result can be reproduced at some point in the future. Clearly this is not very practical.
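To make the recipe option concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is an illustration under my own assumptions, not a description of any particular repository's implementation: the `query_store` and `readings` tables, the column names, and the helper functions are all hypothetical. The idea is to keep only the query text, a timestamp, and a checksum of the result, and then to see what happens to the recipe when the schema changes underneath it.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def result_checksum(rows):
    """Hash a result set so a later re-execution can be compared against it."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def store_query(conn, sql):
    """Run a query and keep the recipe (SQL, timestamp, checksum), not the rows."""
    rows = conn.execute(sql).fetchall()
    cur = conn.execute(
        "INSERT INTO query_store (sql_text, executed_at, checksum) VALUES (?, ?, ?)",
        (sql, datetime.now(timezone.utc).isoformat(), result_checksum(rows)),
    )
    return cur.lastrowid


def reproduce(conn, query_id):
    """Re-run a stored query and report whether the result still matches."""
    sql, checksum = conn.execute(
        "SELECT sql_text, checksum FROM query_store WHERE id = ?", (query_id,)
    ).fetchone()
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        return False  # the recipe no longer runs against the current schema
    return result_checksum(rows) == checksum


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE query_store ("
        "id INTEGER PRIMARY KEY, sql_text TEXT, executed_at TEXT, checksum TEXT)"
    )
    # Hypothetical data table, used only for illustration.
    conn.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)", [("A", 11.5), ("B", 14.0)]
    )
    # ORDER BY keeps the row order, and therefore the checksum, stable.
    qid = store_query(conn, "SELECT station, temp_c FROM readings ORDER BY station")
    before = reproduce(conn, qid)

    # Simulate a schema change: the column the recipe relies on is renamed.
    conn.execute("DROP TABLE readings")
    conn.execute("CREATE TABLE readings (station TEXT, temperature REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)", [("A", 11.5), ("B", 14.0)]
    )
    after = reproduce(conn, qid)
    return before, after
```

Calling `demo()` reproduces the stored query successfully before the schema change and fails afterwards, even though the data itself is unchanged. The checksum does not fix anything, but it does at least tell us that the Hollandaise no longer tastes the same.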
By the way, there were some excellent ideas on how to record recipes systematically: Lesley Wyborn presented work on defining an ontology whereby queries and transformations could be documented as an automated script, and Edzer Pebesma and colleagues are conceiving an algebra for spatial operations with much the same objective in mind.
This approach, of course, requires an additional consensus: at what point do we store results as a new dataset instead of executing a potentially longer and longer list of processes on original data? There must be some value to buying Hollandaise Sauce off the shelf for our Eggs Benedict—at least some of the time.
Secondly, all of this trouble serves one or both of two objectives: reliably finding the data referenced by a citation (via a digital object identifier or other persistent identifier), and supporting reproducibility in science. This last point was enthusiastically agreed on by most (one or two abstained, and there was one dissenter).
This assertion set me thinking about the process of reproducing results in the new world of data-intensive science, a world in which code and systems are increasingly distributed and reliant on external vocabularies, lookups, services, and libraries (that may themselves be referenced by persistent identifiers). None of these resources, which may have a significant effect on the result of a process should they change, are under the control of the code running in my environment. Which brings us to Claerbout’s Principle:
"The scholarship does not only consist of theorems and proofs but also (and perhaps even more important) of data, computer code and a runtime environment which provides readers with the possibility to reproduce all tables and figures in an article."
Easier said than done. We can, of course, insist on proper configuration control and versioning of all components, internal and external (as we should in a world of formal systems engineering), but I am not convinced that the research community is ready for this level of maturity, which is typically reserved for moon rockets and defense procurement and comes with orders of magnitude in additional costs. Perhaps more importantly, the scientists writing code are not going to invest time and effort to document, version, and package their code to a standard that supports reproducibility. Hence, the code that we use to transform our data, whether we like it or not, will not automatically produce the same result at some unspecified point in the future, least of all if it has external web-based dependencies (which, in turn, may also have external dependencies). There is some utility in packaging entire runtime environments (much in the way that one could persist the result of a query or transformation), but this does not solve the problem of external dependencies.
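As a rough illustration of what even lightweight configuration control might look like, the Python sketch below records a manifest of the runtime environment together with a checksum of each external resource a process consumed. Everything here is an assumption made for the example: the function names are mine, the vocabulary URL is a placeholder, and the resource bodies are passed in as bytes rather than fetched, so that the sketch runs without a network connection. Note that it only detects that an external dependency has drifted; it cannot bring the old version back.

```python
import hashlib
import platform
import sys
from importlib import metadata


def content_digest(payload: bytes) -> str:
    """Checksum of an external resource as it looked when we used it."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(packages, external_resources):
    """Record interpreter, platform, package versions, and resource checksums.

    `external_resources` maps an identifier (for example a URL) to the bytes
    that were retrieved from it at processing time.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "external": {
            key: content_digest(body) for key, body in external_resources.items()
        },
    }


def changed_resources(old, new):
    """List the external resources whose content differs between two manifests."""
    return sorted(
        key
        for key, digest in new["external"].items()
        if old["external"].get(key) != digest
    )


# Hypothetical external vocabulary, identified by a placeholder URL.
VOCAB = "https://vocab.example.org/units"
then = build_manifest(["numpy"], {VOCAB: b"degC;degF"})
now = build_manifest(["numpy"], {VOCAB: b"degC;degF;K"})
drifted = changed_resources(then, now)
```

Here `drifted` contains the vocabulary URL, because the content behind it changed between the two runs. Storing such a manifest alongside a result is cheap, but it is a smoke alarm rather than a fire extinguisher.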
Which raises an interesting dilemma: in the world of linked open data, the semantic web, and open distributed processing, the state of the web at any point in time cannot be reproduced ever again—which may create significant issues for reproducible science if it uses any form of distributed code.
Not only that! As we rely more and more on processing enormous volumes of data by digital means, we will depend more and more on artificial intelligence, machine learning, and automated research. As the body of knowledge available to automated agents changes, so, presumably, will their conclusions and inferences.
So...we need a new consensus on what science means in the era of data-intensive, increasingly automated science: our rules, notions, and paradigms will soon be outdated.
Fitting subject for an RDA Interest Group, I would think.
Some interesting additional reading: