Insight: UKCRIC workshops explore the feasibility of Labour’s proposed National Data Library

UKCRIC ran two workshops that addressed the challenges of and opportunities for data sharing (1) between industry and academia and (2) in urban observatory settings
UKCRIC General Manager (University of Birmingham)

The workshops, which considered the challenges and opportunities of data sharing, were commissioned by DAFNI for its Data Infrastructure for National Infrastructure (DINI) initiative. DINI was, in turn, commissioned by DSIT (Department for Science, Innovation and Technology) as part of the Labour Government’s commitment to create a National Data Library to help solve data-driven research challenges. DAFNI’s final report to DSIT is available here and you can find out more about the other work commissioned for the DINI initiative here.

UKCRIC ran two workshops that addressed the challenges of and opportunities for data sharing (1) between industry and academia and (2) in urban observatory settings – paying particular attention to water, energy and transport. Both workshops focussed upon the same central question: how can the UK transform its data into research assets that can be used to benefit society? The outcomes are described in this report.

Both the industry representatives who participated in the first workshop and the academics who participated in the second face pressures that influence their ability to share data, but they also recognise the opportunities afforded by sharing data and see the potential in taking a ‘data-sharing first’ approach to their work.

Highlights from the workshops:

  • Increasing digitalisation has led to an increased need to critique the nature of data.
  • The nature of data is different from the nature of data sharing. Data sharing incorporates metadata and comprises, amongst other things: the purpose of the data, trusting the data, trusting the source of the data, and provenance of the data.
  • If data sharing relationships are to be productive, those involved must continually ask: what is data, what is it going to be used for, and how is it going to be contextualised?
  • There is increasing recognition that the question being asked of data is not always known at the time the data are collected. Without a purpose, though, the sharing of data is unlikely to succeed.
  • For artificial intelligence (AI) and machine learning (ML), having an unstructured data pool can be more useful than a structured data pool, especially in terms of looking for the patterns that are outwith any structure that might be imposed upon the data by those who gathered it.
  • Traditional perspectives of data sharing as a linear process are being challenged by new perspectives that frame data sharing as the creation of new data incarnations.
  • A distinction is to be made between pre-commercial and post-commercial data sharing. Pre-commercial data sharing addresses issues faced by an entire group or sector, the solving of which benefits the whole group or sector. Post-commercial data sharing has the potential to deliver market advantage to a specific organisation. Each has a different benefit profile, but the distinction between the two is not always clear or considered in data sharing paradigms.
  • For continuous and near-continuous data, there are multiple financial, equipment, computing and time costs in capturing, storing and analysing very large amounts of data. The benefit of having the data must outweigh these costs.
  • Data sharing places a burden on those sharing the data that includes time and financial costs. Simply handing over data is not sufficient for data sharing.
  • Data sharing also places a burden on the planet. Digital data, digitally sharing data, cloud computing, AI, machine learning, digital and cloud backups, and so on all have a carbon cost.
  • The perceived risks of sharing data often don’t materialise in practice.
  • During each workshop, recommendations were made for the functionality of a National Data Library. These went beyond simply collecting and cataloguing data to include custodianship, signposting, curating, protecting against poor-quality data, establishing benchmarks, and generating insights and analytics. They also extended to brokering data sharing agreements and supporting dialogue between data suppliers and data users, as well as advancing the science and practice of data sharing, including shaping policy, regulatory and other drivers.

Overall, those who participated in the workshops see value in establishing a National Data Library, and believe that the challenges to doing so are surmountable and that now is the right time to tackle them. They offer the following advice to those tasked with delivering the National Data Library:

  • It cannot be assumed that improved access to data will lead to ‘the right answer’. It could simply increase and amplify spurious outcomes.
  • There is substantial work still to be done to establish confidence in data. This is more than data ontology (a description of data and its structure) and more than metadata (which is often bespoke to specific user communities). The emerging field of ‘computational epistemology’ has a role to play. This term has been coined to describe the data needed about data that provides confidence in the data. It includes when, where, and how data are collected, who asserts the data to be true, whether there are real-life examples of the data, and so on.
  • The data cloud could take the form of a centralised data depository, a federation of knowledge stores (favoured by the workshop participants), or a catalogue of available data and pointers to their locations. Whichever is the case, investment in hardware, software and people will be needed.
  • The proposed National Data Library, and data sharing infrastructure more generally, must be recognised as a class of infrastructure that falls under the remit of the National Infrastructure and Service Transformation Authority (NISTA). It must be a shared resource.
  • Last, but certainly not least, the quantity of data and the costs of creating a national data cloud should not be underestimated.