Wikidata as a Digital Preservation Knowledgebase

By Katherine Thornton posted 10-27-2016 13:57


Katherine Thornton, CLIR postdoctoral fellow in data curation at Yale University Library, coauthored this piece with Euan Cochrane, digital preservation manager at Yale University Library. It is a condensed version of a longer post published by the Open Preservation Foundation. 

We’re exploring Wikidata, the (relatively new) Wikipedia for data, as a knowledge base for digital preservation information and would appreciate feedback and involvement.

At Yale University Library we are beginning a new program of work (with funding from both CLIR and IMLS) to systematically preserve software in support of the long-term preservation of our digital collections. One goal of this work is to associate every digital object under our management with a representative interaction environment: a representative set of the software that would have been used to interact with the content during the period when it was first created and used. Through the use of emulation tools and services, such as the bwFLA Emulation as a Service (EaaS) technology, we aim to make all preserved digital objects accessible regardless of the computing environment from which the end user is accessing them.

An initial stumbling block in this work has been the lack of comprehensive cataloging tools, databases, and standards for documenting software, and the inability to connect that type of documentation to other digital preservation information such as file format databases and identification tools. This blog post outlines a possible approach to resolving the issue by implementing a collaborative, trustworthy, and transparent technical information management system within the Wikidata framework.

Implementing TOTEM in Wikidata

TOTEM is the acronym for the Trustworthy Online Technical Environment Metadata Registry, created by Janet Delve and David Anderson. The purpose of TOTEM is to establish a standardized set of attributes and relationship descriptor terms to record information about the interrelationships between software and hardware.

To implement TOTEM in Wikidata we would need to propose properties for all of the attributes described in the TOTEM framework that are not yet available as properties. For example, TOTEM has “processor speed,” “bit width,” “RAM,” “ROM,” and “motherboard” as sub-elements to describe hardware configurations we may want to represent. These properties do not currently exist in Wikidata and will need to be proposed.
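
As a rough illustration of the bookkeeping this implies, the sketch below tracks which TOTEM hardware sub-elements still lack a Wikidata property. The attribute names come from the list above; the mapping structure and the function name are hypothetical, and no real property identifiers are assumed.

```python
from typing import Dict, List, Optional

# TOTEM hardware sub-elements named in this post.
TOTEM_HARDWARE_ATTRIBUTES = (
    "processor speed",
    "bit width",
    "RAM",
    "ROM",
    "motherboard",
)

# Hypothetical tracking table: attribute name -> Wikidata property ID.
# None means no property exists yet, which is the situation described above.
WIKIDATA_PROPERTY_FOR: Dict[str, Optional[str]] = {
    name: None for name in TOTEM_HARDWARE_ATTRIBUTES
}


def attributes_needing_proposals(mapping: Dict[str, Optional[str]]) -> List[str]:
    """Return the attributes that have no Wikidata property assigned yet."""
    return [name for name, pid in mapping.items() if pid is None]


if __name__ == "__main__":
    for name in attributes_needing_proposals(WIKIDATA_PROPERTY_FOR):
        print(f"Needs a property proposal: {name}")
```

As proposals are accepted, each entry would be updated with the identifier the community assigns, and the list of outstanding proposals shrinks accordingly.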

Properties can be proposed by any Wikidata user. There is a template for property proposals as seen in the image.

[Image: Wikidata property proposal template (Runeberg.jpg)]

After properties are proposed, the Wikidata community has the opportunity to ask questions, discuss aspects of the proposal, and vote to support or oppose it. The following image is a screenshot of a property proposal discussion.

[Image: screenshot of a property proposal discussion (fig-2.jpg)]

Because Wikidata is a wiki implemented in MediaWiki and Wikibase software, it is fully versioned and all edits are available for review. This means that all discussions among editors can take place on-wiki and remain in the context of the content to which the editors are referring. A versioned system provides value to the digital preservation community in that the work that goes into refining concepts and tools will remain available for anyone to consult in the future. The fact that these discussions would be taking place on a public wiki also means that people from different organizations could collaborate in this forum.
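
To make the versioning point concrete, here is a minimal sketch of how the edit history of a page could be requested through the standard MediaWiki web API that Wikidata exposes. The function name is hypothetical, and the item used in the usage note (Q42) is only an arbitrary example. The sketch builds the request URL and does not contact the network.

```python
from urllib.parse import urlencode

# MediaWiki API entry point for Wikidata.
API = "https://www.wikidata.org/w/api.php"


def revision_history_url(title: str, limit: int = 5) -> str:
    """Build a MediaWiki API URL listing the most recent revisions of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",  # who edited, when, and the edit summary
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


if __name__ == "__main__":
    print(revision_history_url("Q42"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON list of recent edits with their authors and summaries. The same request works for discussion pages by passing the page title instead of an item identifier.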

Example SPARQL Queries

The Wikidata SPARQL endpoint maintains a list of example queries (available from the “Examples” button). Once a SPARQL query has been written, it can be reused by others to find updated results. A member of the digital preservation community has written a SPARQL query to answer questions such as:

  1. I want to know all formats in Wikidata that don’t have property ‘y’
  2. I want all open source software that can render format ‘x’ and runs on operating system ‘z’

Other members of the community could then reuse these queries to quickly and easily leverage the infrastructure to answer relevant questions. We could all benefit from the work of a few SPARQL query writers and be able to answer many different types of data-driven questions.
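
The two questions above can be sketched as parameterized queries. This is a rough sketch under stated assumptions, not the community member's actual query: the function names are hypothetical, the identifiers are recalled from present-day Wikidata and should be verified before use (P31 "instance of", P1072 "readable file format", P306 "operating system"), and modelling "open source" as an instance-of statement is an assumption, since it could equally be expressed through a license property. The code only builds query text and a request URL for the public endpoint, so it runs without network access.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint; the wd: and wdt: prefixes are predefined there.
ENDPOINT = "https://query.wikidata.org/sparql"


def formats_missing_property(class_qid: str, property_pid: str, limit: int = 100) -> str:
    """Question 1: instances of class_qid that have no statement for property_pid."""
    return f"""
SELECT ?format ?formatLabel WHERE {{
  ?format wdt:P31 wd:{class_qid} .
  FILTER NOT EXISTS {{ ?format wdt:{property_pid} ?value . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def open_source_renderers(format_qid: str, os_qid: str,
                          open_source_class_qid: str, limit: int = 100) -> str:
    """Question 2: software that reads a format and runs on an operating system.

    Assumption: open source software is marked as an instance of
    open_source_class_qid; property IDs should be checked against Wikidata.
    """
    return f"""
SELECT ?software ?softwareLabel WHERE {{
  ?software wdt:P1072 wd:{format_qid} ;
            wdt:P306 wd:{os_qid} ;
            wdt:P31 wd:{open_source_class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return f"{ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"


if __name__ == "__main__":
    # Q235557 (file format) and P2748 (PRONOM identifier) are assumed IDs.
    print(query_url(formats_missing_property("Q235557", "P2748")))
```

Because the identifiers are parameters, one person can write and check the query pattern once, and others can rerun it with different formats, properties, or operating systems as the data in Wikidata grows.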

Next Steps

We’re currently seeking feedback on this proposed approach. If no major problems are found, we will pursue it at Yale University Library and will soon begin proposing properties to build out the digital preservation model in Wikidata. We’ve created a Google group, named “wikidata-for-digital-preservation,” where we will share updates about our work and invite collaboration on the process of proposing new properties for Wikidata. If you are interested in participating, please visit the group and request to be added. We would love to have a wide range of voices contributing to this proposed approach. As we move forward, we will ask for collaboration on developing the properties needed to describe software, hardware, operating systems, emulated environments, file formats, etc. We will also seek input on any portal development work we initiate.