Persistent Identifiers

Author(s)	Robert Andrews, Nick Juty, Munazah Andrabi, Nicola Soranzo, Sara Morsy, Kellie Snow, Korneel Hens, Philippe Rocca-Serra, Laura Cooper, Xenia Perez Sitja, Andrew Mason, Branka Franicevic, Saskia Lawson-Tovey, Katarzyna Kamieniecka, Khaled Jum'ah, Krzysztof Poterlowicz

Overview
Questions:

What is a persistent identifier?

What is the structure of identifiers?

Why does your dataset need to have an identifier?

Objectives:

Explain the definition and importance of using identifiers.

Illustrate what persistent identifiers are.

Give examples of the structure of persistent identifiers.

Requirements:

Hands-on: FAIR in a nutshell

Hands-on: FAIR and its Origins

Hands-on: Metadata

Hands-on: Data Registration

Hands-on: Access

Time estimation: 30 minutes

Published: Mar 26, 2024

Last modification: Mar 27, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00434

Revision: 2

This tutorial defines the persistent identifier. Learners will be able to describe the structure of an identifier and explain its importance.

Agenda

In this tutorial, we will cover:

Persistent identifiers and the FAIR Principles

Using persistent identifiers (PIDs)

The structure of persistent identifiers

Useful Resources

Persistent identifiers and the FAIR Principles

Data Identifiers relate to the following 5 FAIR Principles (Table 5.1). We will discuss and signpost these in this episode.

The FAIR Guiding Principles
To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To be Reusable:
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards

Table 5.1: The 15 FAIR Guiding Principles. Principles relating to data identifiers are highlighted in black.

Using persistent identifiers (PIDs)

Identifiers are an important theme within the FAIR principles and are arguably foundational: they underpin two of the FAIR pillars, since they are crucial to the Findable (F) and Accessible (A) principles.

Identifiers are an eternal reference to a digital resource such as a dataset and its metadata. They provide the information required to reliably identify, verify and locate your research data.

Commonly, a persistent identifier is a unique record ID in a database, or a unique URL that takes a researcher to the data in question in a database. Persistent identifiers (PIDs) have to be unique so that each identifier refers to only one dataset. In addition to being unique, the identifier needs to be persistent. When depositing or hosting data, you should ensure that the longevity of this persistence meets your requirements, which may require reading the database’s policy on identifiers.

Since FAIR permits the withdrawal of data, the FAIR Principles combat the potential for broken URLs by stating: “Metadata are accessible, even when the data are no longer available.” (FAIR Principle A2). This means the link (PID) remains valid, displaying all the original metadata of the record even though the data is no longer available.
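The behaviour that Principle A2 describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class, the `resolve` function, the in-memory registry and the example PID are all hypothetical, and no real repository is being modelled. The point it shows is that withdrawing the data leaves the identifier resolvable and the metadata intact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical repository record: metadata plus an optional data payload."""
    pid: str
    metadata: dict
    data: Optional[bytes] = None

    def withdraw(self) -> None:
        # Remove the data but keep the PID and the metadata (a "tombstone").
        self.data = None


def resolve(registry: dict, pid: str) -> dict:
    """Return what a landing page for this PID would show."""
    record = registry[pid]  # the PID stays valid after withdrawal
    return {
        "pid": record.pid,
        "metadata": record.metadata,
        "data_available": record.data is not None,
    }


# Illustrative PID and metadata, not a real record.
record = Record(
    pid="https://example.org/dataset/0001",
    metadata={"title": "Example dataset", "creator": "A. Researcher"},
    data=b"raw measurements",
)
registry = {record.pid: record}

record.withdraw()
page = resolve(registry, record.pid)
print(page["data_available"])     # False
print(page["metadata"]["title"])  # Example dataset
```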

It is important to note that when you upload your data to a public repository, the repository will create this ID for you automatically.

According to How to FAIR, many resources can help you find databases that assign PIDs to your data. One of these resources is FAIRsharing, which we have already encountered in the previous episodes. FAIRsharing provides a list of databases grouped by domain and organisation.

The structure of persistent identifiers

To ensure that a PID is globally unique, commonly it is supplied as a unique URL. For the case of a record in a biological database, the use of a URL ensures that the database record ID is associated with the database name or often some derivation of this. This is often enough to ensure the uniqueness of the PID for any future scenario.

Commonly, for things like publications, a DOI is used for the PID, where DOI stands for Digital Object Identifier. An example is shown below where the PID is constructed from 3 pieces of information: the resolver service, the prefix (namespace) and the suffix (dataset ID).

Resolver service: the domain/service/institution hosting the PID, e.g. doi.org (https://www.doi.org)

Prefix: a unique number referring to the publisher. This is also known as the namespace.

Suffix: the unique dataset number

Figure 1: The structure of a DOI
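The three parts described above can be separated with a short Python sketch. The function name `split_doi_url` and the `DoiParts` type are hypothetical helpers introduced for illustration; the DOI used is the one listed in the References of this tutorial. The sketch assumes a plain DOI URL without query strings or fragments.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class DoiParts(NamedTuple):
    resolver: str  # the service hosting the PID
    prefix: str    # the namespace, referring to the publisher
    suffix: str    # the identifier of the individual object


def split_doi_url(url: str) -> DoiParts:
    """Split a DOI URL into resolver, prefix and suffix."""
    parsed = urlparse(url)
    resolver = f"{parsed.scheme}://{parsed.netloc}"
    # Split on the first "/" only, because a suffix may itself contain "/".
    prefix, sep, suffix = parsed.path.lstrip("/").partition("/")
    # DOI prefixes begin with "10."; anything else is rejected here.
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI URL: {url!r}")
    return DoiParts(resolver, prefix, suffix)


parts = split_doi_url("https://doi.org/10.1162/dint_a_00025")
print(parts.resolver)  # https://doi.org
print(parts.prefix)    # 10.1162
print(parts.suffix)    # dint_a_00025
```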

For biological data, PIDs usually require a resolver that can deal with multiple resolving locations, which means that if a database changes its name or internal structure, the new and old variations of the PID remain valid and take the user to the (meta)data. One commonly used resolver service is identifiers.org, which maintains a list of database namespaces (prefixes) as a persistent record. If a database changes its name, it keeps the original namespace operational or alternatively arranges for redirection from the original.
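The idea of keeping an old namespace valid after a rename can be sketched as follows. This is not how identifiers.org is implemented; the namespaces `olddb` and `newdb`, the URL pattern and the function `resolve_location` are all invented for illustration. It only shows that a redirect table lets both the old and the new form of a PID reach the same record.

```python
# Hypothetical registry: each current namespace maps to a URL pattern
# at the hosting database.
URL_PATTERNS = {
    "newdb": "https://newdb.example.org/record/{accession}",
}

# Retired namespaces are kept and redirected to their current name.
ALIASES = {
    "olddb": "newdb",
}


def resolve_location(namespace: str, accession: str) -> str:
    """Return the URL where the record currently lives."""
    current = ALIASES.get(namespace, namespace)
    if current not in URL_PATTERNS:
        raise KeyError(f"unknown namespace: {namespace!r}")
    return URL_PATTERNS[current].format(accession=accession)


old = resolve_location("olddb", "ABC123")
new = resolve_location("newdb", "ABC123")
print(old)         # https://newdb.example.org/record/ABC123
print(old == new)  # True
```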

Examples of using identifiers.org to construct a PID are given below for 2 different databases, Ensembl and WikiPathways, respectively. The namespace is given as the database name in these examples.

Figure 2: The structure of a PID using the identifiers.org resolver service
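A PID of this form can be assembled and taken apart with a short Python sketch. The helper names `make_pid` and `split_pid` are hypothetical, and the two accessions used (`ENSG00000139618` and `WP100`) are illustrative assumptions rather than the values shown in the figure. The sketch only builds strings and does not contact the resolver.

```python
RESOLVER = "https://identifiers.org"


def make_pid(namespace: str, accession: str) -> str:
    """Join resolver, namespace and record ID into one URL."""
    if not namespace or not accession:
        raise ValueError("namespace and accession must both be non-empty")
    return f"{RESOLVER}/{namespace}:{accession}"


def split_pid(pid: str) -> tuple:
    """Recover (namespace, accession) from a PID built by make_pid."""
    head = RESOLVER + "/"
    if not pid.startswith(head):
        raise ValueError(f"not an identifiers.org PID: {pid!r}")
    # The namespace ends at the first ":"; the rest is the record ID.
    namespace, sep, accession = pid[len(head):].partition(":")
    if not sep or not namespace or not accession:
        raise ValueError(f"missing namespace or accession: {pid!r}")
    return namespace, accession


ensembl_pid = make_pid("ensembl", "ENSG00000139618")
wikipathways_pid = make_pid("wikipathways", "WP100")
print(ensembl_pid)       # https://identifiers.org/ensembl:ENSG00000139618
print(wikipathways_pid)  # https://identifiers.org/wikipathways:WP100
print(split_pid(ensembl_pid))  # ('ensembl', 'ENSG00000139618')
```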

Question

Access the preprint for Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. This paper makes 10 recommendations for PID best practice.

Locate the lesson for “Do not reassign or delete identifiers”. Which PID is used as an example of a “tombstone page”? Which FAIR Principle does this relate to?

Tombstone PID: https://www.uniprot.org/uniprotkb/A0AV18

This relates directly to FAIR Principle A2: “Metadata are accessible, even when the data are no longer available”. Note that other FAIR Principles are also illustrated.

The tombstone page is “retrievable by their identifier using a standardized communications protocol” (FAIR Principle A1), which in this case is http(s). The page contains metadata which “are assigned a globally unique and persistent identifier” (FAIR Principle F1). The metadata also “clearly and explicitly include the identifier of the data it describes” (FAIR Principle F3), since the identifier itself is featured on the webpage.

Additionally, the webpage features a link to the updated UniProt record, so the metadata “include qualified references to other (meta)data” (FAIR Principle I3).

Useful Resources

More on identifiers: RDMkit and FAIR Cookbook
Nick Juty, Sarala M. Wimalaratne, Stian Soiland-Reyes, John Kunze, Carole A. Goble, Tim Clark; Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR. Data Intelligence 2020; 2 (1-2): 30–39. Juty et al. 2020

Key points

PIDs are eternal and unique.

PIDs are commonly URLs in the Life Sciences.

References

Juty, N., S. M. Wimalaratne, S. Soiland-Reyes, J. Kunze, C. A. Goble et al., 2020 Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR. Data Intelligence 2: 30–39. 10.1162/dint_a_00025

Glossary

FAIR: Findable, Accessible, Interoperable, Reusable

Citing this Tutorial

Robert Andrews, Nick Juty, Munazah Andrabi, Nicola Soranzo, Sara Morsy, Kellie Snow, Korneel Hens, Philippe Rocca-Serra, Laura Cooper, Xenia Perez Sitja, Andrew Mason, Branka Franicevic, Saskia Lawson-Tovey, Katarzyna Kamieniecka, Khaled Jum'ah, Krzysztof Poterlowicz, Persistent Identifiers (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/fair/tutorials/fair-persistent-identifiers/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{fair-fair-persistent-identifiers,
author = "Robert Andrews and Nick Juty and Munazah Andrabi and Nicola Soranzo and Sara Morsy and Kellie Snow and Korneel Hens and Philippe Rocca-Serra and Laura Cooper and Xenia Perez Sitja and Andrew Mason and Branka Franicevic and Saskia Lawson-Tovey and Katarzyna Kamieniecka and Khaled Jum'ah and Krzysztof Poterlowicz",
	title = "Persistent Identifiers (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/fair/tutorials/fair-persistent-identifiers/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

DASH UK

This Fellowship was funded through the ELIXIR-UK DaSH project as part of the UKRI Innovation Scholars: Data Science Training in Health and Bioscience call (DaSH). (MR/V038966/1). The project aims to embed Research Data Management (RDM) know-how into UK universities and institutes by producing and delivering training in FAIR data stewardship using ELIXIR-UK knowledge and resources.
