Data Registration

Overview
Questions:
  • What is data registration?

  • Why should you upload your data to a data repository?

  • What types of data repositories are there?

  • How to choose the right repository for your dataset?

Objectives:
  • Describe why indexed data repositories are important.

  • Summarise resources enabling you to choose a searchable repository.

Time estimation: 40 minutes
Published: Mar 26, 2024
Last modification: Sep 27, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00431
Revision: 3

This tutorial defines data registration and describes the ways in which it can be achieved. Learners will be able to describe why indexed data repositories are important and to summarise resources that help in choosing a searchable repository.

Agenda

In this tutorial, we will cover:

  1. Data registration and the FAIR Principles
  2. What is data deposition and registration?
  3. Why should I upload my data to a data repository?
  4. Types of data repository
  5. Useful Resources

Data registration and the FAIR Principles

Data registration relates to the following 3 FAIR Principles (Table 3.1).
We will discuss and signpost these in this Episode.

The FAIR Guiding Principles

To be Findable:
  • F1. (meta)data are assigned a globally unique and persistent identifier
  • F2. data are described with rich metadata (defined by R1 below)
  • F3. metadata clearly and explicitly include the identifier of the data it describes
  • F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:
  • A1. (meta)data are retrievable by their identifier using a standardized communications protocol
  • A1.1. the protocol is open, free, and universally implementable
  • A1.2. the protocol allows for an authentication and authorization procedure, where necessary
  • A2. metadata are accessible, even when the data are no longer available

To be Interoperable:
  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
  • I2. (meta)data use vocabularies that follow FAIR principles
  • I3. (meta)data include qualified references to other (meta)data

To be Reusable:
  • R1. (meta)data are richly described with a plurality of accurate and relevant attributes
  • R1.1. (meta)data are released with a clear and accessible data usage license
  • R1.2. (meta)data are associated with detailed provenance
  • R1.3. (meta)data meet domain-relevant community standards

Table 3.1: The 15 FAIR Guiding Principles. Principles relating to data registration in black.

What is data deposition and registration?

Data deposition and registration refer to the process of uploading data to a searchable resource and providing appropriate metadata to facilitate its discoverability. A data repository, for example, accepts uploads of data and metadata and can enable them to be discovered, preserved and accessed. Here we use the general term data repository to describe any online storage location that can host deposited (meta)data.

In the context of FAIR, data deposition relates to a number of the Guiding Principles. The first is that “(meta)data are registered or indexed in a searchable resource” (FAIR Principle F4). Searchable (indexed) metadata enable humans and computers to query and discover data of interest, though what can be found depends on what is indexed. Here, indexing refers to a process that occurs within the architecture of the data repository (local indexing), in which metadata are organised so that they can be queried based on defined fields. It is worth noting that community resources focused on a particular domain (for example, the human database in Ensembl) are better indexed for that community than generic repositories (for example, Zenodo), which may not index community-specific components and may focus on higher-level metadata.

Indexing by an internet search engine is another example. Google and other search engines, such as Yahoo and Yandex, have agreed on a vocabulary (schema.org) for marking up web pages, which are then ‘scraped’ and indexed. Although this vocabulary was originally intended for commercial products, community efforts to facilitate discipline-specific indexing are underway (for example, Bioschemas).
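The idea of local indexing, where metadata are organised so that they can be queried on defined fields, can be illustrated with a minimal Python sketch. This is not how any particular repository is implemented: the records, the field names and the helper functions `build_index` and `query` are all hypothetical, chosen only to show why discovery depends on which fields are indexed.

```python
from collections import defaultdict

# Hypothetical metadata records; ids and values are invented for illustration.
RECORDS = [
    {"id": "rec-1", "organism": "Homo sapiens", "assay": "RNA-seq"},
    {"id": "rec-2", "organism": "Mus musculus", "assay": "RNA-seq"},
    {"id": "rec-3", "organism": "Homo sapiens", "assay": "ChIP-seq"},
]


def build_index(records, fields):
    """Map each (field, lower-cased value) pair to the set of matching record ids."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is not None:
                index[(field, value.lower())].add(record["id"])
    return index


def query(index, **criteria):
    """Return the ids of records that match every field=value criterion."""
    matches = [
        index.get((field, value.lower()), set())
        for field, value in criteria.items()
    ]
    return set.intersection(*matches) if matches else set()


# Only the "organism" field is indexed here, as in a generic repository
# that indexes high-level metadata but not community-specific fields.
INDEX = build_index(RECORDS, fields=["organism"])

print(sorted(query(INDEX, organism="homo sapiens")))  # ['rec-1', 'rec-3']
print(sorted(query(INDEX, assay="RNA-seq")))          # [] because "assay" was not indexed
```

The second query returns nothing even though two records describe RNA-seq data, because that field was never indexed. Rebuilding the index with `fields=["organism", "assay"]` makes the same records discoverable by assay.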

Why should I upload my data to a data repository?

Data repositories are generally preferred to file storage systems (such as Dropbox) or sharing data on an ad hoc basis since they often better support FAIR best practice. Repositories will assign citable, “globally unique and persistent identifiers” (FAIR Principle F1) to data, and in some cases enable a data submitter to apply a data usage licence through association with the resource (FAIR Principle R1.1).
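To make the identifier (F1) and licence (R1.1) concrete, the sketch below builds a machine-readable description of a dataset using the schema.org vocabulary mentioned earlier. It is an illustration under stated assumptions rather than the output of any specific repository: the helper `dataset_jsonld`, the dataset name and the DOI are hypothetical, while `Dataset`, `identifier` and `license` are existing schema.org terms.

```python
import json


def dataset_jsonld(name, description, doi, license_url):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # A DOI expressed as a resolvable URL acts as the persistent identifier (F1).
        "identifier": f"https://doi.org/{doi}",
        # A licence URL states the usage conditions (R1.1).
        "license": license_url,
    }


record = dataset_jsonld(
    name="Example proteomics dataset",
    description="Placeholder description used only for illustration.",
    doi="10.1234/example.5678",  # hypothetical DOI, not a real dataset
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

print(json.dumps(record, indent=2))
```

A description like this, embedded in a dataset landing page, is the kind of markup that search engines can read and index.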

Data repositories, among other tools, support the creation of metadata through curation interfaces that provide drop-downs and text fields for metadata entry and validation. In a domain- or data-specific repository, such as BioStudies shown in the previous Episode, drop-downs for metadata curation often link to community-endorsed vocabularies (FAIR Principle R1.3).
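The effect of a drop-down backed by a controlled vocabulary can be sketched as a simple validation step. The vocabulary, the field names and the `validate` function below are hypothetical and much smaller than any real community standard; the sketch only shows how restricting values to agreed terms catches inconsistent metadata at submission time.

```python
# Hypothetical controlled vocabulary: field name -> allowed terms.
VOCABULARY = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "assay_type": {"RNA-seq", "ChIP-seq", "expression microarray"},
}


def validate(record, vocabulary):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, allowed in vocabulary.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"{field}: '{value}' is not an allowed term")
    return problems


good = {"organism": "Homo sapiens", "assay_type": "RNA-seq"}
bad = {"organism": "human"}  # free-text value and a missing field

print(validate(good, VOCABULARY))  # []
print(validate(bad, VOCABULARY))
# ["organism: 'human' is not an allowed term", 'missing field: assay_type']
```

A drop-down prevents the first kind of problem by construction, since the submitter can only pick from the allowed terms.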

Types of data repository

General public data repositories, such as Zenodo, are multidisciplinary and permit registration and upload of both open and closed access (meta)data. Metadata curation is relatively high level, and the metadata are made searchable via indexing. In the Life Sciences, Zenodo is often used to publish supplementary data for articles and to provide citable URLs for them, usually in instances where a domain repository does not exist.

Institutional repositories work similarly and provide an online archive for hosting, indexing and preserving research output specific to an institution. They typically house more than data, often serving as a repository for documents and articles as well. Institutions either run their own locally supported systems or buy into commercial solutions.

Discipline-specific repositories cater for particular communities and data types, and typically provide web interfaces for annotating rich metadata at the point when data are submitted. Examples include the suite of data repositories at the European Bioinformatics Institute (EBI), where rich metadata creation is supported by teams of curators.

Question

An example of a discipline-specific repository is the ArrayExpress database. ArrayExpress stores data from high-throughput functional genomics assays, such as RNAseq, ChIPseq and expression microarrays. The data submission interface of ArrayExpress is called Annotare. Without creating a login, what help is given to a person looking to submit a dataset for the first time?

Both a submission guide and a YouTube video are provided.

Question

Finding more help on how to upload data to specific repositories

The FAIR Cookbook is an online, open resource housing specific ‘how to’ guides, or recipes. Use the FAIR Cookbook to find two recipes: one for “depositing data to Zenodo” and one for “registering datasets with Wikidata”.

Open the Findability drop-down in the left-hand banner to find the following recipes: “Depositing to generic repositories - Zenodo use case” and “Registering Datasets in Wikidata”.

Question

Choosing the right data repository for your data

FAIRsharing helps researchers identify suitable data repositories, standards and policies relating to their data. Use this resource to identify data repositories for proteomic data.

Access the search bar for the FAIRsharing database registry. Search for proteomics and select “repository” under “Record Type”.

A screenshot of FAIRsharing showing the results for a search for all proteomics repositories.

Figure 1: FAIRsharing allows you to search for specific record types that are relevant for your area.

Useful Resources