All GTN training data are now automatically mirrored via Onedata

Author(s): Polina Polunina
esg-wp4 esg

Posted on: 3 June 2024
PURL: https://gxy.io/GTN:N00081

As part of the EuroScienceGateway project, and in cooperation with Onedata and EGI, we now provide all GTN training data on publicly accessible cloud storage. These training datasets are curated, small but meaningful for educational purposes, and comprise 1530 files with a total size of 170 GB. They are an invaluable set of resources for everyone dealing with data science and training. Please thank the more than 350 contributors to the GTN.

What does this mean for you?

  • Teachers: If you’re a teacher contributing to the GTN, you can now be sure that the datasets you use in your materials are even more accessible and easier to use.
  • GTN Users: When following training materials, you’ll now find it easier to access those same datasets in new locations.
  • Galaxy Admins: If you’re running a Galaxy server, you can now more easily integrate the GTN data into your server without taking up unnecessary storage space, while still making it available to all your users.

Accessing GTN Data in the Cloud

GTN training data is always accessible, annotated and linked for every tutorial. Usually, it’s stored on Zenodo and referenced via a DOI. You can now access all GTN training data using several methods:

  1. Onedata Share — access without authentication:

    a. Visit the public share link to browse and download the data via the Onedata Web UI.

    b. Use the public REST API to access the data; on the share page (see above) you will find ready-to-use curl examples by right-clicking on a file/directory and choosing the Information context menu.

  2. Galaxy Server integration: access the data on the European Galaxy server. Go to the “Upload data” button, select “Choose remote files,” and navigate to the GTN repository.

  3. Configure your own Galaxy server: to include the GTN data in your Galaxy server, use the following configuration:

     - type: onedata
       id: gtn_public_onedata
       label: GTN training data
       doc: Training data from the Galaxy Training Network (powered by Onedata)
       # The following token is a public, read-only token that can be shared.
       accessToken: "MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg"
       onezoneDomain: "datahub.egi.eu"
    
  4. Onedata clients — access the data using the public read-only access token and Oneclient (local POSIX mount) or OnedataFS (PyFilesystem interface), e.g.:

     mkdir ~/oneclient
     oneclient \
         -H plg-cyfronet-01.datahub.egi.eu \
         -t MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg \
         ~/oneclient
     ls ~/oneclient/GTN\ data
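Once mounted, the data behaves like an ordinary directory tree, so any tool that reads local files can work with it. The following is a minimal Python sketch, assuming the mount point `~/oneclient` and the `GTN data` directory from the commands above; the helper name `summarize_tree` is hypothetical and not part of any Onedata tooling. It counts the files and sums their sizes:

```python
from pathlib import Path


def summarize_tree(root: Path) -> tuple[int, int]:
    """Return (number of files, total size in bytes) found under root."""
    count = 0
    total = 0
    for path in root.rglob("*"):
        if path.is_file():
            count += 1
            total += path.stat().st_size
    return count, total


if __name__ == "__main__":
    # Assumed mount location, matching the oneclient command above.
    gtn_root = Path.home() / "oneclient" / "GTN data"
    if gtn_root.is_dir():
        n_files, n_bytes = summarize_tree(gtn_root)
        print(f"{n_files} files, {n_bytes / 1e9:.1f} GB")
    else:
        print(f"{gtn_root} is not mounted")
```

Only file metadata is read here, so the walk does not transfer the file contents themselves.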
    

What is the GTN Downloader?

The GTN Downloader is a Python script that makes it easier to access and organize data from the Galaxy Training Network (GTN) by automating the download of data from GTN tutorials. It goes through the tutorials in the GTN repository, finds the data-library.yaml files, and creates a structured directory based on the tutorial names and file contents.
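The discovery step can be sketched roughly as follows. This is not the GTN Downloader's actual code: it assumes a local checkout of the GTN repository laid out as `topics/<topic>/tutorials/<tutorial>/data-library.yaml`, and the function names and the `<topic>/<tutorial>` output layout are hypothetical. It also derives directories from names only, whereas the real script additionally uses the file contents.

```python
from pathlib import Path


def find_data_libraries(repo_root: Path) -> list[Path]:
    """Return every data-library.yaml under topics/*/tutorials/*/."""
    pattern = "topics/*/tutorials/*/data-library.yaml"
    return sorted(repo_root.glob(pattern))


def target_dir(library_file: Path, repo_root: Path, out_root: Path) -> Path:
    """Map a data-library.yaml to <out_root>/<topic>/<tutorial>."""
    # Relative path: topics/<topic>/tutorials/<tutorial>/data-library.yaml
    parts = library_file.relative_to(repo_root).parts
    topic, tutorial = parts[1], parts[3]
    return out_root / topic / tutorial
```

Calling `find_data_libraries(Path("training-material"))` on a checkout and passing each result to `target_dir` yields one output directory per tutorial that ships a data library.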

Key Features:

  • Automated Data Download: The script finds data-library.yaml files in the GTN repository and downloads the associated data files.
  • Structured Organization: It creates directories based on the tutorial names and the information in the data-library.yaml files, so the files are organized.
  • Download Summary: It generates a download-summary.tsv file, which includes metadata about the downloaded files, a download report (error, success, already downloaded), and the overall size of the files.
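To illustrate the download summary, here is a minimal sketch that writes a tab-separated report with one row per file and returns the overall size. The three status values come from the description above; the column names, the `DownloadRecord` class and the `write_summary` function are assumptions made for illustration, and the real download-summary.tsv may be structured differently.

```python
import csv
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DownloadRecord:
    tutorial: str
    url: str
    status: str  # "success", "error", or "already downloaded"
    size_bytes: int


def write_summary(records: list[DownloadRecord], out_file: Path) -> int:
    """Write one TSV row per file and return the total size in bytes."""
    total = 0
    with out_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["tutorial", "url", "status", "size_bytes"])
        for rec in records:
            writer.writerow([rec.tutorial, rec.url, rec.status, rec.size_bytes])
            total += rec.size_bytes
    return total
```

Keeping the status in its own column makes it easy to filter for failed downloads before re-running the script.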

Seamless Integration with Onedata

In addition to local downloads, the GTN Downloader can upload data to Onedata, a distributed data management platform. This integration ensures that the latest GTN data is always available to users.

Automated Workflow with GitHub CI/CD:

  • Automated Workflow: A GitHub Actions workflow runs once a week on weekends to download the latest data from the GTN tutorials and upload it to Onedata.
  • Environment Setup: The workflow sets up necessary environment variables and installs dependencies, including Oneclient, the Onedata POSIX client.
  • Data Upload: After downloading the data, the workflow uploads it to Onedata, making it publicly accessible.
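Since the workflow installs Oneclient, one plausible way to perform the upload is to copy the downloaded tree into a mounted Onedata space. The sketch below assumes exactly that and is not taken from the actual workflow; the function name `sync_tree` is hypothetical. A writable mount would also need a token with write access, because the public token shown earlier is read-only.

```python
import shutil
from pathlib import Path


def sync_tree(src_root: Path, dst_root: Path) -> list[Path]:
    """Copy files from src_root to dst_root, skipping unchanged ones.

    Returns the relative paths that were actually copied.
    """
    copied = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        # Treat a file as unchanged if it already exists with the same size.
        if dst.is_file() and dst.stat().st_size == src.stat().st_size:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(rel)
    return copied
```

Comparing sizes is a cheap heuristic that keeps weekly runs from re-uploading unchanged files; a checksum comparison would be stricter but requires reading every file.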

Funding

These individuals or organisations provided funding support for the development of this resource

