All GTN training data are now automatically mirrored via Onedata

Author(s): Polina Polunina
esg-wp4 esg

Posted on: 3 June 2024
PURL: https://gxy.io/GTN:N00081

As part of EuroScienceGateway, and in cooperation with Onedata and EGI, we are providing all GTN training data on publicly accessible cloud storage. These training datasets are curated, small but meaningful for educational purposes, and comprise 1530 files with a total size of 170 GB. They are an invaluable set of resources for everyone dealing with data science and training. Please thank the more than 350 contributors to the GTN.

What does this mean for you?

  • Teachers: If you’re a teacher contributing to the GTN, you can now be sure that the datasets you use in your materials are even more accessible and easier to use.
  • GTN Users: When following training materials, it’ll now be easier to access those same datasets in new locations.
  • Galaxy Admins: If you’re running a Galaxy server, you can now more easily integrate the GTN data into your server without taking up unnecessary storage space, while still making it available to all your users.

Accessing GTN Data in the Cloud

GTN training data is always accessible, annotated and linked for every tutorial. Usually, it is stored on Zenodo and referenced via a DOI. With the Onedata mirror, you can also access all GTN training data using any of the following methods:

  1. Onedata Share — access without authentication:

    a. Visit the public share link to browse and download the data via the Onedata Web UI.

    b. Use the public REST API to access the data; on the share page (see above) you will find ready-to-use curl examples by right-clicking on a file or directory and choosing Information from the context menu.

  2. Galaxy Server integration: access the data on the European Galaxy server. Go to the “Upload data” button, select “Choose remote files,” and navigate to the GTN repository.

  3. Configure your own Galaxy server: to include the GTN data in your Galaxy server, use the following configuration:

     - type: onedata
       id: gtn_public_onedata
       label: GTN training data
       doc: Training data from the Galaxy Training Network (powered by Onedata)
       # The following token is a public, read-only token that can be shared.
       accessToken: "MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg"
       onezoneDomain: "datahub.egi.eu"
    
  4. Onedata clients — access the data using the public read-only access token and Oneclient (local POSIX mount) or OnedataFS (PyFilesystem interface), e.g.:

     mkdir ~/oneclient
     oneclient \
         -H plg-cyfronet-01.datahub.egi.eu \
         -t MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg \
         ~/oneclient
     ls ~/oneclient/GTN\ data
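Once the space is mounted, it behaves like any local directory, so ordinary file APIs work on it. The following is a minimal Python sketch that counts the files and sums their sizes under a directory. It assumes the mount point used in the Oneclient example above (`~/oneclient/GTN data`); the helper name `summarize_tree` is hypothetical and not part of any Onedata tooling.

```python
from pathlib import Path


def summarize_tree(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    count = 0
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            count += 1
            total += path.stat().st_size
    return count, total


if __name__ == "__main__":
    # Assumed mount point, taken from the oneclient example above.
    mount = Path.home() / "oneclient" / "GTN data"
    if mount.is_dir():
        files, size = summarize_tree(mount)
        print(f"{files} files, {size / 1e9:.1f} GB")
    else:
        print(f"{mount} is not mounted")
```

Walking the whole tree requests metadata for every file, so on a remote mount it may take noticeably longer than on a local disk.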
    

What is the GTN Downloader?

The GTN Downloader makes it easier for users to access and organize data from the Galaxy Training Network (GTN). It is a Python script that automates the download of data from GTN tutorials: it walks through the tutorials in the GTN repository, finds their data-library.yaml files, and creates a structured directory based on the tutorial names and file contents.
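To illustrate the discovery step, here is a minimal sketch, not the GTN Downloader's actual code. It assumes the repository layout `topics/<topic>/tutorials/<tutorial>/data-library.yaml` and a hypothetical output layout `<out_root>/<topic>/<tutorial>`; the function names are also hypothetical.

```python
from pathlib import Path


def find_data_libraries(repo_root):
    """Yield (topic, tutorial, yaml_path) for each data-library.yaml found."""
    root = Path(repo_root)
    # Assumed layout: topics/<topic>/tutorials/<tutorial>/data-library.yaml
    pattern = "topics/*/tutorials/*/data-library.yaml"
    for yaml_path in sorted(root.glob(pattern)):
        tutorial_dir = yaml_path.parent
        topic = tutorial_dir.parent.parent.name
        yield topic, tutorial_dir.name, yaml_path


def plan_output_dirs(repo_root, out_root):
    """Map "<topic>/<tutorial>" to a hypothetical output directory."""
    return {
        f"{topic}/{tutorial}": Path(out_root) / topic / tutorial
        for topic, tutorial, _ in find_data_libraries(repo_root)
    }
```

Calling `plan_output_dirs("training-material", "gtn-data")` on a local checkout would return one entry per tutorial that ships a data-library.yaml file; tutorials without one are skipped.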

Key Features:

  • Automated Data Download: The script finds data-library.yaml files in the GTN repository and downloads the associated data files.
  • Structured Organization: It creates directories based on the tutorial names and the information in the data-library.yaml files, so the downloaded files for each tutorial are easy to locate.
  • Download Summary: It generates a download-summary.tsv file, which includes metadata about the downloaded files, a download report (error, success, already downloaded), and the overall size of the files.
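A tab-separated report like this is easy to post-process. The sketch below counts rows per download status and sums the file sizes. The column names `status` and `size_bytes` are assumptions made for illustration; check the header of the real download-summary.tsv and pass the actual names.

```python
import csv
import io
from collections import Counter


def summarize_report(tsv_text, status_col="status", size_col="size_bytes"):
    """Count rows per status and sum the sizes in a tab-separated report."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    total_bytes = 0
    for row in reader:
        counts[row[status_col]] += 1
        # Treat an empty size cell as zero bytes.
        total_bytes += int(row[size_col] or 0)
    return dict(counts), total_bytes


if __name__ == "__main__":
    # Hypothetical report content, for illustration only.
    sample = (
        "file\tstatus\tsize_bytes\n"
        "a.fastq\tsuccess\t100\n"
        "b.fastq\terror\t0\n"
    )
    print(summarize_report(sample))
```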

Seamless Integration with Onedata

In addition to local downloads, the GTN Downloader can upload data to Onedata, a distributed data management platform. This integration ensures that the latest GTN data is always available to users.

Automated Workflow with GitHub CI/CD:

  • Weekly Schedule: A GitHub Actions workflow runs once a week, at the weekend, to download the latest data from the GTN tutorials and upload it to Onedata.
  • Environment Setup: The workflow sets up necessary environment variables and installs dependencies, including Oneclient, the Onedata POSIX client.
  • Data Upload: After downloading the data, the workflow uploads it to Onedata, making it publicly accessible.

Funding

These organisations or grants provided funding support for the development of this resource

