Cleaning GBIF data for the use in Ecology
Author(s): Yvan Le Bras, Simon Benateau
Overview
Questions:
- How can I get ecological data from GBIF?
- How do I check and clean the data from GBIF?
- Which ecoinformatics techniques are important to know for this type of data?
Objectives:
- Get occurrence data on a species
- Visualize the data to understand them
- Clean GBIF dataset for further analyses
Time estimation: 30 minutes
Published: Oct 28, 2022
Last modification: Jun 27, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00129
Revision: 5
GBIF (Global Biodiversity Information Facility, www.gbif.org) is one of the most remarkable biodiversity data aggregators worldwide, giving access to more than 1 billion records across all taxonomic groups. The data it provides are highly valuable for research. However, these data are heterogeneous, because they are obtained from various collection methods and sources.
In this tutorial we will propose a way to clean occurrence records retrieved from GBIF.
This tutorial is based on the rOpenSci tutorial by Zizka.
Agenda
In this tutorial, we will cover:
- Retrieve data from GBIF
- Get data
- Where do the records come from?
- Filtering data based on the data origin
- Have a look at the number of counts per record
- Filtering data on individual counts
- Have a look at the age of records
- Filtering data based on the age of records
- Taxonomic investigation
- Filtering
- Sub-step with OGR2ogr
- Visualize your data on a GIS oriented visualization
- Conclusion
Retrieve data from GBIF
Get data
Hands-on: Data upload
- Create a new history for this tutorial. To create a new history, simply click the new-history icon at the top of the history panel.
- Import the data from GBIF using the Get species occurrences data tool with the following parameters:
- “Scientific name of the species”: write the scientific name of a species you are interested in, for example Loligo vulgaris
- “Data source to get data from”: Global Biodiversity Information Facility : GBIF
- “Number of records to return”: 999999 is a minimum value
Comment: The spocc Galaxy tool allows you to search species occurrences across a single or many data sources (GBIF, eBird, iNaturalist, EcoEngine, VertNet, BISON). Changing the number of records to return lets you retrieve all occurrences or only a limited number of them. Specifying more than one data source changes the way the output dataset is formatted.
- Check the datatype; it should be tabular. If it is not, change it:
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab on the top
- In Assign Datatype, select tabular from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Add tags to the dataset
- make them propagating tags (tags starting with #)
- make a tag corresponding to the species (#LoligoVulgaris for example here)
- and another tag mentioning the data source (#GBIF for example here).
Tagging datasets like this is good practice in Galaxy. It will help you 1/ find content of particular interest (for example using the filtering option of the history search form) and 2/ quickly see which dataset is associated with which content (notably thanks to the propagated tags).
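For readers who want to see what the retrieval step of this hands-on amounts to outside Galaxy, here is a minimal Python sketch that pages through the public GBIF occurrence search API. It is not what the spocc Galaxy tool runs internally: the function names, the `fetch` parameter and the page size of 300 are assumptions made for this illustration, and the JSON records it returns are not formatted like the tabular output of the Galaxy tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"
PAGE_SIZE = 300  # assumed page size for this sketch


def build_url(scientific_name, offset, limit=PAGE_SIZE):
    """Build the URL of one result page for a scientific name."""
    query = urlencode(
        {"scientificName": scientific_name, "limit": limit, "offset": offset}
    )
    return f"{GBIF_SEARCH_URL}?{query}"


def fetch_json(url):
    """Download and decode one JSON page (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_occurrences(scientific_name, max_records, fetch=fetch_json):
    """Collect up to max_records occurrence records, one page at a time."""
    records = []
    offset = 0
    while len(records) < max_records:
        page = fetch(build_url(scientific_name, offset))
        results = page.get("results", [])
        records.extend(results)
        # Stop when the server reports the last page or returns nothing.
        if page.get("endOfRecords", True) or not results:
            break
        offset += len(results)
    return records[:max_records]


if __name__ == "__main__":
    # Hypothetical usage; this call contacts the GBIF servers.
    records = fetch_occurrences("Loligo vulgaris", max_records=600)
    print(len(records), "records retrieved")
```

Passing the download function as a parameter keeps the paging logic testable without network access, which is why `fetch` is an argument rather than a hard-coded call.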
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags
- Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
More information is in a dedicated #nametag tutorial.
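The propagation rule described above can be pictured with a toy model. The sketch below is only an illustration of the behaviour (outputs inherit the tags of their inputs that start with #); the `Dataset` class and the `run_tool` function are hypothetical and are not part of Galaxy or its API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A toy history dataset: a name and a set of tags."""
    name: str
    tags: set = field(default_factory=set)


def run_tool(output_name, inputs):
    """Create an output that inherits the '#' tags of all its inputs."""
    inherited = {tag for ds in inputs for tag in ds.tags if tag.startswith("#")}
    return Dataset(output_name, inherited)


# The raw occurrences carry two name tags and one ordinary tag.
raw = Dataset("occurrences", {"#LoligoVulgaris", "#GBIF", "to-check"})
filtered = run_tool("Filter on basisOfRecord", [raw])
print(sorted(filtered.tags))  # ['#GBIF', '#LoligoVulgaris']
```

The ordinary tag is not inherited, which is why the tutorial asks for tags starting with #.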
Where do the records come from?
Here we propose to investigate the content of the dataset, looking notably at the “basisOfRecord” attribute, to learn more about the heterogeneity related to the origin of the data collection.
Hands-on: "basisOfRecord" filtering
- Count tool with the following parameters:
- “from dataset”: output (output of Get species occurrences data tool)
- “Count occurrences of values in column(s)”: c[17]
Comment: This tool is one of the important “classical” Galaxy tools that allow you to better summarize the information content of your data. Here we apply this tool to the 17th column (corresponding to the basisOfRecord attribute), but don’t hesitate to investigate other attributes!
Question
- How many different types of data collection origin are there?
- What is your assumption regarding this heterogeneity?
Solution
- 5
- Each basisOfRecord type is related to a different collection method, and therefore to a different data quality.
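Counting the values of a column, as done above, can be sketched in a few lines of Python. This is an illustration on a made-up sample and not the code of the Galaxy Count tool; the function name is hypothetical, and skipping the header line is a choice made in this sketch.

```python
import csv
import io
from collections import Counter


def count_column_values(handle, column_number, header_lines=1):
    """Count each distinct value in a 1-based column of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for _ in range(header_lines):
        next(reader, None)  # skip header lines (a choice made in this sketch)
    return Counter(
        row[column_number - 1] for row in reader if len(row) >= column_number
    )


# Made-up sample: basisOfRecord is column 2 here, column 17 in the tutorial data.
sample = io.StringIO(
    "name\tbasisOfRecord\n"
    "Loligo vulgaris\tHUMAN_OBSERVATION\n"
    "Loligo vulgaris\tPRESERVED_SPECIMEN\n"
    "Loligo vulgaris\tHUMAN_OBSERVATION\n"
)
for value, n in count_column_values(sample, 2).most_common():
    print(value, n)
```

The number of keys in the returned counter is the number of distinct data collection origins.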
Filtering data based on the data origin
Hands-on: Filter data on basisOfRecord GBIF attribute
- Filter tool with the following parameters:
- “Filter”: output (output of Get species occurrences data tool)
- “With following condition”: c17=='HUMAN_OBSERVATION' or c17=='OBSERVATION' or c17=='PRESERVED_SPECIMEN'
- “Number of header lines to skip”: 1
Question
- How many records are kept and what is the percentage of filtered data?
- Why are we keeping only these 3 types of data collection origin?
Solution
- 470 records are kept; 8.79% of records were dropped
- These data collection methods are the most relevant
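The filtering condition and the percentage of dropped records can be reproduced on a small example. The sketch below uses made-up rows and a hypothetical `filter_rows` helper; it mirrors the condition on basisOfRecord but is not the implementation of the Galaxy Filter tool.

```python
KEPT_BASES = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}


def filter_rows(rows, keep, header_lines=1):
    """Return (header + kept rows, percentage of data rows dropped)."""
    header, body = rows[:header_lines], rows[header_lines:]
    kept = [row for row in body if keep(row)]
    dropped_pct = 100 * (len(body) - len(kept)) / len(body) if body else 0.0
    return header + kept, dropped_pct


# Made-up rows: basisOfRecord is the second field here (c17 in the tutorial).
rows = [
    ["name", "basisOfRecord"],
    ["Loligo vulgaris", "HUMAN_OBSERVATION"],
    ["Loligo vulgaris", "FOSSIL_SPECIMEN"],
    ["Loligo vulgaris", "PRESERVED_SPECIMEN"],
    ["Loligo vulgaris", "OBSERVATION"],
]
filtered, dropped = filter_rows(rows, lambda row: row[1] in KEPT_BASES)
print(len(filtered) - 1, "records kept,", f"{dropped:.2f}% dropped")
```

Header lines are set aside before filtering and put back afterwards, which corresponds to the “Number of header lines to skip” parameter.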
- Add to the output dataset a propagating tag corresponding to the filtering criteria, for example #basisOfRecord
Have a look at the number of counts per record
Here we propose to have a look at the individual count of each record, to see whether some records may contain errors.
Hands-on: Summary statistics of count
- Summary Statistics tool with the following parameters:
- “Summary statistics on”: out_file1 (output of Filter tool)
- “Column or expression”: c72
- Add to the output dataset a propagating tag corresponding to the attribute studied, for example #individualCount
Question
- What are the minimum and maximum individual counts?
Solution
- From 1 to 100
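As a rough picture of what such a summary involves, the following sketch computes the count, minimum, maximum and mean of a column while ignoring cells that are empty or not numeric. The helper names and the sample values are made up for this illustration, and the Galaxy Summary Statistics tool may handle missing values differently.

```python
import math
import statistics


def numeric_values(cells):
    """Keep only the cells that can be read as finite numbers."""
    values = []
    for cell in cells:
        try:
            value = float(cell.strip())
        except ValueError:
            continue  # empty or non-numeric cell, for example "" or "NA"
        if math.isfinite(value):
            values.append(value)
    return values


def summarize(values):
    """Return basic summary statistics of a list of numbers."""
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Made-up individualCount cells, including missing values.
column = ["1", "", "NA", "100", "5"]
print(summarize(numeric_values(column)))
```

The count reported next to the minimum and maximum is worth checking: it shows how many records actually carry a value.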
Filtering data on individual counts
Hands-on: Filter data on individualCount GBIF attribute
- Filter tool with the following parameters:
- “Filter”: out_file1 (output of Filter tool)
- “With following condition”: c72>0 and c72<99
- “Number of header lines to skip”: 1
Question
- How many records are kept and what is the percentage of filtered data?
- How can you explain this result?
- Which propagated tag can you propose to add here?
Solution
- 50 records are kept; 89.29% of records were dropped
- A large percentage of the data was dropped because many records have no value in this individual count field
- As in the previous “count” step, you are dealing with the individualCount column, so you can add a #individualCount tag to the output dataset, for example
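To see why a numeric condition can remove so many records, it helps to separate the rows that fall outside the range from the rows that have no value at all. The sketch below does this on made-up cells; `classify_counts` is a hypothetical helper, and it assumes that a cell which cannot be read as a number never satisfies the condition, which is consistent with the answer above.

```python
import math


def classify_counts(cells, low=0, high=99):
    """Sort cells by the outcome of the condition low < value < high."""
    outcome = {"kept": 0, "out_of_range": 0, "missing": 0}
    for cell in cells:
        try:
            value = float(cell.strip())
        except ValueError:
            outcome["missing"] += 1  # empty or non-numeric cell
            continue
        if math.isnan(value):
            outcome["missing"] += 1
        elif low < value < high:
            outcome["kept"] += 1
        else:
            outcome["out_of_range"] += 1
    return outcome


# Made-up individualCount cells: half of them carry no value.
cells = ["1", "", "NA", "100", "12", ""]
print(classify_counts(cells))
# {'kept': 2, 'out_of_range': 1, 'missing': 3}
```

In this example most of the loss comes from missing values rather than from implausible counts, the same pattern as in the tutorial data.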
Have a look at the age of records
Hands-on: Have a look at the age of records through the `year` GBIF attribute, to see whether there are old records that we may prefer not to consider.
- Summary Statistics tool with the following parameters:
- “Summary statistics on”: out_file1 (output of Filter tool)
- “Column or expression”: c41
- Add to the output dataset a propagating tag corresponding to the attribute studied, for example #ageOfRecord
Question
- What are the years of the oldest and most recent records?
- Why do you think it is of interest to treat old and recent records differently?
Solution
- From 1903 to 2018
- We can assume that old records were not made in the same way as recent ones, so keeping old records can increase the heterogeneity of our dataset.
Filtering data based on the age of records
Hands-on: Filter data on ageOfRecord GBIF attribute
- Filter tool with the following parameters:
- “Filter”: out_file1 (output of Filter tool)
- “With following condition”: c41>1945
- “Number of header lines to skip”: 1
Question
- How many records are kept and what is the percentage of filtered data?
- Why are we keeping only data after 1945?
Solution
- 44 records are kept; 11.76% of records were dropped
- This arbitrary date keeps only fairly recent records, but you can specify another year.
- Add to the output dataset a propagating tag corresponding to the filtering criteria, for example #ageOfRecord
Taxonomic investigation
Hands-on: Investigate the taxonomic coverage at the family level
- Count tool with the following parameters:
- “from dataset”: out_file1 (output of Filter tool)
- “Count occurrences of values in column(s)”: c[31]
Comment: This column allows us to look at the different families associated with the records. Normally, when looking at a single species, we will obtain only one family.
Filtering
Hands-on: Filter data on family attribute
- Filter tool with the following parameters:
- “Filter”: out_file1 (output of Filter tool)
- “With following condition”: c31=='Loliginidae'
- “Number of header lines to skip”: 1
Comment: Here we select only the records belonging to the family of interest, Loliginidae.
Question
- Is the filtering here of interest?
- Why can keeping this step be of interest?
Solution
- No, because 100% of records are kept
- Because this is an important step to take into account when processing GBIF data. If your goal is to create your own workflow to reuse on other species, keeping this step can be of interest.
Sub-step with OGR2ogr
Hands-on: Convert occurrence dataset to GIS one for visualization
- OGR2ogr tool with the following parameters:
- “Gdal supported input file”: out_file1 (output of Filter tool)
- “Conversion format”: GEOJSON
- “Specify advanced parameters”: Yes, see full parameter list.
- In “Add an input dataset open option”:
- “Insert Add an input dataset open option”
- “Input dataset open option”: X_POSSIBLE_NAMES=longitude
- “Insert Add an input dataset open option”
- “Input dataset open option”: Y_POSSIBLE_NAMES=latitude
Question
- Did you have access to the standard output and error of the original R script?
- What kind of information can you retrieve here from the standard output and/or error?
Solution
- Yes, of course ;) A preview of stdout is visible when clicking on the history output dataset, and the full report is accessible through the information button, then stdout or stderr (here you can see warnings in stderr)
- The stderr shows several warnings related to the automatic mapping of variable names from GBIF to OGR, plus information about the truncation of a particularly long GeoJSON value
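The conversion performed in this step turns each tabular record into a point located by its longitude and latitude columns, which is what the two open options above designate. The sketch below builds a comparable GeoJSON structure with the Python standard library only. It is an illustration on made-up data, not a replacement for OGR2ogr; the function name is hypothetical, and skipping records without usable coordinates is a choice made here.

```python
import csv
import io
import json


def table_to_geojson(handle, lon_field="longitude", lat_field="latitude"):
    """Convert a tab-separated table with a header into a FeatureCollection."""
    features = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            lon = float(row[lon_field])
            lat = float(row[lat_field])
        except (KeyError, TypeError, ValueError):
            continue  # skip records without usable coordinates
        properties = {
            key: value
            for key, value in row.items()
            if key not in (lon_field, lat_field)
        }
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are ordered longitude, then latitude.
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": properties,
            }
        )
    return {"type": "FeatureCollection", "features": features}


# Made-up sample: one located record and one without coordinates.
sample = io.StringIO(
    "name\tlongitude\tlatitude\n"
    "Loligo vulgaris\t-4.5\t48.4\n"
    "Loligo vulgaris\t\t\n"
)
print(json.dumps(table_to_geojson(sample), indent=2))
```

All remaining columns are kept as feature properties, so the attributes used for filtering stay available in the map view.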
Visualize your data on a GIS oriented visualization
From your GeoJSON Galaxy history dataset, you can launch GIS visualization.
Hands-on: Launch OpenLayers to visualize a map with your filtered records
- Click on the Visualize tab on the upper menu and select Create Visualization
- Click on the OpenLayers icon
- Select the GeoJSON file from your history
- Click on Create Visualization
- Select OpenLayers
Question
- You don’t see OpenLayers? Do you know why?
Solution
- If you don’t see OpenLayers but other visualization types such as Cytoscape, this means your datatype is JSON, not geojson. You have to change the datatype manually before visualizing it.
Conclusion
In this tutorial we learned how to get occurrence records from GBIF and applied several filtering steps to make these data ready for analysis. So now, let’s go for the show!