Biodiversity data exploration

Overview

Questions:
  • How to explore biodiversity data?

  • How to look at homoscedasticity, normality or collinearity of presence-absence or abundance data?

  • How to compare beta diversity taking into account space, time and species components?

Objectives:
  • Explore biodiversity data with taxonomic, temporal and geographical information

  • Get an idea of the quality of the data, both in statistical terms (normality, homoscedasticity) and in terms of temporal and geographical coverage

Time estimation: 1 hour
Last modification: Mar 3, 2022
License: Tutorial content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

This tutorial will guide you through the exploration of biodiversity data containing taxonomic, spatial and temporal information.

We’ll be using Reef Life Survey (RLS) data extracted from the Australian Ocean Data Network (AODN) portal. We’ll use a subset made directly on the AODN data portal (https://portal.aodn.org.au/) from the dataset “IMOS - National Reef Monitoring Network Sub-Facility - Global reef fish abundance and biomass”. We decided to use only data on the Mollusca phylum from the east coast of Australia between 2008 and 2021. We’ll explore this dataset with a view to making statistical analyses: we will check the homoscedasticity and normality of the variables, see whether some variables are correlated, and look at how the data is distributed through space and time. Finally, we’ll explore beta diversity through the computation of the SCBD and LCBD (Species and Local Contributions to Beta Diversity).

Definitions of SCBD and LCBD

Species Contribution to Beta Diversity: degree of variation for individual species across the study area.

Local Contribution to Beta Diversity: comparative indicators of the ecological uniqueness of the sites.

Agenda

In this tutorial, we will cover:

  1. Data preparation
    1. Get data
    2. Customize your dataset
  2. Data checking
    1. Homoscedasticity and normality analysis
    2. Autocorrelation in your data
    3. Check collinearity in your data
  3. Data exploration
    1. Visualize abundance distribution through space
    2. Visualize the number of locations where each taxon is present
    3. Visualize the rarefaction curves of your species
    4. Visualize the dispersion of a numeric variable and the correlation between species absence
  4. Beta diversity
    1. Local and Species Contribution to Beta Diversity (SCBD and LCBD)
  5. Bonus: Want to spatially anonymize your data?
    1. Bonus! Spatial coordinates anonymization

Data preparation

The first step is to upload biodiversity data into your Galaxy history. Here we will use a “classical” biodiversity dataset (containing taxonomic, spatial and temporal information) from the well-known Reef Life Survey initiative.

Get data

Hands-on: Data upload

  1. Create a new history for this tutorial and give it a name (example: “RLS for biodiversity data exploration tutorial”) for you to find it again later if needed.
  2. Import the files from Zenodo

    • Copy the link location
    • Open the Galaxy Upload Manager (upload icon on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries
    • Navigate to the correct folder as indicated by your instructor
    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
  3. Rename the dataset, “reef_life_molluscs” for example, and preview it

    You can see that the dataset hasn’t been detected as a CSV dataframe. This is because RLS data puts the metadata of the dataframe in the first lines, before the dataframe itself, so you’ll have to remove these lines using the Remove beginning Tool: Remove beginning1 with the following parameters:
    • param-text “Remove first”: 72
    • param-file “from”: reef_life_molluscs data file

    Then, check that your new file has no lines starting with a hashtag at the top, and ask Galaxy to auto-detect the datatype (click on the pencil, then “Datatypes”, then click on the “Auto-detect” button). Galaxy should detect it as csv.

  4. Convert datatype CSV to tabular

    Tip: Converting the file format

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click on the Convert tab on the top
    • Select Convert CSV to tabular
    • Click the Convert datatype button
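
The two preparation steps above (dropping the metadata lines, then converting CSV to tabular) can be pictured with a minimal Python sketch. This is only an illustration on a toy string, not what the Galaxy tools run internally; the function name `strip_header_and_convert` and the two-line metadata header are assumptions made for the example (the real file needs 72 lines removed).

```python
import csv
import io


def strip_header_and_convert(text, n_skip):
    """Drop the first n_skip lines, then rewrite the remaining CSV as tab-separated text."""
    remaining = text.splitlines(keepends=True)[n_skip:]
    reader = csv.reader(io.StringIO("".join(remaining)))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Toy input: two hypothetical metadata lines followed by a small CSV table.
raw = "# source: example\n# licence: example\nsite,year,total\nA,2008,3\nB,2009,5\n"
tabular = strip_header_and_convert(raw, n_skip=2)
print(tabular)
```

Using the `csv` module rather than a plain `replace(",", "\t")` keeps quoted fields that contain commas intact.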

Customize your dataset

In order to remove unnecessary information from the table, we will now cut a few columns and change the format of the time information.

Hands-on: Clean your data

  1. Use Advanced cut columns from a table (cut) Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-files “File to cut”: Convert CSV to tabular data file
    • param-select “Operation”: Keep
    • param-select “Delimited by”: Tab
    • param-select “Cut by”: fields
      • param-select “List of Fields”: Column: 8, Column: 10, Column: 11, Column: 12, Column: 25, Column: 28
  2. Use Column Regex Find And Replace Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/regex_find_replace/regexColumn1/1.0.0 with the following parameters:
    • param-files “Select cells from”: Advanced Cut data file
    • param-select “using column”: Column: 4
    • param-repeat Click ”+ Insert Check”:
      • param-text “Find Regex”: ([0-9]{4})-[0-9]{2}-[0-9]{2}
      • param-text “Replacement”: \1
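
As a rough illustration of what these two steps do to each row, here is a small Python sketch. It keeps the six columns listed above and applies the same regular expression to the fourth column of the reduced table. The function name `cut_and_reformat` and the placeholder row are hypothetical, and the Galaxy tools may handle edge cases differently.

```python
import re

KEEP = [8, 10, 11, 12, 25, 28]  # 1-based column numbers, as in the cut step
DATE_RE = re.compile(r"([0-9]{4})-[0-9]{2}-[0-9]{2}")  # same regex as in the tool


def cut_and_reformat(rows, keep=KEEP, date_col=4):
    """Keep the selected columns, then replace full dates by the year
    in column date_col (1-based) of the reduced table."""
    result = []
    for row in rows:
        kept = [row[i - 1] for i in keep]
        kept[date_col - 1] = DATE_RE.sub(r"\1", kept[date_col - 1])
        result.append(kept)
    return result


# Toy row with 28 placeholder fields; field 12 holds a date.
row = [f"f{i}" for i in range(1, 29)]
row[11] = "2014-05-17"
cleaned = cut_and_reformat([row])
print(cleaned[0])
```

The capture group `\1` keeps only the four-digit year, which is why the later steps can treat column 4 as a yearly time variable.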

Data checking

Homoscedasticity and normality analysis

Hands-on: Here we will check the homogeneity of variances (Levene test) for every species, represented through multiple boxplots, and the normality of the distribution (Kolmogorov-Smirnov test), represented by a distribution histogram and a Q-Q plot.

  1. Homoscedasticity and normality Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_homogeneity_normality/ecology_homogeneity_normality/0.0.0 with the following parameters:
    • param-file “Input table”: Column Regex Find and Replace data file
    • param-select “First line is a header line”: Yes
    • param-select “Select column containing temporal data (year, date, …)”: c4
    • param-select “Select column containing species”: c5
    • param-select “Select column containing numerical values (like abundances)”: c6

    You should get three outputs: the Levene test for homoscedasticity dataset, the Kolmogorov-Smirnov test for normality dataset, and a data collection of 9 PNG files representing the homogeneity of variances for each species at each time point of the study. If the Levene test is significant (p-value in column Pr < 0.05 and at least one * at the end of the 4th line), the variances aren’t homogeneous and the hypothesis of homoscedasticity is rejected. If the K-S test is significant (p-value < 0.05), your numerical variable isn’t normally distributed and the hypothesis of normality is rejected. Here, both tests should be significant, so the variances aren’t homogeneous and the data isn’t normally distributed.

Homoscedasticity and normality_example on Sepioteuthis australis.
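
To make the two tests less of a black box, the sketch below computes their test statistics by hand with the Python standard library: a Levene-type W statistic and the Kolmogorov-Smirnov distance to a fitted normal distribution. Several choices here are assumptions, not a description of the Galaxy tool: deviations are taken from the group median, the normal parameters are estimated from the sample, and the abundance values are invented. Turning the statistics into p-values needs the F and Kolmogorov distributions, which are left out.

```python
from statistics import NormalDist, mean, median, stdev


def levene_w(groups):
    """Levene-type W statistic based on absolute deviations from each group median."""
    z = [[abs(x - median(g)) for x in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand_mean = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in z)
    within = sum((x - mean(g)) ** 2 for g in z for x in g)
    return (n_total - k) / (k - 1) * between / within


def ks_d_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    return max(
        max((i + 1) / n - fitted.cdf(x), fitted.cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


# Invented abundances of one species for two years.
by_year = {2008: [1, 2, 3, 4, 5], 2009: [1, 1, 1, 10, 30]}
w = levene_w(list(by_year.values()))
d = ks_d_normal([x for g in by_year.values() for x in g])
print(f"Levene W = {w:.3f}, K-S D = {d:.3f}")
```

A larger W means the spread differs more between years, and a larger D means the values sit further from a normal distribution.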

Autocorrelation in your data

Hands-on: Autocorrelation

  1. Variables exploration Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_link_between_var/ecology_link_between_var/0.0.0 with the following parameters:
    • param-file “Input table”: Column Regex Find and Replace data file
    • param-select “First line is a header line”: Yes
    • param-select “Variables links exploration”: Autocorrelation of one selected numerical variable
      • param-select “Select columns containing numerical values”: c6

    You should get two outputs: one text file containing the autocorrelation function values and one PNG file in the data collection showing the autocorrelation for the variable. The dashed lines represent the 95% confidence interval expected for white noise: if the bars of the plot beyond lag 0 stay within these lines, there is no significant autocorrelation, whereas bars extending beyond them indicate autocorrelation.

    Here, we don’t see any sign of autocorrelation.

Variable_exploration_autocorrelation example.
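
The reading of the autocorrelation plot can be reproduced with a short Python sketch: it computes the sample autocorrelation function and compares each lag with the usual white-noise bounds of ±1.96/√n. The series is invented and the function name `acf` is only illustrative; the Galaxy tool may use a different number of lags.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    return [
        sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Invented abundance series, in the order of the rows of the table.
series = [3, 5, 2, 8, 6, 1, 7, 4, 9, 2, 6, 5]
values = acf(series, max_lag=4)

# Approximate 95% bounds under the white-noise hypothesis (the dashed lines).
bound = 1.96 / math.sqrt(len(series))

# Lag 0 is always 1, so only later lags are compared with the bounds.
flagged = [k for k, r in enumerate(values) if k > 0 and abs(r) > bound]
print("lags outside the bounds:", flagged)
```

An empty list of flagged lags corresponds to a plot where every bar after lag 0 stays between the dashed lines.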

Check collinearity in your data

Hands-on: Collinearity between numerical variables

  1. Variables exploration Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_link_between_var/ecology_link_between_var/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Variables links exploration”: Collinearity between selected numerical variables for each species
      • param-select “Select column containing species”: c5
      • param-select “Select columns containing numerical values”: c4, c6

    You should get two outputs: one describing the species that couldn’t be evaluated, and one PNG file with one plot containing multiple correlation plots and the correlation values between each pair of variables.

Variable_exploration_collinarity example.
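
As an illustration of a per-species collinearity check, the following Python sketch computes the Pearson correlation between year and abundance for each species. The records are invented, and the handling of species with no variation (returned as `None`) is only one plausible reason why a species could not be evaluated; the tool's own criteria are not stated here.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation, or None when one variable has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Invented (species, year, abundance) records.
records = [
    ("sp_a", 2008, 2), ("sp_a", 2009, 4), ("sp_a", 2010, 6),
    ("sp_b", 2008, 5), ("sp_b", 2009, 5), ("sp_b", 2010, 5),
]

per_species = defaultdict(lambda: ([], []))
for species, year, abundance in records:
    per_species[species][0].append(year)
    per_species[species][1].append(abundance)

correlations = {sp: pearson(xs, ys) for sp, (xs, ys) in per_species.items()}
print(correlations)
```

Values close to 1 or -1 signal that the two variables carry largely the same information for that species.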

Data exploration

Visualize abundance distribution through space

Hands-on: Abundance map in the environment

  1. Presence-absence and abundance Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_presence_abs_abund/ecology_presence_abs_abund/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Variables presence, absence and abundance”: Abundance map in the environment
      • param-select “Select column containing latitudes “: c2
      • param-select “Select column containing longitudes”: c3
      • param-text “What do you study in this analysis ?”: Molluscs of Australian east coast
      • param-select “Select column containing taxon “: c5
    • param-select “Select column containing abundances “: c6

    You should get two outputs: one with the map of the abundance through space with the coordinates, and one text file informing you about the geographical extent of your map.

Presence-absence-example.

Visualize the number of locations where each taxon is present

Hands-on: Presence count of taxons (barplot)

  1. Presence-absence and abundance Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_presence_abs_abund/ecology_presence_abs_abund/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Variables presence, absence and abundance”: Presence count of taxons (barplot)
      • param-select “Select column containing your separation variable”: c1
      • param-select “Select column containing taxon”: c5
    • param-select “Select column containing abundances “: c6

    You should get two outputs: one data collection with 120 PNG files (one for each site) representing the number of locations where each taxon is present, and one text file informing you about the locations used.
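
The quantity shown in these barplots can be sketched in a few lines of Python: for each taxon, count the distinct locations where its abundance is above zero. The records and the function name `presence_counts` are invented for the example, and the grouping by separation variable done by the tool is left out.

```python
from collections import defaultdict


def presence_counts(records):
    """Number of distinct locations where each taxon has an abundance above zero."""
    locations = defaultdict(set)
    for location, taxon, abundance in records:
        if abundance > 0:
            locations[taxon].add(location)
    return {taxon: len(found) for taxon, found in locations.items()}


# Invented (location, taxon, abundance) records.
records = [
    ("site1", "sp_a", 3),
    ("site2", "sp_a", 1),
    ("site2", "sp_b", 0),
    ("site3", "sp_b", 2),
    ("site1", "sp_a", 5),
]
counts = presence_counts(records)
print(counts)
```

Using a set per taxon means repeated surveys at the same location are counted once.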

Visualize the rarefaction curves of your species

Hands-on: Rarefaction curve of species

  1. Presence-absence and abundance Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_presence_abs_abund/ecology_presence_abs_abund/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Variables presence, absence and abundance”: Rarefaction curve of species
      • param-text “Size of subsamples”: 200
      • param-select “Select column containing species”: c5
    • param-select “Select column containing abundances “: c6

    You should get two outputs: one data collection with one PNG file representing the rarefaction curves of each species in one graph, and one tabular file with log information.

Variable_exploration_rarefaction curves.
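
For readers unfamiliar with rarefaction, the sketch below uses the classic formula for the expected number of species in a random subsample of n individuals: E(S_n) = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N_i is the abundance of species i and N the total. Whether the Galaxy tool computes its curves exactly this way is an assumption, and the abundances are invented.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected number of species in a random subsample of n individuals."""
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the total abundance")
    return sum(1 - comb(total - a, n) / comb(total, n) for a in abundances)


# Invented abundances of five species at one site (100 individuals in total).
site_counts = [50, 30, 15, 4, 1]
curve = [(n, rarefied_richness(site_counts, n)) for n in (1, 10, 25, 50, 100)]
for n, expected in curve:
    print(n, round(expected, 2))
```

The curve starts at one species for a single individual and reaches the observed richness when the subsample equals the full sample; a curve that is still rising steeply at the largest subsample size suggests that sampling has not captured all species.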

Visualize the dispersion of a numeric variable and the correlation between species absence

Hands-on: Boxplot of dispersion and correlation of absence

  1. Statistics on presence-absence Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_stat_presence_abs/ecology_stat_presence_abs/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Select a column containing numerical values (such as the abundance) “: c6
    • param-select “Select the column of the x-axis : most commonly species”: c5
    • param-select “Select column containing locations “: c1
    • param-select “Select column containing temporal data (year, date, …) “: c4

    You should get two outputs: one PNG file with the boxplot and dispersion plot of the abundance, and one plot representing whether the absences of several species are correlated. In the second plot, a cross on a circle means that the absences of the two related species aren’t correlated; the other pairs are correlated and the species seem to be co-absent.

Boxplot dispersion_example. Absence correlation_example.

Beta diversity

Local and Species Contribution to Beta Diversity (SCBD and LCBD)

Hands-on: Compute LCBD and SCBD

  1. Local Contributions to Beta Diversity (LCBD) Tool: toolshed.g2.bx.psu.edu/repos/ecology/ecology_beta_diversity/ecology_beta_diversity/0.0.0 with the following parameters:
    • param-file “Input table”: formatted biodiversity data file
    • param-select “First line is a header line”: Yes
    • param-select “Select column with abundances”: c6
    • param-select “Select column with locations”: c1
    • param-select “Select column containing taxon”: c5
    • param-select “Select column containing dates”: c4
    • param-select “Other LCBD : spatialized representation or xy plot.”: Spatialized representation
      • param-select “Select column containing latitudes in decimal degrees”: c2
      • param-select “Select column containing longitudes in decimal degrees”: c3

    You should get the following outputs: two text files, each containing a table with information on the beta diversity, one text file listing the species that have an SCBD larger than the mean SCBD, and one data collection with PNG files showing multiple plots, each according to one type of variable, to visualize the beta diversity.

Variable_exploration_example. Variable_exploration_example. Variable_exploration_example. Variable_exploration_example.
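
To show where LCBD and SCBD values come from, here is a minimal Python sketch of the variance-partitioning approach: the site-by-species table is Hellinger-transformed, squared deviations from the species means are computed, and their sums by row and by column are divided by the total sum of squares. The Hellinger transformation, the abundance table and the species names are assumptions made for the example; the Galaxy tool may use another transformation.

```python
import math


def beta_contributions(matrix):
    """Return (LCBD per site, SCBD per species, total beta diversity).

    matrix: one row per site, one column per species; every site must
    contain at least one individual.
    """
    # Hellinger transformation: square root of relative abundances.
    hell = [[math.sqrt(v / sum(row)) for v in row] for row in matrix]
    n_sites, n_species = len(hell), len(hell[0])
    col_means = [sum(r[j] for r in hell) / n_sites for j in range(n_species)]
    # Squared deviations from the species (column) means.
    sq = [[(r[j] - col_means[j]) ** 2 for j in range(n_species)] for r in hell]
    ss_total = sum(sum(r) for r in sq)
    lcbd = [sum(r) / ss_total for r in sq]
    scbd = [sum(r[j] for r in sq) / ss_total for j in range(n_species)]
    return lcbd, scbd, ss_total / (n_sites - 1)


# Invented abundance table: 4 sites (rows) x 3 species (columns).
names = ["sp_a", "sp_b", "sp_c"]
table = [[10, 0, 5], [8, 2, 5], [0, 12, 1], [9, 1, 6]]
lcbd, scbd, bd_total = beta_contributions(table)

# Species contributing more than the mean SCBD.
mean_scbd = sum(scbd) / len(scbd)
above_mean = [name for name, s in zip(names, scbd) if s > mean_scbd]
print(lcbd, scbd, above_mean)
```

Both sets of contributions sum to 1, so a site with a high LCBD is one whose composition departs most from the average, and a species with a high SCBD is one that varies most across sites.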

Conclusion

Here, you have explored your biodiversity dataframe and you now know a lot more about your data. You can now confidently carry out your statistical analyses, as most of the red flags you could run into have been revealed by this tool suite. Enjoy!

Bonus: Want to spatially anonymize your data?

An optional extra step is to apply spatial coordinates anonymization, which lets you share data without its spatial context. This is of particular interest if you want to share data on endangered species.

Bonus! Spatial coordinates anonymization

Hands-on: Anonymize spatial coordinates

  1. Spatial coordinates anonymization Tool: toolshed.g2.bx.psu.edu/repos/ecology/tool_anonymization/tool_anonymization/0.0.0 with the following parameters:
    • param-file “Input table”: Column Regex Find and Replace data file
    • param-select “First line is a header line”: Yes
    • param-select “Select column containing latitudes in decimal degrees”: c2
    • param-select “Select column containing longitudes in decimal degrees”: c3

Key points

  • Explore your data before diving into deep analysis

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Ecology topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Citing this Tutorial

  1. Olivier Norvez, Marie, Coline Royaux, Yvan Le Bras, 2022 Biodiversity data exploration (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/ecology/tutorials/biodiversity-data-exploration/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{ecology-biodiversity-data-exploration,
author = "Olivier Norvez and Marie and Coline Royaux and Yvan Le Bras",
title = "Biodiversity data exploration (Galaxy Training Materials)",
year = "2022",
month = "03",
day = "03",
url = "\url{https://training.galaxyproject.org/training-material/topics/ecology/tutorials/biodiversity-data-exploration/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}

Congratulations on successfully completing this tutorial!