Cleaning GBIF data using OpenRefine
Author(s) | Yvan Le Bras Sophie Pamerlon Laura Russell |
OverviewQuestions:Objectives:
How can I use OpenRefine to clean data?
How do I check and clean biodiversity data using OpenRefine?
Requirements:
Use OpenRefine faceting functionalities to apply mass editing and manage duplicates
Use OpenRefine clustering and filtering functionalities to edit, transform data notably using regular expression
Use OpenRefine to apply API services on your data
Time estimation: 45 minutesSupporting Materials:Published: Jan 21, 2025Last modification: Jan 21, 2025License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITversion Revision: 1
In this tutorial we will use OpenRefine tool to clean occurrence records retrieved from GBIF.
This tutorial is based on the GBIF Laure Russell 2021 “Biodiversity data mobilization course” accessible on GBIF website on Use case 1 “Herbarium Specimens” using file from Exercise 3C, at the bottom of the page, reachable through this direct URL.
AgendaIn this tutorial, we will cover:
Retrieve data from GBIF “Biodiversity data mobilization course”
Hands-on: Data import and project creation
Create a new history for this tutorial and give it a name (example: “OpenRefine tutorial”) for you to find it again later if needed.
To create a new history simply click the new-history icon at the top of the history panel:
- Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
- Type the new name
- Click on Save
- To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name
- Press Enter
Import data file from GBIF Laure Russell 2021 “Biodiversity data mobilization course” accessible on GBIF website on Use case 1 “Herbarium Specimens” using file from Exercice 3C, at the bottom of the page
“Herbarium Specimens” data file as a csv file named “UC1-3c-open-refine.csv”
https://docs.gbif.org/course-data-mobilization/exercise-data/UC1-3c-open-refine.csv
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
- Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
- param-file “Select lines from”:
UC1-3c-open-refine.csv
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”:
;
- “Replacement”:
\t
Change the datatype from “txt” to “tabular”
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
tabular
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Check your output. All columns must be separated by a tabulation now. See example below.
Rename the datasets if needed, for example with “UC1-3c-open-refine-tabular” to keep orginal name and specify the tabular format.
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Deploy an OpenRefine instance and push your data in
Hands-on: Deploy an OpenRefine instance and import data from your Galaxy history
Faceting
Faceting is a feature that will allow us to get a big picture overview of the data, and to filter down to just the subset of rows that we want to change or view in bulk. It facilitates the use and analysis of data and can be done with cells containing any kind of text, numbers and dates.
Hands-on: Hands-on: Faceting and mass editing
- Go to column “kingdom”, and then click on the column menu and follow the route to
Text facet
.- Click on count to sort by count, then click on name to sort alphabetically
- Fix the spelling mistakes (Plante -> Plantae). Place the cursor over the text in the window and click on
edit
, then fix the error in the text box, and to save click on apply.
- All the values will be fixed automatically.
Question
- How many Plantae occurences you now have?
- 736
Hands-on: Hands-on: Faceting and white space 1
- Go to “Country col.” column and click on column menu and perform a
Text Facet
- On a quick view, the country appears to be spelled correctly, but the facet shows three different values due to the extra spaces at the end of the text.
- Fix the error from the column menu on “Country col.” column, following the route
Edit Cells > Common transforms > Trim leading and trailing whitespace
.
- You will see a notification message “Text transform on 38 cells in column Country col.: value.trim()” and the possibility to come back on previous state clicking on the
undo
hyperlink.- Now check the facet window; only one value will remain.
- Go to column “Full name” and click on then go to
Text facet
. Then click oncount
.
- Guzmania lingulata is the first item in the list with 25 specimens, but it is also present in the 4th position with 20 specimens.
- Fix the error from the “Full name” column menu,
Edit Cells > Common transforms > Collapse consecutive whitespaces
.
Question
- How many Guzmania lingulata records only appear in the list finally?
- 45
- Go to column catalog in “Cat. Numb”, and follow the route
Facet > Customized facets > Duplicates facet
.
- The facet will show 4 duplicates
- Click on true, and you’ll see the values in the main window
- After a check with the specimens labels, fix the values clicking edit directly on the cell with the correct catalogue numbers
- UWP:122470 Vargas P
- UWP:122471 Vargas I
- UWP:157351 Betancur H
- UWP:157339 Betancur J
Filtering
- Go again to “Full name” column menu and perform a
Text facet
to visualize the values- Then go again to the column menu and click on
Text filter
- Perform the following filters and fix them as described below:
- search for “sp1” entries -> Then remove it and obtain “Cyperus”, clicking on
Edit
directly in the cell- search for “SP2” entries, check
case sensitive
-> To remove it and obtain “Cyperus”, you canEdit
directly in the cell- search for “spp”.
- Go to “Full name” column menu, then click
Edit cells > Transform
- In the text box paste the formula
value.replace(" spp.", "")
and clickok
Question
- Do you think you can use also
Edit cells > Transform
formula to apply the changes you made manually for the 2 first terms searched (“sp1” and “SP2”)?- Why you can easily edit directly these 2 first entries and not the third one?
- Yes, using respectively
value.replace(" sp1", "")
andvalue.replace(" SP2", "")
formulas- Because the 2 first searchs only concern one entry for each when the third one return several.
- Go to column “genus” and perform a
Text filter
.- Check
regular expression
andcase sensitive
, then paste the expression “^[a-z]”
- This regular expression filters the strings in which the first letter is lowercase.
Perform a correction since the genus should be capitalized.
CommentIf you want to know more about regular expressions, you can consult the Openrefine regular expressions documentation and specifically the documentation for GREL function
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression Matches abc
an occurrence of abc
within your data(abc|def)
abc
ordef
[abc]
a single character which is either a
,b
, orc
[^abc]
a character that is NOT a
,b
, norc
[a-z]
any lowercase letter [a-zA-Z]
any letter (upper or lower case) [0-9]
numbers 0-9 \d
any digit (same as [0-9]
)\D
any non-digit character \w
any alphanumeric character \W
any non-alphanumeric character \s
any whitespace \S
any non-whitespace character .
any character \.
{x,y}
between x and y repetitions ^
the beginning of the line $
the end of the line Note: you see that characters such as
*
,?
,.
,+
etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So\?
matches the question mark character exactly.Examples
Regular expression matches \d{4}
4 digits (e.g. a year) chr\d{1,2}
chr
followed by 1 or 2 digits.*abc$
anything with abc
at the end of the line^$
empty line ^>.*
Line starting with >
(e.g. Fasta header)^[^>].*
Line not starting with >
(e.g. Fasta sequence)Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups
(...)
, which we can refer to using\1
,\2
etc for the first and second captured values. If you want to refer to the whole match, use&
.
Regular expression Input Captures chr(\d{1,2})
chr14
\1 = 14
(\d{2}) July (\d{4})
24 July 1984 \1 = 24
,\2 = 1984
An expression like
s/find/replacement/g
indicates a replacement expression, this will search (s
) for any occurrence offind
, and replace it withreplacement
. It will do this globally (g
) which means it doesn’t stop after the first match.Example:
s/chr(\d{1,2})/CHR\1/g
will replacechr14
withCHR14
etc.You can also use replacement modifier such as convert to lower case
\L
or upper case\U
. Example:s/.*/\U&/g
will convert the whole text to upper case.Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the
s/../../g
structure.There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
Question
- Try to find a regular expression to apply the same kind of
Text filter
operation than on the previous exercise. Which regular expression can work?
- On my side, I found this formla
value.replace(value.substring(0,1),toUppercase(value.substring(0,1)))
- Go to column “Full name” and perform a
Text filter
.- Check
regular expression
andcase sensitive
, then paste the expression^[A-Z].*\s[A-Z]
- This regular expression filters the strings that start with a capital letter followed by 0 or more characters, then a space, then a capital letter.
- Perform a correction since the second word of the name should be lowercase.
Clustering
- Go to County and perform a
Text facet
.
- Keep in mind that the correct counties are: “Flores”, “La Libertad”, “Melchor de Mencos”, “San Andres” and “San Jose”.
- On the top right of the facet window click on
Cluster
, a new window will appear.- Click on the
Cluster
button from this new window.- Now you can see information about the clusters:
- “Cluster size”: the number of different versions that the clustering algorithm believes to be the same.
- “Row count”: the number of records with any of the cluster values.
- “Values in cluster”: the actual values that the algorithm believes to be the same. There is also the number of records with each particular value, and the possibility to browse the contents of the cluster in a different tab.
- “Merge?”: check if values are to be merged into a single standard value.
- “New cell value”: the value to be applied to every record in the cluster. By default, it is the value with most records. You can also click on any value to apply that to the New cell value.
CommentIf you want to know more about clustering have a look at the manual
- Click on
Select All
and then onMerge Selected & close
- You will see a notification message “Mass edit 119 cells in column County”.
- To fix the remaining counties go again to Cluster in the facet window of Count.
- In the Cluster and edit window, go to
Keying Function
, then selectngram-fingerprint
- set “1” as the value in
n-Gram Size
.- Press the
Cluster
button, you normally see a cluster about “San Andres” of size “4”.- Click on
Select All
and then onMerge Selected & close
, you will see a notification message “Mass edit 360 cells in column County”.
- Your counties are now fixed! Congratulation!
Exporting
- On the upper right corner click on
Export
and selectGalaxy exporter
.
- A notification message as “Dataset has been exported to Galaxy, please close this tab” is displayed.
- You normally have your resulting data file exported on your Galaxy history as “openrefine-Galaxt file.tsv” dataset.
CommentYou can also download directly file on your computer choosing
Export
andCustom tabular exporter
, allowing you to select notably columns and rows you want to download.
API use
Reconciliation matches the information in one of your columns to an outside database. This is particularly helpful when it comes to name validation, as it proves the name you have exists somewhere else. This is a really useful service, but can be time consuming. In this case we will go through the process with only three records using the API from GBIF.
- Go to “Collector” column, then make a
Text facet
. Select the collector “Elsa P”.- Under “Full name”, click on column menu and then
Edit column > Add column by fetching URLs…
- call the new column “Api_name”
- Change the Throttle Delay to 250 and paste the expression
http://api.gbif.org/v1/species/match?verbose=true&name="+escape(value,'url')
- Click ok and wait, this might take some time depending on internet connection and the number of taxa.
- Go to “Api_name”, click on column menu and then
Edit column > Add column based on this column...
.
- Call the new column “higherClassification” and paste the expression:
value.parseJson().get("kingdom")+ ", "+value.parseJson().get("phylum")+ ", "+value.parseJson().get("class")+ ", "+value.parseJson().get("order")+ ", "+value.parseJson().get("family")
You will see the Kingdom, Phylum, Class, Order and family of each taxon.
- Under “higherClassification” follow the route
Edit column > Split into several columns…
, leave the initial settings.- Now you know how to obtain the taxonomic categories of a given taxon if this is available in the GBIF API. Column names can be edited in
Edit column > Rename this column
.- For the purpose of the original GBIF workshop, the columns created in this exercise (Higher taxonomy) must be deleted.
- Under All, which is the first column, go to
Edit columns > Re-order / remove columns…
.- Remove columns “Api_name”, “higherClassification 1”, “higherClassification 2”, “higherClassification 3”, “higherClassification 4” and “higherClassification 5”.
- No need to export this file as it is normally come back to previous version you already exported.
Conclusion
Here you learned how to use OpenRefine tool from Galaxy platform to clean Biodiversity data. This tutorial notably allowed you to apply some basic but powerfull functionnalities of OpenRefine to clean your data.