Python - Globbing

Overview

Questions:
  • How can I collect a list of files?

Objectives:
  • Use glob to collect a list of files

  • Learn about the potential pitfalls of glob

Time estimation: 15 minutes
Level: Intermediate
Last modification: Apr 25, 2022
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook in one of the following ways:

Launching the notebook in Jupyter in Galaxy

  1. Instructions to Launch JupyterLab
  2. Open a Terminal in JupyterLab with File -> New -> Terminal
  3. Run wget https://training.galaxyproject.org/archive/2022-06-01/topics/data-science/tutorials/python-glob/data-science-python-glob.ipynb
  4. Select the notebook that appears in the list of files on the left.

Downloading the notebook

  1. Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
  2. Save Link As..

Globbing is the term used in computer science for listing all the files whose names match some pattern.

Agenda

In this tutorial, we will cover:

  1. Setup
  2. Finding Files
  3. Finding files in directories
  4. Exercise
  5. Pitfalls

Setup

We’ll start by creating some files for use in the rest of this tutorial:

import os
from pathlib import Path

dirs = ['a', 'a/b', 'c', 'c/e', 'd', '.']
files = ['a.txt', 'a.csv', 'b.csv', 'b.txt', 'e.glm']

for d in dirs:
    # Create the directory (no error if it already exists)
    os.makedirs(d, exist_ok=True)
    # Create the same set of empty files in every directory
    for f in files:
        # Path.touch() works on any platform, unlike the external 'touch' command
        Path(d, f).touch()

Now we should have a pretty full folder!

Finding Files

We can use the glob module to find files:

import glob
print(glob.glob('*.csv'))
print(glob.glob('*.txt'))

Here we use an asterisk (*) as a wildcard. It matches any run of characters within a single file or directory name, but never crosses into folders. The two calls above list all the csv files and all the txt files in the current directory. This is a great way to find files matching a pattern.

We can also use an asterisk anywhere in the pattern; it doesn’t have to stand in for the filename portion:

print(glob.glob('a*'))

Here we even see a third entry besides a.txt and a.csv: the directory a. glob matches directories as well as files.
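If you only want files, you can filter the matches yourself. The following is a minimal sketch, not part of the original notebook: it builds its own small tree in a temporary directory, and the helper name only_files is hypothetical rather than something glob provides.

```python
import glob
import os
import tempfile


def only_files(pattern, root):
    """Return sorted matches of pattern under root that are regular files.

    Hypothetical helper for illustration; paths are returned relative to root.
    """
    # glob.escape protects any special characters in the root path itself
    matches = glob.glob(os.path.join(glob.escape(root), pattern))
    return sorted(os.path.relpath(m, root) for m in matches if os.path.isfile(m))


with tempfile.TemporaryDirectory() as root:
    # Two files and one directory whose names all start with 'a', plus one other file
    os.makedirs(os.path.join(root, 'a'))
    for name in ['a.txt', 'a.csv', 'b.csv']:
        open(os.path.join(root, name), 'w').close()

    everything = sorted(
        os.path.relpath(m, root)
        for m in glob.glob(os.path.join(glob.escape(root), 'a*'))
    )
    files_only = only_files('a*', root)

print(everything)  # ['a', 'a.csv', 'a.txt'] (the directory is included)
print(files_only)  # ['a.csv', 'a.txt']
```

Filtering with os.path.isdir instead would keep only the directories.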

Finding files in directories

Until now we’ve found only files in a single top level directory, but what if we wanted to find files in subdirectories?

Only need a single directory? Just include that!

print(glob.glob('a/*.csv'))

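But if you need more levels, or want to look in all folders, then you need the double wildcard! With two asterisks (**) and the recursive=True argument, we can search recursively through directories for files. Without recursive=True, ** behaves just like a single * and only looks one directory down:

print(glob.glob('**/a.csv', recursive=True))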
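One detail worth checking for yourself: in Python's glob module, ** only descends through directories when recursive=True is passed; otherwise it acts like a single *. The sketch below compares the two on a throwaway temporary directory that mimics part of the layout above. The names shallow, deep and relative are illustrative only.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Put an a.csv at the top level, in a/, and in a/b/
    for d in ['', 'a', os.path.join('a', 'b')]:
        folder = os.path.join(root, d)
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, 'a.csv'), 'w').close()

    pattern = os.path.join(glob.escape(root), '**', 'a.csv')

    def relative(paths):
        # Paths relative to root, with forward slashes, in a fixed order
        return sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in paths)

    shallow = relative(glob.glob(pattern))                  # '**' acts like '*'
    deep = relative(glob.glob(pattern, recursive=True))     # '**' spans any depth

print(shallow)  # ['a/a.csv']
print(deep)     # ['a.csv', 'a/a.csv', 'a/b/a.csv']
```

Note that the recursive form also matches zero directories, so the top-level a.csv is included.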

Exercise

Question: Where in the world is the CSV?

  1. How would you find all .csv files?
  2. How would you find all .txt files?
  3. How would you find all files starting with the letter ‘e’?

Solution

  1. glob.glob('**/*.csv', recursive=True)
  2. glob.glob('**/*.txt', recursive=True)
  3. glob.glob('**/e*', recursive=True) (this also matches the directory c/e)

# Try things out here!

Pitfalls

Some analyses (especially simulations) can depend on data input order or data sorting. This was recently seen in Neupane et al. 2019, where the data files used were sorted one way on Windows and another on Linux, giving different results for the same code and the same datasets! Yikes!

If you know your analyses are dependent on file ordering, then you can use sorted() to make sure the data is provided in a uniform way every time.

print(sorted(glob.glob('**/a.csv', recursive=True)))

If you’re not sure whether your results depend on file order, you can sort anyway. Better yet, randomise the order of the inputs to check that your code behaves properly in any scenario.
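One way to act on this advice is sketched below. It is an illustration only: analyse is a hypothetical stand-in for a real analysis, and is_order_independent is a hypothetical helper that runs a function on the sorted inputs, on the reversed order, and on several shuffled copies, then checks that the answer never changes.

```python
import random


def analyse(paths):
    """Hypothetical stand-in for a real analysis: total length of all names."""
    return sum(len(p) for p in paths)


def is_order_independent(func, inputs, trials=20, seed=0):
    """Return True if func gives the same result for every input order tried."""
    rng = random.Random(seed)  # fixed seed so the check itself is reproducible
    expected = func(sorted(inputs))
    # Always try the reversed order, then a number of random shuffles
    candidates = [sorted(inputs, reverse=True)]
    for _ in range(trials):
        shuffled = list(inputs)
        rng.shuffle(shuffled)
        candidates.append(shuffled)
    return all(func(order) == expected for order in candidates)


# In practice the inputs would come from glob.glob(...); a literal list keeps
# this sketch runnable anywhere.
inputs = ['a/a.csv', 'c/a.csv', 'd/a.csv', 'a.csv']

sum_is_stable = is_order_independent(analyse, inputs)
first_is_stable = is_order_independent(lambda ps: ps[0], inputs)

print(sum_is_stable)    # True: a sum does not care about order
print(first_is_stable)  # False: "take the first file" depends on order
```

A check like this cannot prove order independence, but a failure is a clear sign that the inputs should be sorted before use.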

Key points

  • If your data is ordering dependent, sort your globs!

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Foundations of Data Science topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. Neupane, J. B., R. P. Neupane, Y. Luo, W. Y. Yoshida, R. Sun et al., 2019 Characterization of Leptazolines A–D, Polar Oxazolines from the Cyanobacterium Leptolyngbya sp., Reveals a Glitch with the “Willoughby–Hoye” Scripts for Calculating NMR Chemical Shifts. Organic Letters 21: 8449–8453. 10.1021/acs.orglett.9b03216


Citing this Tutorial

  1. Helena Rasche, Donny Vrins, Bazante Sanders, 2022 Python - Globbing (Galaxy Training Materials). https://training.galaxyproject.org/archive/2022-06-01/topics/data-science/tutorials/python-glob/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{data-science-python-glob,
author = "Helena Rasche and Donny Vrins and Bazante Sanders",
title = "Python - Globbing (Galaxy Training Materials)",
year = "2022",
month = "04",
day = "25",
url = "\url{https://training.galaxyproject.org/archive/2022-06-01/topics/data-science/tutorials/python-glob/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!