Pangeo ecosystem 101 for everyone

Contributors

Anne Fouilloux, Ryan Abernathey


Last modification: Feb 23, 2022

Pangeo in a nutshell

A Community platform for Big Data geoscience

Funders

Logos: NSF, EarthCube, NASA, Gordon and Betty Moore Foundation (Moore logo by Gordon and Betty Moore Foundation, own work, public domain)

Speaker Notes

Pangeo is first and foremost a community promoting open, reproducible, and scalable science. This community provides documentation, develops and maintains Open Source software, and deploys computing infrastructure to make scientific research and programming easier.

Pangeo is funded through many different projects in the USA, Europe and Australia, but the main funders are NSF, EarthCube, NASA and the Gordon and Betty Moore Foundation.


Motivations

There are several building crises facing the geoscience community:

- Big Data: datasets are growing too rapidly and legacy software tools for scientific analysis can’t handle them. This is a major obstacle to scientific progress.
- Technology Gap: a growing gap between the technological sophistication of industry solutions (high) and scientific software (low).
- Reproducibility: a fragmentation of software tools and environments renders most geoscience research effectively unreproducible and prone to failure.

Speaker Notes

The Pangeo project has been motivated by several building crises faced by the geoscience community: Big Data, the Technology Gap and the Reproducibility crisis. Datasets are growing too rapidly and legacy software tools for scientific analysis can’t handle them. This is a major obstacle to scientific progress. Another obstacle is the growing gap between the technological sophistication of industry solutions (high) and scientific software (low). Finally, the fragmentation of software tools and environments renders most geoscience research effectively unreproducible and prone to failure.


Goals

Pangeo aims to address these challenges through a unified, collaborative effort.

The mission of Pangeo is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.



The Pangeo Approach

Pangeo approach

Source: Pangeo 2.0 by Ryan Abernathey, December 22, 2020.

Speaker Notes

To address the issues mentioned earlier, an agile development methodology needs to be adopted. The challenges related to scientific use cases, to Open Source software libraries, and to the HPC and cloud infrastructure needed to answer these scientific questions cannot be tackled separately. For example, scientific users are most often the ones who identify bugs and bottlenecks. When these are reported promptly, developers can set priorities for future releases and give HPC and cloud infrastructure providers relevant input on current trends in the community.


The Pangeo Software ecosystem

Pangeo approach

Source: Pangeo Tutorial - Ocean Sciences 2020 by Ryan Abernathey, February 17, 2020.

Speaker Notes

The Pangeo software ecosystem involves Open Source tools such as Xarray, Iris, Dask, Jupyter, and many other packages. There is no single software package called Pangeo; rather, the Pangeo project serves as a coordination point between scientists, software, and computing infrastructure. In this figure, the Python packages are “layered” based on their dependencies. At the “bottom”, we have the Python programming language itself. On the second layer, we find NumPy and Jupyter Notebooks, which are very common Python tools that you may already know. Xarray makes intensive use of NumPy for its underlying data structures. Iris has what we call a “high-level” user interface, with many functions for analysing and visualising Earth Science data.
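To make the layering concrete, here is a minimal sketch of an Xarray object built on top of a NumPy array. The variable name `temperature`, the grid and the values are all hypothetical and purely illustrative; only standard `numpy` and `xarray` calls are assumed.

```python
import numpy as np
import xarray as xr

# Synthetic data (hypothetical values): 3 time steps on a 2 x 4 lat/lon grid.
values = np.arange(24, dtype="float64").reshape(3, 2, 4)

# Xarray adds named dimensions and coordinate labels on top of the array.
temperature = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": [0, 1, 2],
        "lat": [10.0, 20.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    name="temperature",
)

# The underlying storage is still a plain NumPy array.
underlying = temperature.values
is_numpy = isinstance(underlying, np.ndarray)

# Operations refer to dimension names and labels rather than axis numbers.
time_mean = temperature.mean(dim="time")
point = temperature.sel(lat=20.0, lon=90.0)
```

Here `time_mean` keeps the `lat` and `lon` dimensions, and `point` is the time series at a single labelled grid point. The same labelled-array concepts are what the Xarray-based tools expose.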


Pangeo Galaxy Tools

A growing number of tools available to non-Python programmers

Xarray Galaxy tools

Our objective is to bridge the gap between disciplines and add tools on demand to help cross-disciplinary research

Speaker Notes

If you are not familiar with Python or any other programming language, you may wonder whether there are “Pangeo tools” with a graphical user interface. We have indeed wrapped a number of tools using Python packages from the Pangeo software stack. For instance, tools based on Xarray are available in Galaxy. Even if you are a Python programmer but not very familiar with Earth Science data, learning Xarray concepts through the Galaxy tools can be useful. The main objective is to bridge the gap between disciplines and add tools on demand to help cross-disciplinary research.


How to cite and support Pangeo

Speaker Notes

As with any Open Source community, it is important to acknowledge the community effort that goes into providing open and easy-to-use Python packages and the associated Galaxy Tools. We encourage you to cite Abernathey et al. 2017, Abernathey et al. 2021, Gentemann et al. 2021 and Sambasivan et al. 2021 in your work when using Xarray Galaxy Tools. As mentioned earlier, user feedback is important for improving Open Source tools and packages. Report issues and wishes, and thank the Pangeo community whenever you use tools from the Pangeo software stack!


Learn more

Speaker Notes

If you want to learn more about Pangeo, visit the Pangeo website at https://pangeo.io or the GitHub repository at https://github.com/pangeo-data. Get help on Discourse at https://discourse.pangeo.io/ and follow Pangeo on Twitter at @pangeo_data.


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.

References

  1. Abernathey, R., K. Paul, J. Hamman, M. Rocklin, C. Lepore et al., 2017 Pangeo NSF Earthcube Proposal. 10.6084/m9.figshare.5361094.v1 https://figshare.com/articles/Pangeo_NSF_Earthcube_Proposal/5361094
  2. Abernathey, R. P., T. Augspurger, A. Banihirwe, C. C. Blackmon-Luca, T. J. Crone et al., 2021 Cloud-Native Repositories for Big Scientific Data. Computing in Science & Engineering 23: 26–35. 10.1109/mcse.2021.3059437
  3. Gentemann, C. L., C. Holdgraf, R. Abernathey, D. Crichton, J. Colliander et al., 2021 Science Storms the Cloud. 10.1002/essoar.10506344.2
  4. Sambasivan, N., S. Kapania, H. Highfill, D. Akrong, P. Paritosh et al., 2021 “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI, in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, ACM. 10.1145/3411764.3445518