
The Pangeo ecosystem

Contributors

Basile Goussard, Ryan Abernathey, Anne Fouilloux

Questions

Objectives

Last modification: Mar 1, 2022

About this presentation

.left[ This presentation is a summary of: ]

Speaker Notes


Pangeo in a nutshell

A Community platform for Big Data geoscience

Funders

NSF logo, EarthCube logo, NASA logo, Moore Foundation logo (by Gordon and Betty Moore Foundation, own work, public domain)

Speaker Notes


Motivations

.left[ There are several building crises facing the geoscience community: ]

.left[- Big Data: datasets are growing too rapidly and legacy software tools for scientific analysis can’t handle them. This is a major obstacle to scientific progress.]

.left[- Technology Gap: a growing gap between the technological sophistication of industry solutions (high) and scientific software (low).]

.left[- Reproducibility: a fragmentation of software tools and environments renders most geoscience research effectively unreproducible and prone to failure.]

Speaker Notes


Goals

Pangeo aims to address these challenges through a unified, collaborative effort.

The mission of Pangeo is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.

Speaker Notes


The Pangeo Software Ecosystem

Pangeo approach

Source: Pangeo Tutorial - Ocean Sciences 2020 by Ryan Abernathey, February 17, 2020.

Speaker Notes


Xarray

Xarray is an open source project and Python package that makes working with labeled multi-dimensional arrays simple, efficient, and fun!

Xarray logo

Speaker Notes


What is Xarray?

.left[Xarray expands on NumPy arrays and pandas. Xarray has two core data structures:]

.left[- DataArray is our implementation of a labeled, N-dimensional array. It is a generalization of a pandas.Series.]

.left[- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.]

.left[Source: Xarray documentation]
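To make the two data structures concrete, here is a minimal sketch, assuming NumPy and xarray are installed. The variable names, coordinate values, and data are hypothetical toy values chosen for illustration only.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 2 time steps on a 2 x 3 latitude/longitude grid.
temperature = xr.DataArray(
    np.arange(12, dtype=float).reshape(2, 2, 3),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.array(["2019-01-01", "2019-01-02"], dtype="datetime64[ns]"),
        "lat": [10.0, 20.0],
        "lon": [0.0, 5.0, 10.0],
    },
    name="temperature",
)

# A Dataset is a dict-like container of DataArrays sharing dimensions.
ds = xr.Dataset(
    {
        "temperature": temperature,
        # Deviation from the time mean, computed along the named dimension.
        "anomaly": temperature - temperature.mean("time"),
    }
)

# Label-based selection: pick a grid point by coordinate value, not by index.
point = ds["temperature"].sel(lat=20.0, lon=5.0)
print(point.values)
```

Because dimensions carry names, operations such as `mean("time")` and `sel(lat=..., lon=...)` do not depend on remembering the axis order of the underlying array.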

Speaker Notes


Example

Xarray concept

Xarray dataset

Speaker Notes


iris

A powerful, format-agnostic, community-driven Python package for analysing and visualising Earth science data.

.pull-left[

.pull-right[

.image-40[ IRIS logo ]

]

.left[Source: Scitools Iris documentation]

Speaker Notes


Dask

.pull-left[

Enabling performance at scale for the tools you love

Dask accelerates the existing Python ecosystem (Numpy, Pandas, Scikit-learn) ]

.pull-right[

.image-40[ DASK logo ]

]

.left[Source: Dask documentation]

Speaker Notes


How does Dask accelerate Numpy?

.image-40[ Dask and Numpy ]

.pull-left[

import numpy as np

x = np.ones((1000, 1000))
x + x.T - x.mean(axis=0)

]

.pull-right[

import dask.array as da

x = da.ones((1000, 1000))
x + x.T - x.mean(axis=0)

]
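The two snippets look the same, but they do not run the same way. A minimal sketch, assuming Dask and NumPy are installed: the Dask expression only builds a task graph over chunks, and nothing is computed until `.compute()` is called. The 250 x 250 chunk size is an arbitrary choice for illustration.

```python
import dask.array as da

# Split the 1000 x 1000 array into 16 chunks of 250 x 250 (arbitrary choice).
x = da.ones((1000, 1000), chunks=(250, 250))

# Same expression as with NumPy; this only records the tasks to run.
y = x + x.T - x.mean(axis=0)
print(y.numblocks)  # number of chunks along each axis

# compute() executes the task graph and returns a NumPy array.
result = y.compute()
print(type(result).__name__, result.shape)
```

Working chunk by chunk is what lets the same expression scale to arrays that do not fit in memory, since only a few chunks need to be loaded at a time.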

Speaker Notes


How does Dask accelerate Pandas?

.image-25[ Dask and Pandas ]

.pull-left[

import pandas as pd

df = pd.read_csv("file.csv")
df.groupby("x").y.mean()

]

.pull-right[

import dask.dataframe as dd

df = dd.read_csv("s3://*.csv")
df.groupby("x").y.mean()

]

Speaker Notes


How does Dask accelerate Scikit-Learn?

.image-40[ Dask and Scikit-Learn ]

.pull-left[

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(data, labels)

]

.pull-right[

from dask_ml.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(data, labels)

]

Speaker Notes


jupyter

.pull-left[

Free software, open standards, and web services for interactive computing across all programming languages

]

.pull-right[

.image-40[ Jupyter logo ] ]

.left[Source: Jupyter documentation]

Speaker Notes


Jupyter and Galaxy

Speaker Notes


Analysis Ready, Cloud Optimized Data (ARCO)

Speaker Notes


Example of ARCO Data

Arco data

Speaker Notes


Pangeo Forge

Pangeo Forge Logo

https://pangeo-forge.org

Pangeo Forge is an open source platform for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional repositories and deposit this data in cloud object storage in an analysis-ready, cloud optimized (ARCO) format.

Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages.

Speaker Notes


How does Pangeo Forge work?

.image-40[ pangeo forge explained ]

.image-40[ pangeo forge recipe ]

Speaker Notes


STAC

.center[STAC stands for SpatioTemporal Asset Catalog.]

Speaker Notes


Why STAC?

Each provider has its own catalog and interface.

.pull-left[

Just searching for the data relevant to your project can be hard work…

]

.pull-right[ Why STAC ]

Speaker Notes


Why STAC?

Each provider has its own Application Programming Interface (API).

.pull-left[

If you are a programmer, the situation is exactly the same…

You have to design a new data connector each time…

]

.pull-right[ Why STAC ]

Speaker Notes


Why STAC?

Let’s work together.

.pull-left[

The main purpose of STAC is:

]

.pull-right[ STAC ]

Speaker Notes

Why STAC?

Let’s work together.

.pull-left[

It’s extremely simple: STAC catalogs are composed of three layers:

]

.pull-right[

It’s already used for Sentinel-2 on AWS

.image-90[ Sentinel 2 ]

It’s already used for Landsat 8 on Microsoft

.image-90[ Landsat 8 ]

]

Speaker Notes


How to use STAC

Depending on your needs.

.pull-left[ Storing your data

.image-40[ Storing data ] ]

.pull-right[

Searching data

.image-40[ Searching data ]

]

Speaker Notes


Searching data

Let’s search for data over the main region (France) between 1 January 2019 and 4 June 2019.

.image-100[ Search data over main and specific dates ]
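To make the search criteria concrete, here is a minimal sketch in plain Python of what a spatio-temporal search does: filtering by bounding box and by date range. It does not call a real STAC API. The records in `ITEMS`, the approximate `FRANCE_BBOX` values, and the helper names are all hypothetical and only mimic the shape of STAC Items.

```python
from datetime import datetime, timezone

# Hypothetical, hand-written records that mimic the shape of STAC Items:
# an "id", a "bbox" [west, south, east, north] and a "datetime" property.
ITEMS = [
    {"id": "scene-paris-march", "bbox": [2.2, 48.8, 2.5, 48.9],
     "properties": {"datetime": "2019-03-15T10:30:00Z"}},
    {"id": "scene-paris-august", "bbox": [2.2, 48.8, 2.5, 48.9],
     "properties": {"datetime": "2019-08-01T10:30:00Z"}},
    {"id": "scene-oslo-february", "bbox": [10.6, 59.8, 10.9, 60.0],
     "properties": {"datetime": "2019-02-10T10:30:00Z"}},
]

# Rough bounding box for metropolitan France (an approximation).
FRANCE_BBOX = [-5.2, 41.3, 9.6, 51.1]


def bboxes_intersect(a, b):
    """Return True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_datetime(text):
    """Parse a timestamp ending in 'Z' into a timezone-aware datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def search(items, bbox, start, end):
    """Keep the items that overlap `bbox` and fall within [start, end]."""
    return [
        item for item in items
        if bboxes_intersect(item["bbox"], bbox)
        and start <= parse_datetime(item["properties"]["datetime"]) <= end
    ]


start = datetime(2019, 1, 1, tzinfo=timezone.utc)
end = datetime(2019, 6, 4, 23, 59, 59, tzinfo=timezone.utc)
matches = search(ITEMS, FRANCE_BBOX, start, end)
print([item["id"] for item in matches])
```

Only the March scene over Paris satisfies both criteria: the August scene falls outside the date range and the Oslo scene falls outside the bounding box.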

Speaker Notes


Searching and processing

.image-100[ Search and process ]

Speaker Notes


STAC ecosystem

A lot of projects are now built around STAC.

Speaker Notes


A lot of contributors!

Join and contribute to STAC: https://github.com/radiantearth/stac-spec

.image-100[ STAC contributors ]

Speaker Notes


STAC and Pangeo Forge

Speaker Notes


Using and/or contributing to Pangeo

.left[The Pangeo project is completely open to involvement from anyone with interest.

There are many ways to get involved:]

For more information, consult the Frequently Asked Questions.

Everyone is welcome to the Pangeo Weekly Community Meeting.

Speaker Notes


Learn more

Speaker Notes


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.