Advanced SQL

Overview
Creative Commons License: CC-BY Questions:
  • How can I calculate sums, averages, and other summary values?

  • How can I combine data from multiple tables?

  • How should I format data in a database, and why?

  • How can I create, modify, and delete tables and data?

  • How can I access databases from programs written in Python?

Objectives:
  • Define aggregation and give examples of its use.

  • Write queries that compute aggregated values.

  • Trace the execution of a query that performs aggregation.

  • Explain how missing data is handled during aggregation.

  • Explain the operation of a query that joins two tables.

  • Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.

  • Write queries that join tables on equal keys.

  • Explain what primary and foreign keys are, and why they are useful.

  • Explain what an atomic value is.

  • Distinguish between atomic and non-atomic values.

  • Explain why every value in a database should be atomic.

  • Explain what a primary key is and why every record should have one.

  • Identify primary keys in database tables.

  • Explain why database entries should not contain redundant information.

  • Identify redundant information in databases.

  • Write statements that create tables.

  • Write statements to insert, modify, and delete records.

  • Write short programs that execute SQL queries.

  • Trace the execution of a program that contains an SQL query.

  • Explain why most database applications are written in a general-purpose language rather than in SQL.

Requirements:
Time estimation: 3 hours
Level: Introductory Introductory
Supporting Materials:
Published: Oct 11, 2021
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00106
version Revision: 12
Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook one of the following ways

Launching the notebook in Jupyter in Galaxy

  1. Instructions to Launch JupyterLab
  2. Open a Terminal in JupyterLab with File -> New -> Terminal
  3. Run wget https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-advanced/data-science-sql-advanced.ipynb
  4. Select the notebook that appears in the list of files on the left.

Downloading the notebook

  1. Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
  2. Save Link As..

Quiz: SQL Advanced Recap

Check your knowledge with a quiz!

  • Self Study Mode - do the quiz at your own pace, to check your understanding.
  • Classroom Mode - do the quiz synchronously with a classroom of students.
Comment

This tutorial is significantly based on the Carpentries Databases and SQL lesson, which is licensed CC-BY 4.0.

Abigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.” Version 2017.08, August 2017, github.com/swcarpentry/sql-novice-survey, https://doi.org/10.5281/zenodo.838776

Adaptations have been made to make this work better in a GTN/Galaxy environment.

Agenda

In this tutorial, we will cover:

  1. Aggregation
    1. Averaging with NULL
  2. Combining Data
  3. Data Hygiene
  4. Creating and Modifying Data
# This preamble sets up the sql "magic" for jupyter. Use %%sql in your cells to write sql!
!python3 -m pip install ipython-sql sqlalchemy
!wget -c http://swcarpentry.github.io/sql-novice-survey/files/survey.db
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite:///survey.db")
%load_ext sql
%sql sqlite:///survey.db
%config SqlMagic.displaycon=False

Aggregation

We now want to calculate ranges and averages for our data. We know how to select all of the dates from the Visited table:

SELECT dated FROM Visited;

but to combine them, we must use an aggregation function such as min or max. Each of these functions takes a set of records as input, and produces a single record as output:

SELECT min(dated) FROM Visited;

SQL Aggregation.

SELECT max(dated) FROM Visited;

min and max are just two of the aggregation functions built into SQL. Three others are avg, count, and sum:

SELECT avg(reading) FROM Survey WHERE quant = 'sal';
SELECT count(reading) FROM Survey WHERE quant = 'sal';
SELECT sum(reading) FROM Survey WHERE quant = 'sal';

We used count(reading) here, but we could just as easily have counted quant or any other field in the table, or even used count(*), since the function doesn’t care about the values themselves, just how many values there are.

SQL lets us do several aggregations at once. We can, for example, find the range of sensible salinity measurements:

SELECT min(reading), max(reading) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;

We can also combine aggregated results with raw results, although the output might surprise you:

SELECT person, count(*) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;

Why does Lake’s name appear rather than Roerich’s or Dyer’s? The answer is that when it has to aggregate a field, but isn’t told how to, the database manager chooses an actual value from the input set. It might use the first one processed, the last one, or something else entirely.

Another important fact is that when there are no values to aggregate — for example, where there are no rows satisfying the WHERE clause — aggregation’s result is “don’t know” rather than zero or some other arbitrary value:

SELECT person, max(reading), sum(reading) FROM Survey WHERE quant = 'missing';

One final important feature of aggregation functions is that they are inconsistent with the rest of SQL in a very useful way. If we add two values, and one of them is null, the result is null. By extension, if we use sum to add all the values in a set, and any of those values are null, the result should also be null. It’s much more useful, though, for aggregation functions to ignore null values and only combine those that are non-null. This behavior lets us write our queries as:

SELECT min(dated) FROM Visited;

instead of always having to filter explicitly:

SELECT min(dated) FROM Visited WHERE dated IS NOT NULL;

Aggregating all records at once doesn’t always make sense. For example, suppose we suspect that there is a systematic bias in our data, and that some scientists’ radiation readings are higher than others. We know that this doesn’t work:

SELECT person, count(reading), round(avg(reading), 2)
FROM  Survey
WHERE quant = 'rad';

because the database manager selects a single arbitrary scientist’s name rather than aggregating separately for each scientist. Since there are only five scientists, we could write five queries of the form:

SELECT person, count(reading), round(avg(reading), 2)
FROM  Survey
WHERE quant = 'rad'
AND   person = 'dyer';

but this would be tedious, and if we ever had a data set with fifty or five hundred scientists, the chances of us getting all of those queries right is small.

What we need to do is tell the database manager to aggregate the hours for each scientist separately using a GROUP BY clause:

SELECT   person, count(reading), round(avg(reading), 2)
FROM     Survey
WHERE    quant = 'rad'
GROUP BY person;

GROUP BY does exactly what its name implies: groups all the records with the same value for the specified field together so that aggregation can process each batch separately. Since all the records in each batch have the same value for person, it no longer matters that the database manager is picking an arbitrary one to display alongside the aggregated reading values.

GROUP BY is basically just a pivot table for Excel users, it lets you build nice summary tables which aggregate your results.

And if you didn’t already know the Excel equivalent, now you know what to look for when you need it!

Just as we can sort by multiple criteria at once, we can also group by multiple criteria. To get the average reading by scientist and quantity measured, for example, we just add another field to the GROUP BY clause:

SELECT   person, quant, count(reading), round(avg(reading), 2)
FROM     Survey
GROUP BY person, quant;

Note that we have added quant to the list of fields displayed, since the results wouldn’t make much sense otherwise.

Let’s go one step further and remove all the entries where we don’t know who took the measurement:

SELECT   person, quant, count(reading), round(avg(reading), 2)
FROM     Survey
WHERE    person IS NOT NULL
GROUP BY person, quant
ORDER BY person, quant;

Looking more closely, this query:

  1. selected records from the Survey table where the person field was not null;
  2. grouped those records into subsets so that the person and quant values in each subset were the same;
  3. ordered those subsets first by person, and then within each sub-group by quant; and
  4. counted the number of records in each subset, calculated the average reading in each, and chose a person and quant value from each (it doesn’t matter which ones, since they’re all equal).
Question: Counting Temperature Readings

How many temperature readings did Frank Pabodie record, and what was their average value?

SELECT count(reading), avg(reading) FROM Survey WHERE quant = 'temp' AND person = 'pb';
count(reading) avg(reading)
2 -20.0
-- Try solutions here!

Averaging with NULL

The average of a set of values is the sum of the values divided by the number of values. Does this mean that the avg function returns 2.0 or 3.0 when given the values 1.0, null, and 5.0?

The answer is 3.0. NULL is not a value; it is the absence of a value. As such it is not included in the calculation.

You can confirm this, by executing this code:

SELECT AVG(a) FROM (
    SELECT 1 AS a
    UNION ALL SELECT NULL
    UNION ALL SELECT 5);
-- Try solutions here!
Question: What Does This Query Do?

We want to calculate the difference between each individual radiation reading and the average of all the radiation readings. We write the query:

SELECT reading - avg(reading) FROM Survey WHERE quant = 'rad';

What does this actually produce, and can you think of why?

The query produces only one row of results when we what we really want is a result for each of the readings. The avg() function produces only a single value, and because it is run first, the table is reduced to a single row. The reading value is simply an arbitrary one.

To achieve what we wanted, we would have to run two queries:

SELECT avg(reading) FROM Survey WHERE quant='rad';

This produces the average value (6.5625), which we can then insert into a second query:

SELECT reading - 6.5625 FROM Survey WHERE quant = 'rad';

This produces what we want, but we can combine this into a single query using subqueries.

SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') FROM Survey WHERE quant = 'rad';

This way we don’t have execute two queries.

In summary what we have done is to replace avg(reading) with (SELECT avg(reading) FROM Survey WHERE quant='rad') in the original query.

-- Try solutions here!
Question: Ordering When Concatenating

The function group_concat(field, separator) concatenates all the values in a field using the specified separator character (or ‘,’ if the separator isn’t specified). Use this to produce a one-line list of scientists’ names, such as:

William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth

Can you find a way to order the list by surname?

-- Try solutions here!

Combining Data

In order to submit our data to a web site that aggregates historical meteorological data, we might need to format it as latitude, longitude, date, quantity, and reading. However, our latitudes and longitudes are in the Site table, while the dates of measurements are in the Visited table and the readings themselves are in the Survey table. We need to combine these tables somehow.

This figure shows the relations between the tables:

Survey Database Structure.

The SQL command to do this is JOIN. To see how it works, let’s start by joining the Site and Visited tables:

SELECT * FROM Site JOIN Visited;

JOIN creates the cross product of two tables, i.e., it joins each record of one table with each record of the other table to give all possible combinations. Since there are three records in Site and eight in Visited, the join’s output has 24 records (3 * 8 = 24) . And since each table has three fields, the output has six fields (3 + 3 = 6).

What the join hasn’t done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we’re only interested in combinations that have the same site name, thus we need to use a filter:

SELECT * FROM Site JOIN Visited ON Site.name = Visited.site;

ON is very similar to WHERE, and for all the queries in this lesson you can use them interchangeably. There are differences in how they affect outer joins, but that’s beyond the scope of this lesson. Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.

Notice that we used Table.field to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we’re talking about. For example, if we joined the Person and Visited tables, the result would inherit a field called id from each of the original tables.

We can now use the same dotted notation to select the three columns we actually want out of our join:

SELECT Site.lat, Site.long, Visited.dated
FROM   Site JOIN Visited
ON     Site.name = Visited.site;

If joining two tables is good, joining many tables must be better. In fact, we can join any number of tables simply by adding more JOIN clauses to our query, and more ON tests to filter out combinations of records that don’t make sense:

SELECT Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
FROM   Site JOIN Visited JOIN Survey
ON     Site.name = Visited.site
AND    Visited.id = Survey.taken
AND    Visited.dated IS NOT NULL;

We can tell which records from Site, Visited, and Survey correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. Another way of saying this is that a foreign key is the primary key of one table that appears in some other table. In our database, Person.id is the primary key in the Person table, while Survey.person is a foreign key relating the Survey table’s entries to entries in Person.

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like “student numbers” and “patient numbers”, and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they’re added to tables, and we can use those record numbers in queries:

SELECT rowid, * FROM Person;
Question: Listing Radiation Readings

Write a query that lists all radiation readings from the DR-1 site.

SELECT Survey.reading
FROM Site JOIN Visited JOIN Survey
ON Site.name = Visited.site
AND Visited.id = Survey.taken
WHERE Site.name = 'DR-1'
AND Survey.quant = 'rad';
reading
9.82
7.8
11.25
-- Try solutions here!
Question: Where's Frank?

Write a query that lists all sites visited by people named “Frank”.

SELECT DISTINCT Site.name
FROM Site JOIN Visited JOIN Survey JOIN Person
ON Site.name = Visited.site
AND Visited.id = Survey.taken
AND Survey.person = Person.id
WHERE Person.personal = 'Frank';
name
DR-3
-- Try solutions here!
Question: Reading Queries

Describe in your own words what the following query produces:

SELECT Site.name FROM Site JOIN Visited
ON Site.lat < -49.0 AND Site.name = Visited.site AND Visited.dated >= '1932-01-01';
-- Try solutions here!
Question: Who Has Been Where?

Write a query that shows each site with exact location (lat, long) ordered by visited date, followed by personal name and family name of the person who visited the site and the type of measurement taken and its reading. Please avoid all null values. Tip: you should get 15 records with 8 fields.

SELECT Site.name, Site.lat, Site.long, Person.personal, Person.family, Survey.quant, Survey.reading, Visited.dated
FROM Site JOIN Visited JOIN Survey JOIN Person
ON Site.name = Visited.site
AND Visited.id = Survey.taken
AND Survey.person = Person.id
WHERE Survey.person IS NOT NULL
AND Visited.dated IS NOT NULL
ORDER BY Visited.dated;
name lat long personal family quant reading dated
DR-1 -49.85 -128.57 William Dyer rad 9.82 1927-02-08
DR-1 -49.85 -128.57 William Dyer sal 0.13 1927-02-08
DR-1 -49.85 -128.57 William Dyer rad 7.8 1927-02-10
DR-1 -49.85 -128.57 William Dyer sal 0.09 1927-02-10
DR-3 -47.15 -126.72 Anderson Lake sal 0.05 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie rad 8.41 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie temp -21.5 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie rad 7.22 1930-01-12
DR-3 -47.15 -126.72 Anderson Lake sal 0.1 1930-02-26
DR-3 -47.15 -126.72 Frank Pabodie rad 4.35 1930-02-26
DR-3 -47.15 -126.72 Frank Pabodie temp -18.5 1930-02-26
MSK-4 -48.87 -123.4 Anderson Lake rad 1.46 1932-01-14
MSK-4 -48.87 -123.4 Anderson Lake sal 0.21 1932-01-14
MSK-4 -48.87 -123.4 Valentina Roerich sal 22.5 1932-01-14
DR-1 -49.85 -128.57 Valentina Roerich rad 11.25 1932-03-22
-- Try solutions here!

A good visual explanation of joins can be found in the SQL Join Visualizer

Data Hygiene

Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don’t have to use substring operations to get the name’s components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like “Eloise St. Cyr” or “Jan Mikkel Steubart”.

The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the id field in the Person table), or even a combination of values: the triple (taken, person, quant) from the Survey table uniquely identifies every measurement.

The third rule is that there should be no redundant information. For example, we could get rid of the Site table and rewrite the Visited table like this:

id lat long dated
619 -49.85 -128.57 1927-02-08
622 -49.85 -128.57 1927-02-10
734 -47.15 -126.72 1930-01-07
735 -47.15 -126.72 1930-01-12
751 -47.15 -126.72 1930-02-26
752 -47.15 -126.72 None
837 -48.87 -123.40 1932-01-14
844 -49.85 -128.57 1932-03-22

In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it’s very hard to keep data organized this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What’s worse, we may have to guess which records to change, since other sites may also have been visited on that date.

The fourth rule is that the units for every value should be stored explicitly. Our database doesn’t do this, and that’s a problem: Roerich’s salinity measurements are several orders of magnitude larger than anyone else’s, but we don’t know if that means she was using parts per million instead of parts per thousand, or whether there actually was a saline anomaly at that site in 1932.

Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it’s efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently. As anthropologists say, the tool shapes the hand that shapes the tool.

Question: Identifying Atomic Values

Which of the following are atomic values? Which are not? Why?

  • New Zealand
  • 87 Turing Avenue
  • January 25, 1971
  • the XY coordinate (0.5, 3.3)

New Zealand is the only clear-cut atomic value.

The address and the XY coordinate contain more than one piece of information which should be stored separately:

  • House number, street name
  • X coordinate, Y coordinate

The date entry is less clear cut, because it contains month, day, and year elements. However, there is a DATE datatype in SQL, and dates should be stored using this format. If we need to work with the month, day, or year separately, we can use the SQL functions available for our database software (for example EXTRACT or STRFTIME for SQLite).

-- Try solutions here!
Question: Identifying a Primary Key

What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?

latitude longitude date temperature
57.3 -22.5 2015-01-09 -14.2

Latitude, longitude, and date are all required to uniquely identify the temperature record.

-- Try solutions here!

Creating and Modifying Data

So far we have only looked at how to get information out of a database, both because that is more frequent than adding information, and because most other operations only make sense once queries are understood. If we want to create and modify data, we need to know two other sets of commands.

The first pair are CREATE TABLE and DROP TABLE. While they are written as two words, they are actually single commands. The first one creates a new table; its arguments are the names and types of the table’s columns. For example, the following statements create the four tables in our survey database:

CREATE TABLE Person(id text, personal text, family text);
CREATE TABLE Site(name text, lat real, long real);
CREATE TABLE Visited(id integer, site text, dated text);
CREATE TABLE Survey(taken integer, person text, quant text, reading real);

We can get rid of one of our tables using:

DROP TABLE Survey;

Be very careful when doing this: if you drop the wrong table, hope that the person maintaining the database has a backup, but it’s better not to have to rely on it.

Different database systems support different data types for table columns, but most provide the following:

data type use
INTEGER a signed integer
REAL a floating point number
TEXT a character string
BLOB a “binary large object”, such as an image

Most databases also support Booleans and date/time values; SQLite uses the integers 0 and 1 for the former, and represents the latter as text or numeric fields.

An increasing number of databases also support geographic data types, such as latitude and longitude. Keeping track of what particular systems do or do not offer, and what names they give different data types, is an unending portability headache.

SQLite is fantastic for small databases or embedded into applications where you want to be able to use SQL to query and process data.

However for any real analysis PostgreSQL is usually the best choice, it scales incredibly well and can meet a wide range of use cases. It has good data type support.

Use Postgres. The PostGIS library is fantastic and industry standard for storing geographic data in a database.

When we create a table, we can specify several kinds of constraints on its columns. For example, a better definition for the Survey table would be:

CREATE TABLE Survey(
    taken   integer not null, -- where reading taken
    person  text,             -- may not know who took it
    quant   text not null,    -- the quantity measured
    reading real not null,    -- the actual reading
    primary key(taken, quant),
    foreign key(taken) references Visited(id),
    foreign key(person) references Person(id)
);

Once again, exactly what constraints are available and what they’re called depends on which database manager we are using.

Once tables have been created, we can add, change, and remove records using our other set of commands, INSERT, UPDATE, and DELETE.

Here is an example of inserting rows into the Site table:

INSERT INTO Site (name, lat, long) VALUES ('DR-1', -49.85, -128.57);
INSERT INTO Site (name, lat, long) VALUES ('DR-3', -47.15, -126.72);
INSERT INTO Site (name, lat, long) VALUES ('MSK-4', -48.87, -123.40);

We can also insert values into one table directly from another:

CREATE TABLE JustLatLong(lat real, long real);
INSERT INTO JustLatLong SELECT lat, long FROM Site;

Modifying existing records is done using the UPDATE statement. To do this we tell the database which table we want to update, what we want to change the values to for any or all of the fields, and under what conditions we should update the values.

For example, if we made a mistake when entering the lat and long values of the last INSERT statement above, we can correct it with an update:

UPDATE Site SET lat = -47.87, long = -122.40 WHERE name = 'MSK-4';

Be careful to not forget the WHERE clause or the update statement will modify all of the records in the database.

Deleting records can be a bit trickier, because we have to ensure that the database remains internally consistent. If all we care about is a single table, we can use the DELETE command with a WHERE clause that matches the records we want to discard. For example, once we realize that Frank Danforth didn’t take any measurements, we can remove him from the Person table like this:

DELETE FROM Person WHERE id = 'danforth';

But what if we removed Anderson Lake instead? Our Survey table would still contain seven records of measurements he’d taken, but that’s never supposed to happen: Survey.person is a foreign key into the Person table, and all our queries assume there will be a row in the latter matching every value in the former.

This problem is called referential integrity: we need to ensure that all references between tables can always be resolved correctly. One way to do this is to delete all the records that use 'lake' as a foreign key before deleting the record that uses it as a primary key. If our database manager supports it, we can automate this using cascading delete. However, this technique is outside the scope of this chapter.

Many applications use a hybrid storage model instead of putting everything into a database: the actual data (such as astronomical images) is stored in files, while the database stores the files’ names, their modification dates, the region of the sky they cover, their spectral characteristics, and so on. This is also how most music player software is built: the database inside the application keeps track of the MP3 files, but the files themselves live on disk.

Question: Replacing NULL

Write an SQL statement to replace all uses of null in Survey.person with the string 'unknown'.

UPDATE Survey SET person = 'unknown' WHERE person IS NULL;
-- Try solutions here!
Question: Backing Up with SQL

SQLite has several administrative commands that aren’t part of the SQL standard. One of them is .dump, which prints the SQL commands needed to re-create the database. Another is .read, which reads a file created by .dump and restores the database. A colleague of yours thinks that storing dump files (which are text) in version control is a good way to track and manage changes to the database. What are the pros and cons of this approach? (Hint: records aren’t stored in any particular order.)

Advantages

  • A version control system will be able to show differences between versions of the dump file; something it can’t do for binary files like databases
  • A VCS only saves changes between versions, rather than a complete copy of each version (save disk space)
  • The version control log will explain the reason for the changes in each version of the database

Disadvantages

  • Artificial differences between commits because records don’t have a fixed order
-- Try solutions here!