5 minutes
Group by primary key
When GROUP
ing an SQL relation, additional constraints are applied to what
columns can be used elsewhere in the query. Grouping keys can be referred
to plainly, while all other columns can only appear within aggregation functions.
SELECT status, SUM(total) FROM invoices GROUP BY status
-- OK
SELECT status, total FROM invoices GROUP BY status
-- ERROR: column "invoices.total" must appear in the GROUP BY clause or be used in an aggregate function
This limitation is a logical consequence of how GROUP operates. Grouping means
that the result will contain exactly one row for each different value of the
grouping key (status
in example above). This means that multiple input rows
are grouped into a single row. And for all columns other than what appears in
the grouping key, it is unclear which value should be picked.
id | status | total
----+--------+------- _ status | total
4 | done | 7.4 \_ -------+-------
_ GROUP => done | ???
5 | done | 4.86 _/ status
The interesting (and obscure) detail here is that this rule does not apply to columns whose primary key is in the grouping key.
SELECT id, total FROM invoices GROUP BY id
-- OK
This works because we know that total
depends on id
. In other words, there
cannot be two different total
values for the same id
value. This implies that
when GROUP BY id
produces a row for each value of id
, there is exactly one
possible value of total
, which can be used as the grouped value.
id | status | total
----+--------+------- _ id | total
4 | done | 7.4 \_ ----+-------
_ GROUP => 4 | 7.4
5 | done | 4.86 _/ id 5 | 4.86
A reasonable objection to this feature is that “grouping by primary key is a
no-op”. But this is not always true. Consider a case where rows of invoices
are duplicated for some reason before the grouping operation:
SELECT
invoices.id,
invoices.total,
MAX(items.price)
FROM invoices
LEFT JOIN items ON (items.invoice_id = invoices.id)
GROUP BY invoices.id
Here, we take invoices
and join it with items
on the invoice. This will
multiply invoice rows for each item it has. But right after, we GROUP BY
invoice.id
. This now de-duplicates invoices
back to one row per invoice and
does some aggregation on items
.
So this is a valid use-case of GROUP BY
and it would be unreasonable
to argue that this query should be rejected due to “column must appear in the
GROUP BY clause”. But it would not be unreasonable to argue so due to this
being an anti-pattern.
Regardless of validity of the use-case, I think that such overly-specific rules should rather be avoided in lieu of guidance toward a clearer solution to the problem. This guidance could be better error messages, additional hints or even a different mental model for querying relations.
But first, we need to know what that “clearer solution” would be.
In our example, the problem is that we are grouping both invoices
and
items
, while we could be grouping just items
:
WITH items_agg AS (
SELECT invoice_id, MAX(items.price) as max
FROM items
GROUP BY invoice_id
)
SELECT
invoices.id,
invoices.total,
items_agg.max
FROM invoices
LEFT JOIN items_agg ON (items_agg.invoice_id = invoices.id)
… or using a LATERAL
join:
SELECT
invoices.id,
invoices.total,
items_agg.max
FROM invoices
LEFT JOIN LATERAL (
SELECT MAX(items.price) as max
FROM items
WHERE invoice_id = invoices.id
) items_agg ON (TRUE)
Here, LATERAL
implies that the sub-query should be evaluated for each parent
row. This allow us to refer to the parent columns (invoices.id
). It is
analogous to the following JavaScript snippet:
const invoices = [...];
const items = [...];
invoices.map(invoice => {
let max_price = (items
.filter(item => item.invoice_id == invoice.id)
.map(item => item.price)
.reduce((a, b) => Math.max(a, b), 0)
);
return {id: invoice.id, total: invoice.total, max: max_price};
})
In my opinion, the JavaScript query structure is much clearer, because:
invoices.map(invoice => { ... })
shows that we are performing this operation for each invoice.We are able to aggregate the items table down to a single value
max_price
. In SQL, JOIN LATERAL produced a relation that has to be JOINed back (albeit using a no-op TRUE condition).The resulting object is constructed at the bottom of the query as opposed to SELECT clause at the top. This gives the query a “flow from top-to-bottom”.
I wish we would be able to run proper programming languages on relational databases. With “proper” I mean languages that don’t base their structure on the natural English language, but on simple logic rules.
Stepping back a bit, we wanted a way to guide users away from grouping by primary key. It seems like the “clearer solution” we were looking for is moving the grouping from the top-level query into a per-row scope.
To be honest, I don’t know how to put this into simple and clear terms so it could become an error message.
Maybe the solution is not to fix SQL, but start with a language that would have
explicit scopes (such as invoice => { ... }
) and a data-flow model that would
force the user to think in terms of mapping operations.