Group by primary key

When GROUPing an SQL relation, additional constraints are applied to what columns can be used elsewhere in the query. Grouping keys can be referred to plainly, while all other columns can only appear within aggregation functions.

SELECT status, SUM(total) FROM invoices GROUP BY status
-- OK

SELECT status, total FROM invoices GROUP BY status
-- ERROR: column "invoices.total" must appear in the GROUP BY clause or be used in an aggregate function

This limitation is a logical consequence of how GROUP operates. Grouping means that the result will contain exactly one row for each different value of the grouping key (status in example above). This means that multiple input rows are grouped into a single row. And for all columns other than what appears in the grouping key, it is unclear which value should be picked.

 id | status | total
----+--------+-------  _             status | total
  4 | done   |   7.4    \_           -------+-------
                         _ GROUP =>  done   |  ???
  5 | done   |  4.86   _/   status

The interesting (and obscure) detail here is that this rule does not apply to columns whose primary key is in the grouping key.

SELECT id, total FROM invoices GROUP BY id
-- OK

This works because we know that total depends on id. In other words, there cannot be two different total values for the same id value. This implies that when GROUP BY id produces a row for each value of id, there is exactly one possible value of total, which can be used as the grouped value.

 id | status | total
----+--------+-------  _              id | total
  4 | done   |   7.4    \_           ----+-------
                         _ GROUP =>    4 |   7.4
  5 | done   |  4.86   _/   id         5 |  4.86

A reasonable objection to this feature is that “grouping by primary key is a no-op”. But this is not always true. Consider a case where rows of invoices are duplicated for some reason before the grouping operation:

SELECT
  invoices.id,
  invoices.total,
  MAX(items.price)
FROM invoices
LEFT JOIN items ON (items.invoice_id = invoices.id)
GROUP BY invoices.id

Here, we take invoices and join it with items on the invoice. This will multiply invoice rows for each item it has. But right after, we GROUP BY invoice.id. This now de-duplicates invoices back to one row per invoice and does some aggregation on items.

So this is a valid use-case of GROUP BY and it would be unreasonable to argue that this query should be rejected due to “column must appear in the GROUP BY clause”. But it would not be unreasonable to argue so due to this being an anti-pattern.

Regardless of validity of the use-case, I think that such overly-specific rules should rather be avoided in lieu of guidance toward a clearer solution to the problem. This guidance could be better error messages, additional hints or even a different mental model for querying relations.

But first, we need to know what that “clearer solution” would be.

In our example, the problem is that we are grouping both invoices and items, while we could be grouping just items:

WITH items_agg AS (
	SELECT invoice_id, MAX(items.price) as max
	FROM items
	GROUP BY invoice_id
) 
SELECT
  invoices.id,
  invoices.total,
  items_agg.max
FROM invoices
LEFT JOIN items_agg ON (items_agg.invoice_id = invoices.id)

… or using a LATERAL join:

SELECT
  invoices.id,
  invoices.total,
  items_agg.max
FROM invoices
LEFT JOIN LATERAL (
	SELECT MAX(items.price) as max
	FROM items
	WHERE invoice_id = invoices.id
) items_agg ON (TRUE)

Here, LATERAL implies that the sub-query should be evaluated for each parent row. This allow us to refer to the parent columns (invoices.id). It is analogous to the following JavaScript snippet:

const invoices = [...];
const items = [...];

invoices.map(invoice => {
	let max_price = (items
		.filter(item => item.invoice_id == invoice.id)
		.map(item => item.price)
		.reduce((a, b) => Math.max(a, b), 0)
	);

	return {id: invoice.id, total: invoice.total, max: max_price};
})

In my opinion, the JavaScript query structure is much clearer, because:

invoices.map(invoice => { ... }) shows that we are performing this operation for each invoice.
We are able to aggregate the items table down to a single value max_price. In SQL, JOIN LATERAL produced a relation that has to be JOINed back (albeit using a no-op TRUE condition).
The resulting object is constructed at the bottom of the query as opposed to SELECT clause at the top. This gives the query a “flow from top-to-bottom”.

I wish we would be able to run proper programming languages on relational databases. With “proper” I mean languages that don’t base their structure on the natural English language, but on simple logic rules.

Stepping back a bit, we wanted a way to guide users away from grouping by primary key. It seems like the “clearer solution” we were looking for is moving the grouping from the top-level query into a per-row scope.

To be honest, I don’t know how to put this into simple and clear terms so it could become an error message.

Maybe the solution is not to fix SQL, but start with a language that would have explicit scopes (such as invoice => { ... }) and a data-flow model that would force the user to think in terms of mapping operations.