5 minutes
Group by primary key
When GROUPing an SQL relation, additional constraints are applied to what
columns can be used elsewhere in the query. Grouping keys can be referred
to plainly, while all other columns can only appear within aggregation functions.
SELECT status, SUM(total) FROM invoices GROUP BY status
-- OK
SELECT status, total FROM invoices GROUP BY status
-- ERROR: column "invoices.total" must appear in the GROUP BY clause or be used in an aggregate function
This limitation is a logical consequence of how GROUP operates. Grouping means
that the result will contain exactly one row for each different value of the
grouping key (status in example above). This means that multiple input rows
are grouped into a single row. And for all columns other than what appears in
the grouping key, it is unclear which value should be picked.
 id | status | total
----+--------+-------  _             status | total
  4 | done   |   7.4    \_           -------+-------
                         _ GROUP =>  done   |  ???
  5 | done   |  4.86   _/   status
The interesting (and obscure) detail here is that this rule does not apply to columns whose primary key is in the grouping key.
SELECT id, total FROM invoices GROUP BY id
-- OK
This works because we know that total depends on id. In other words, there
cannot be two different total values for the same id value. This implies that
when GROUP BY id produces a row for each value of id, there is exactly one
possible value of total, which can be used as the grouped value.
 id | status | total
----+--------+-------  _              id | total
  4 | done   |   7.4    \_           ----+-------
                         _ GROUP =>    4 |   7.4
  5 | done   |  4.86   _/   id         5 |  4.86
A reasonable objection to this feature is that “grouping by primary key is a
no-op”. But this is not always true. Consider a case where rows of invoices
are duplicated for some reason before the grouping operation:
SELECT
  invoices.id,
  invoices.total,
  MAX(items.price)
FROM invoices
LEFT JOIN items ON (items.invoice_id = invoices.id)
GROUP BY invoices.id
Here, we take invoices and join it with items on the invoice. This will
multiply invoice rows for each item it has. But right after, we GROUP BY
invoice.id. This now de-duplicates invoices back to one row per invoice and
does some aggregation on items.
So this is a valid use-case of GROUP BY and it would be unreasonable
to argue that this query should be rejected due to “column must appear in the
GROUP BY clause”. But it would not be unreasonable to argue so due to this
being an anti-pattern.
Regardless of validity of the use-case, I think that such overly-specific rules should rather be avoided in lieu of guidance toward a clearer solution to the problem. This guidance could be better error messages, additional hints or even a different mental model for querying relations.
But first, we need to know what that “clearer solution” would be.
In our example, the problem is that we are grouping both invoices and
items, while we could be grouping just items:
WITH items_agg AS (
	SELECT invoice_id, MAX(items.price) as max
	FROM items
	GROUP BY invoice_id
) 
SELECT
  invoices.id,
  invoices.total,
  items_agg.max
FROM invoices
LEFT JOIN items_agg ON (items_agg.invoice_id = invoices.id)
… or using a LATERAL join:
SELECT
  invoices.id,
  invoices.total,
  items_agg.max
FROM invoices
LEFT JOIN LATERAL (
	SELECT MAX(items.price) as max
	FROM items
	WHERE invoice_id = invoices.id
) items_agg ON (TRUE)
Here, LATERAL implies that the sub-query should be evaluated for each parent
row. This allow us to refer to the parent columns (invoices.id). It is
analogous to the following JavaScript snippet:
const invoices = [...];
const items = [...];
invoices.map(invoice => {
	let max_price = (items
		.filter(item => item.invoice_id == invoice.id)
		.map(item => item.price)
		.reduce((a, b) => Math.max(a, b), 0)
	);
	return {id: invoice.id, total: invoice.total, max: max_price};
})
In my opinion, the JavaScript query structure is much clearer, because:
- invoices.map(invoice => { ... })shows that we are performing this operation for each invoice.
- We are able to aggregate the items table down to a single value - max_price. In SQL, JOIN LATERAL produced a relation that has to be JOINed back (albeit using a no-op TRUE condition).
- The resulting object is constructed at the bottom of the query as opposed to SELECT clause at the top. This gives the query a “flow from top-to-bottom”. 
I wish we would be able to run proper programming languages on relational databases. With “proper” I mean languages that don’t base their structure on the natural English language, but on simple logic rules.
Stepping back a bit, we wanted a way to guide users away from grouping by primary key. It seems like the “clearer solution” we were looking for is moving the grouping from the top-level query into a per-row scope.
To be honest, I don’t know how to put this into simple and clear terms so it could become an error message.
Maybe the solution is not to fix SQL, but start with a language that would have
explicit scopes (such as invoice => { ... }) and a data-flow model that would
force the user to think in terms of mapping operations.
