3 minutes
Column names in SQL
Most modern databases use data model named Relational Algebra (RA), which defines a relation as a set of tuples and a few operation you can do on relations.
I want to point out a specific detail about the tuples: its fields (also known as attributes or columns) have a unique identifier within the tuple. In practical terms, columns have names.
This also holds for SQL, which is used to express RA in databases. But it might not be obvious that SQL requires all columns to have a name. For example, this is a valid query:
SELECT 1 AS a, 2
First column here serves as an example of a named column and the second is trying to express a named column. But if you execue this query on PostgreSQL, the result will be:
a | ?column? |
---|---|
1 | 2 |
When I first saw this, my assumption was that the empty columns in the result
are just formatted to show ?column?
and could might as well show <unnamed>
.
But this is not the case; this is actual inferred column name. Which implies that this name can be used to refer to the tuple field:
SELECT `?column?`
FROM (SELECT 1) t
Which is weird.
This artifact in language design comes duality of SQL’s design. It seems like it wants to adhere to RA formalization, but also be user friendly and infer column names in a lot of places.
And to be fair, it does succeed in that; many SQL expression automatically infer reasonable column names:
SELECT
t.a, -- a
pg_catalog.count(t.a) OVER (), -- count
t.a::text, -- a
a + 1 -- ?column?
FROM (SELECT 1 AS a) t;
… here, the columns will infer following names: a
, count
, a
, ?column?
.
First two make sense, since it makes sense name the column after the identifiers
or function name. It also make sense to not care about some operations such as
casting. And sometimes, you just cannot assume a reasonable name and default to
?column?
.
Now here comes my opinion: this behavior hard to unpredict and overall causes more harm than good.
First, infering names might lead to relations with duplicate names, which then cannot be queried and will fail with ambigious errors. This is even harder to debug when joining-in a relation that does not explcity define column names, but has a conflicting name infered.
It also causes hidden effects of editing SELECT projections. For example, when
one changes a projection from a::int
to a::int + 1
that edit changes infered
name from a
to ?column?
.
I would prefer these rules would be kept simple and infer column name only in the most simple case of a plain identifier. Anything more complex should result in a truly unnamed column.