Improving inferred query column names
Hi,
A common annoyance when writing ad-hoc analytics queries is column naming once
aggregates are used.
Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writes
Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum
So i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;
Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.
I think on a green field it'd be clearly better to do something like the
above. What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names
Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...
Greetings,
Andres Freund
That is a good idea for simple cases, I'm just curious how it would look
like for more complex cases (you can have all kinds of expressions as
parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not
very helpful.
Wouldn't it make more sense to just deduplicate the names by adding
numerical postfixes, like sum_1, sum_2?
For backwards compatibility I guess you can have a GUC flag controlling
that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with the
flag) once the current version becomes unsupported.
(or with a different deprecation policy, I'm not sure what is the general
rule for breaking changes and deprecation currently).
If there is a clearly defined deprecation policy and a backwards
compatibility option, it should be good, no? Just my 2 cents.
-Vladimir Churyukin
On Sat, Feb 11, 2023 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
Show quoted text
Hi,
A common annoyance when writing ad-hoc analytics queries is column naming
once
aggregates are used.Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writesNot useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sumSo i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM
pg_stat_io;Of course we can't infer useful column names for everything, but for
something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain
column
is the argument.I think on a green field it'd be clearly better to do something like the
above. What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying
on
inferred column namesCan anybody think of a good way out of that? It's not like that problem is
going to go away at some point...Greetings,
Andres Freund
On Sat, Feb 11, 2023 at 3:47 PM Vladimir Churyukin <vladimir@churyukin.com>
wrote:
For backwards compatibility I guess you can have a GUC flag controlling
that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with
the flag) once the current version becomes unsupported.
Seems more like a per-session setting than a GUC.
Here's a suggestion off the top of my head.
We create a session setting inferred_column_name_template.
The template takes a formatting directive %N which is just a counter
SET inferred_column_name_template = 'col_%N'
which would give you col_1, col_2, regardless of what kind of expression
the columns were
We could introduce another directive, %T
SET inferred_column_name_template = '%T_%N'
which prints the datatype short name of the column. In this case, %N would
increment per datatype, so text_1, integer_1, text_2, timestamptz_1, text_3
Getting fancier, we could introduce something less datatype centric, %F
SET inferred_column_name_template = '%F_%N'
Which would walk the following waterfall and stop on the first match
1. The datatype short name if the expression is explicitly casted
(either CAST or ::)
2. the name of the function if the outermost expression was a function
(aggregate, window, or scalar), so sum_1, substr_1
3. 'case' if the outermost expression was case
4. 'expr' if the expression was effectively an operator ( SELECT 3+4,
'a' || 'b' etc)
5. the datatype short name for anything that doesn't match any of the
previous, and for explicit casts
Keeping track of all the %N counters could get silly, so maybe a %P which
is simply the numeric column position of the column, so your result set
would go like: id, name, col_3, last_login, col_5.
We would have to account for the case where the user left either %N or %P
out of the template, so one of them would be an implied suffix if both were
absent, or we maybe go with
SET inferred_column_name_prefix = '%F_';
SET inferred_column_name_counter = 'position'; /* position, counter,
per_type_counter */
Or we just cook up a few predefined naming schemes, and let the user pick
from those.
One caution I have is that I have seen several enterprise app database
designs that have lots of user-customizable columns with names like
varchar1, numeric4, etc. Presumably the user would know their environment
and not pick a confusing template.
On 11.02.23 20:24, Andres Freund wrote:
Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sumSo i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.I think on a green field it'd be clearly better to do something like the
above. What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column namesCan anybody think of a good way out of that? It's not like that problem is
going to go away at some point...
I think we should just do it and not care about what breaks. There has
never been any guarantee about these.
FWIW, "most" other SQL implementations appear to generate column names like
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"
(various capitalization of course).
We had a colleague look into this a little while ago, and it got pretty
tedious to implement this for all the expression types. And, you know,
the bikeshedding ...
But I'm all in favor of improving this.
On Mon, Feb 20, 2023 at 8:08 AM Peter Eisentraut <
peter.eisentraut@enterprisedb.com> wrote:
On 11.02.23 20:24, Andres Freund wrote:
I think on a green field it'd be clearly better to do something like the
above. What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applicationsrelying on
inferred column names
Can anybody think of a good way out of that? It's not like that problem
is
going to go away at some point...
I think we should just do it and not care about what breaks. There has
never been any guarantee about these.
I'm going to toss a -1 into the ring but if this does go through a strong
request that it be disabled via a GUC. The ugliness of that option is why
we shouldn't do this.
Defacto reality is still a reality we are on the hook for.
I too find the legacy design choice to be annoying but not so much that
changing it seems like a good idea.
David J.
On 20.02.23 16:17, David G. Johnston wrote:
I think we should just do it and not care about what breaks. There has
never been any guarantee about these.I'm going to toss a -1 into the ring but if this does go through a
strong request that it be disabled via a GUC. The ugliness of that
option is why we shouldn't do this.Defacto reality is still a reality we are on the hook for.
I too find the legacy design choice to be annoying but not so much that
changing it seems like a good idea.
Well, a small backward compatibility GUC might not be too cumbersome.
Hi,
On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
On 11.02.23 20:24, Andres Freund wrote:
I think we should just do it and not care about what breaks. There has
never been any guarantee about these.FWIW, "most" other SQL implementations appear to generate column names like
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"
Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns. sum_reads imo is more usable
than than "SUM(reads)".
We had a colleague look into this a little while ago, and it got pretty
tedious to implement this for all the expression types.
Hm, any chance that colleague could be pointed at this discussion and chime
in? It doesn't immediately look that hard to do substantially better than
today. Of course there's an approximately endless amount of effort that could
be poured into this, but even some fairly basic improvements seem like a big
win.
And, you know, the bikeshedding ...
Indeed. I already started above :)
Greetings,
Andres Freund
Hi,
On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
That is a good idea for simple cases, I'm just curious how it would look
like for more complex cases (you can have all kinds of expressions as
parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not
very helpful.
I don't think it needs to be perfect to be helpful.
Wouldn't it make more sense to just deduplicate the names by adding
numerical postfixes, like sum_1, sum_2?
That'd be considerably worse than what we do today imo, because any reordering
/ added aggregate would lead to everything else changing as well.
Greetings,
Andres Freund
On Wed, Feb 22, 2023, 12:40 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
That is a good idea for simple cases, I'm just curious how it would look
like for more complex cases (you can have all kinds of expressions as
parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not
very helpful.I don't think it needs to be perfect to be helpful.
It doesn't need to be perfect, but it needs to be consistent. So far you
proposed a rule to replace () with _. What is the plan for expressions, how
to convert them to names (with deduplication I guess?, because there could
be 2 similar expressions mapped to the same name potentially).
Wouldn't it make more sense to just deduplicate the names by adding
numerical postfixes, like sum_1, sum_2?That'd be considerably worse than what we do today imo, because any
reordering
/ added aggregate would lead to everything else changing as well.
Ok, that I kinda agree with. Not necessarily worse overall, but worse for
some cases. Well, the proposal above about keeping the names exactly the
same as the full expressions is probably the best we can do then. It will
take care of possible duplications and won't be position-sensitive. And
will be consistent. The only issue is somewhat unusual column names that
you will have to use quotes to refer to. But is that a real issue?
-Vladimir Churyukin
Show quoted text
Vladimir Churyukin <vladimir@churyukin.com> writes:
It doesn't need to be perfect, but it needs to be consistent. So far you
proposed a rule to replace () with _. What is the plan for expressions, how
to convert them to names (with deduplication I guess?, because there could
be 2 similar expressions mapped to the same name potentially).
I do not think we need to do anything for arbitrary expressions.
The proposal so far was just to handle a function call wrapped
around something else by converting to the function name followed
by whatever we'd emit for the something else. You cannot realistically
handle, say, operator expressions without emitting names that will
require quoting, which doesn't seem attractive.
And no, deduplication isn't on the table at all here.
regards, tom lane
Hi,
On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
Vladimir Churyukin <vladimir@churyukin.com> writes:
It doesn't need to be perfect, but it needs to be consistent. So far you
proposed a rule to replace () with _. What is the plan for expressions, how
to convert them to names (with deduplication I guess?, because there could
be 2 similar expressions mapped to the same name potentially).I do not think we need to do anything for arbitrary expressions.
Exactly. It's not like they have a useful name today. Nor are they unique.
The proposal so far was just to handle a function call wrapped
around something else by converting to the function name followed
by whatever we'd emit for the something else.
Just to showcase that better, what I think we're discussing is changing:
SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────┬────────┬──────────┐
│ sum │ sum │ ?column? │
├──────┼────────┼──────────┤
│ 2774 │ 257896 │ 2 │
└──────┴────────┴──────────┘
(1 row)
to
SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────────────┬───────────────┬──────────┐
│ sum_relpages │ sum_reltuples │ ?column? │
├──────────────┼───────────────┼──────────┤
│ 2774 │ 257896 │ 2 │
└──────────────┴───────────────┴──────────┘
(1 row)
You cannot realistically
handle, say, operator expressions without emitting names that will
require quoting, which doesn't seem attractive.
Well, it doesn't require much to be better than "?column?", which already
requires quoting...
We could just do something like printing <left>_<funcname>_<right>. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.
Whether that's worth it, or whether column name lengths would be too painful,
IDK.
And no, deduplication isn't on the table at all here.
+1
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
The proposal so far was just to handle a function call wrapped
around something else by converting to the function name followed
by whatever we'd emit for the something else.
SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────────────┬───────────────┬──────────┐
│ sum_relpages │ sum_reltuples │ ?column? │
├──────────────┼───────────────┼──────────┤
So far so good, but what about multi-argument functions?
Do we do "f_x_y_z", and truncate wherever? How well will this
work with nested function calls?
You cannot realistically
handle, say, operator expressions without emitting names that will
require quoting, which doesn't seem attractive.
Well, it doesn't require much to be better than "?column?", which already
requires quoting...
I think the point of "?column?" is to use something that nobody's going
to want to reference that way, quoted or otherwise. The SQL spec says
(in SQL:2021, it's 7.16 <query specification> syntax rule 18) that if the
column expression is anything more complex than a simple column reference
(or SQL parameter reference, which I think we don't support) then the
column name is implementation-dependent, which is standards-ese for
"here be dragons".
BTW, SQL92 and SQL99 had a further constraint:
c) Otherwise, the <column name> of the i-th column of the <query
specification> is implementation-dependent and different
from the <column name> of any column, other than itself, of
a table referenced by any <table reference> contained in the
SQL-statement.
We never tried to implement that literally, and now I'm glad we didn't
bother, because recent spec versions only say "implementation-dependent",
full stop. In any case, the spec is clearly in the camp of "don't depend
on these column names".
We could just do something like printing <left>_<funcname>_<right>. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.
Whether that's worth it, or whether column name lengths would be too painful,
IDK.
I think you'd soon be hitting NAMEDATALEN limits ...
And no, deduplication isn't on the table at all here.
+1
I remembered while looking at the spec that duplicate column names
in SELECT output are not only allowed but *required* by the spec.
If you write, say, "SELECT 1 AS x, 2 AS x, ..." then the column
names of those two columns are both "x", no wiggle room at all.
So I see little point in trying to deduplicate generated names,
even aside from the points you made.
regards, tom lane
On 2/22/23 23:03, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
We could just do something like printing <left>_<funcname>_<right>. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.
Whether that's worth it, or whether column name lengths would be too painful,
IDK.I think you'd soon be hitting NAMEDATALEN limits ...
<flameproof_suit>
Probably an unpalatable idea, but if we did something like
md5('avg(reltuples / relpages)') for the column name, it would be
(reasonably) unique and deterministic. Not pretty, but possibly useful
in some cases.
</flameproof_suit>
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 22.02.23 21:38, Andres Freund wrote:
On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
On 11.02.23 20:24, Andres Freund wrote:
I think we should just do it and not care about what breaks. There has
never been any guarantee about these.FWIW, "most" other SQL implementations appear to generate column names like
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns. sum_reads imo is more usable
than than "SUM(reads)".
If you want something without special characters, the example you gave
is manageable, but what are you going to do with
SELECT a, b, a * b, a / b FROM ...
or
SELECT a, b, SUM(a * b) FROM ...
and so on. What would be the actual rule to produce the output you want?
I think a question here is what "usable" means in this context.
If you want a name that you can refer to (in a client API, for example),
you should give it a name explicitly.
I think the uses for the automatic names are that they look pretty and
meaningful in visual output (psql, pgadmin, etc.). In that context, I
think it is ok to use special characters without limitation, since you
are just going to look at the thing, not type it back in.