Unclear guarantees about sort order on https://www.postgresql.org/docs/current/queries-order.html

Started by PG Bug reporting formover 2 years ago3 messagesdocs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/queries-order.html
Description:

The document only says this about unsorted queries:

After a query has produced an output table (after the select list has been

processed) it can optionally be sorted. If sorting is not chosen, the rows
will be returned in an unspecified order. The actual order in that case will
depend on the scan and join plan types and the order on disk, but it must
not be relied on. A particular output ordering can only be guaranteed if the
sort step is explicitly chosen.

It mentions "If sorting is not chosen". This sort of implies that if you
pick a sort the output order is predictable. However I believe that the only
actual guarantee is if the sort columns selected produce a unique value.

For example if you do `ORDER BY name` and have two rows with the same name I
don't think the order of those rows is predictable.

I think the docs should be updated to either:

1. Clearly state that the order **is** consent as long as any sort clause is
present, and specify what that order is.
2. Update the quoted sentence to refer to "If sorting is not chosen or the
sort columns do not form a unique key" instead of just "If sorting is not
chosen".

#2Erik Wienhold
ewie@ewie.name
In reply to: PG Bug reporting form (#1)
Re: Unclear guarantees about sort order on https://www.postgresql.org/docs/current/queries-order.html

On 2023-10-04 16:24 +0200, PG Doc comments form write:

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/queries-order.html
Description:

The document only says this about unsorted queries:

After a query has produced an output table (after the select list has been
processed) it can optionally be sorted. If sorting is not chosen, the rows
will be returned in an unspecified order. The actual order in that case will
depend on the scan and join plan types and the order on disk, but it must
not be relied on. A particular output ordering can only be guaranteed if the
sort step is explicitly chosen.

It mentions "If sorting is not chosen". This sort of implies that if you
pick a sort the output order is predictable. However I believe that the only
actual guarantee is if the sort columns selected produce a unique value.

For example if you do `ORDER BY name` and have two rows with the same name I
don't think the order of those rows is predictable.

Right, without an explicit sorting the order of rows is unpredictable.
When only sorting over some columns/expressions then the ordering is
only predictable in those columns/expressions. The order of rows with
ties within the same sorted group is also unpredictable. This is also
implied on the same page after the first example: "When more than one
expression is specified, the later values are used to sort rows that are
equal according to the earlier values." This implies that without later
values, those rows remain in unpredictable order.

The SQL standards linked by [1]https://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F provide some definitions:

SQL:2003, Part 2, Annex C, 26b:

"The relative ordering of two rows that are not distinct with respect to
the <sort specification> is implementation-dependent."

SQL:2011, Part 2, Annex C, 24e:

"If a <query expression> immediately contains an <order by clause> and
a group of two or more rows in the table specified by that <query
expression> contain values that are not distinct in all sort keys
specified in the <order by clause>, then the ordering of these rows in
that group is implementation-dependent."

I find the first one quite succinct.

I think the docs should be updated to either:

1. Clearly state that the order **is** consent as long as any sort clause is
present, and specify what that order is.

What do you mean with "any sort clause"? Any sort clause at all? Or a
sort clause that covers all columns? If the order should be predictable
it must be specified by the client to some degree. And if the client
does not care about rows with ties than he should not be required to
specify a more specific sorting.

2. Update the quoted sentence to refer to "If sorting is not chosen or the
sort columns do not form a unique key" instead of just "If sorting is not
chosen".

I think "unique key" is misleading in this case because sorting still
leaves duplicates. I'd go with something that mentions "sorted groups"
and "tie breaks" or some form of the quoted SQL:2003 definition.

[1]: https://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F

--
Erik

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Erik Wienhold (#2)
Re: Unclear guarantees about sort order on https://www.postgresql.org/docs/current/queries-order.html

On Wed, Oct 4, 2023 at 6:37 PM Erik Wienhold <ewie@ewie.name> wrote:

On 2023-10-04 16:24 +0200, PG Doc comments form write:

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/queries-order.html
Description:

The document only says this about unsorted queries:

After a query has produced an output table (after the select list has

been

processed) it can optionally be sorted. If sorting is not chosen, the

rows

will be returned in an unspecified order. The actual order in that

case will

depend on the scan and join plan types and the order on disk, but it

must

not be relied on. A particular output ordering can only be guaranteed

if the

sort step is explicitly chosen.

It mentions "If sorting is not chosen". This sort of implies that if you
pick a sort the output order is predictable. However I believe that the

only

actual guarantee is if the sort columns selected produce a unique value.

For example if you do `ORDER BY name` and have two rows with the same

name I

don't think the order of those rows is predictable.

"The relative ordering of two rows that are not distinct with respect to
the <sort specification> is implementation-dependent."

The OP is assuming a promise of a deterministic ordering of all output rows
and such a promise is only possible if the order by clause columns uniquely
identify every row in the output. This is because all the order by
promises is that output ordering will conform to the order by
specification, and indeed if it is under-specified such that multiple rows
match a given bin, then there is no deterministic relative ordering among
those rows.

I don't feel that the wording makes any such inference regarding
determinism of row output due to the mere presence of an order by clause.
Nor doesn't such determinism in the face of an under-specific clause even
make logical sense. I'm mostly inclined to leave the wording alone given
this single report. My only complaints are style-istic at this point.

That said, maybe a final sentence:

Assuming every output row can be uniquely identified by some subset of the
output columns, that subset must all be listed within the order by clause
if you wish to ensure a fully deterministic ordering.

David J.