[docs] revise ORDER BY documentation

Started by Dian Fayalmost 2 years ago1 messages

di@nmfay.com

almost 2 years ago

1 attachment(s)

I recently encountered some odd behavior with a query both selecting and
sorting by `random()`. When I posted about it on pgsql-bugs ^1, David
Johnston and Tom Lane provided some very detailed explanations as to
what was happening, but weren't sure whether or where information about
it could live comfortably in the docs. I think it's a useful addition;
it's not an everyday occurrence but I'm very much not the first person
to run into it. After a bit of looking, I think I've found a reasonable
location.

This patch revises
https://www.postgresql.org/docs/current/queries-order.html to discuss
sort expressions and options separately, and fits a caveat based on
Tom's suggested language (with an example) into the former section.

There are a few other minor tweaks included here:

- note that `*` is not an expression
- consolidate output column examples
- mention non-column sort expressions

I did write a query demonstrating the `group by` case Tom mentioned, but
expect that one's a lot less common.

1: /messages/by-id/CZHAF947QQQO.27MAUK2SVMBXW@nmfay.com

Attachments:

0001-Revise-ORDER-BY-documentation.patchtext/x-patch; charset=utf-8; name=0001-Revise-ORDER-BY-documentation.patchDownload

From cb336d5f5e5e23704e14f42eb09d4bf3f1c7e10e Mon Sep 17 00:00:00 2001
From: Dian M Fay <dian.m.fay@gmail.com>
Date: Sat, 2 Mar 2024 22:43:05 -0500
Subject: [PATCH] Revise ORDER BY documentation

Discuss sort expressions and options separately, and add clarifications
around `*` not being an expression, the use of non-column sort
expressions, and the potentially unexpected behavior of volatile
expressions appearing both in ORDER BY and in the select list.
---
 doc/src/sgml/queries.sgml | 161 +++++++++++++++++++++++---------------
 1 file changed, 99 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 648b283b06..066d0a17ab 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1822,75 +1822,112 @@ SELECT <replaceable>select_list</replaceable>
     ORDER BY <replaceable>sort_expression1</replaceable> <optional>ASC | DESC</optional> <optional>NULLS { FIRST | LAST }</optional>
              <optional>, <replaceable>sort_expression2</replaceable> <optional>ASC | DESC</optional> <optional>NULLS { FIRST | LAST }</optional> ...</optional>
 </synopsis>
-   The sort expression(s) can be any expression that would be valid in the
-   query's select list.  An example is:
+   When more than one expression is specified, the later values are used to
+   sort rows that are equal according to the earlier values.
+  </para>
+
+  <sect2 id="queries-order-sort-expressions">
+   <title>Types of Sort Expression</title>
+
+   <para>
+    A <replaceable>sort_expression</replaceable> can be any expression (see
+    <xref linkend="sql-expressions"/>; <literal>*</literal> is not an
+    expression) that would be valid in the query's select list.  This is most
+    often a column or an operation on a column belonging to a relation
+    referenced in the query, such as:
 <programlisting>
 SELECT a, b FROM table1 ORDER BY a + b, c;
 </programlisting>
-   When more than one expression is specified,
-   the later values are used to sort rows that are equal according to the
-   earlier values.  Each expression can be followed by an optional
-   <literal>ASC</literal> or <literal>DESC</literal> keyword to set the sort direction to
-   ascending or descending.  <literal>ASC</literal> order is the default.
-   Ascending order puts smaller values first, where
-   <quote>smaller</quote> is defined in terms of the
-   <literal>&lt;</literal> operator.  Similarly, descending order is
-   determined with the <literal>&gt;</literal> operator.
-    <footnote>
-     <para>
-      Actually, <productname>PostgreSQL</productname> uses the <firstterm>default B-tree
-      operator class</firstterm> for the expression's data type to determine the sort
-      ordering for <literal>ASC</literal> and <literal>DESC</literal>.  Conventionally,
-      data types will be set up so that the <literal>&lt;</literal> and
-      <literal>&gt;</literal> operators correspond to this sort ordering,
-      but a user-defined data type's designer could choose to do something
-      different.
-     </para>
-    </footnote>
-  </para>
+    Columns do not have to be projected in the select list to be used in
+    <literal>ORDER BY</literal>. Non-column expressions may also be used;
+    <literal>ORDER BY random()</literal> is a common technique to shuffle
+    output records.
+   </para>
 
-  <para>
-   The <literal>NULLS FIRST</literal> and <literal>NULLS LAST</literal> options can be
-   used to determine whether nulls appear before or after non-null values
-   in the sort ordering.  By default, null values sort as if larger than any
-   non-null value; that is, <literal>NULLS FIRST</literal> is the default for
-   <literal>DESC</literal> order, and <literal>NULLS LAST</literal> otherwise.
-  </para>
+   <para>
+    A <replaceable>sort_expression</replaceable> can also be the column label
+    or number of an output column.  The output column name must stand alone,
+    that is, it cannot be used in an expression.  For example:
+<programlisting>
+SELECT a + b AS sum, c FROM table1 ORDER BY sum;     -- sorts by a + b
+SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1;  -- sorts by a
+SELECT a + b AS sum, c FROM table1 ORDER BY sum + c; -- error
+</programlisting>
+    This restriction on expressions involving output columns is made to reduce
+    ambiguity.  There is still ambiguity if an <literal>ORDER BY</literal> item
+    is a simple name that could match either an output column name or a column
+    from the table expression.  The output column is used in such cases.  This
+    would only cause confusion if you use <literal>AS</literal> to rename an
+    output column to match some other table column's name.
+   </para>
 
-  <para>
-   Note that the ordering options are considered independently for each
-   sort column.  For example <literal>ORDER BY x, y DESC</literal> means
-   <literal>ORDER BY x ASC, y DESC</literal>, which is not the same as
-   <literal>ORDER BY x DESC, y DESC</literal>.
-  </para>
+   <para>
+    When a <replaceable>sort_expression</replaceable> or grouping expression
+    (see <xref linkend="queries-group"/>) matches an expression or
+    subexpression in the select list, the query parser may interpret the former
+    as a consistent value even if the expression or subexpression is volatile.
+     This can cause unexpected behavior, for example with
+    <literal>random()</literal>:
+<programlisting>
+SELECT s, random(), random() FROM generate_series(1, 10) AS s
+ORDER BY random();
+</programlisting>
+    Here, <literal>random()</literal> is executed once per row in the
+    <literal>ORDER BY</literal>. The <literal>random()</literal> invocations in
+    the <command>SELECT</command> list refer to <emphasis>that</emphasis> value
+    for <literal>random()</literal> instead of executing the function again.
+   </para>
+  </sect2>
 
-  <para>
-   A <replaceable>sort_expression</replaceable> can also be the column label or number
-   of an output column, as in:
-<programlisting>
-SELECT a + b AS sum, c FROM table1 ORDER BY sum;
-SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1;
-</programlisting>
-   both of which sort by the first output column.  Note that an output
-   column name has to stand alone, that is, it cannot be used in an expression
-   &mdash; for example, this is <emphasis>not</emphasis> correct:
-<programlisting>
-SELECT a + b AS sum, c FROM table1 ORDER BY sum + c;          -- wrong
-</programlisting>
-   This restriction is made to reduce ambiguity.  There is still
-   ambiguity if an <literal>ORDER BY</literal> item is a simple name that
-   could match either an output column name or a column from the table
-   expression.  The output column is used in such cases.  This would
-   only cause confusion if you use <literal>AS</literal> to rename an output
-   column to match some other table column's name.
-  </para>
+  <sect2 id="queries-order-sort-options">
+   <title>Sort Expression Options</title>
 
-  <para>
-   <literal>ORDER BY</literal> can be applied to the result of a
-   <literal>UNION</literal>, <literal>INTERSECT</literal>, or <literal>EXCEPT</literal>
-   combination, but in this case it is only permitted to sort by
-   output column names or numbers, not by expressions.
-  </para>
+   <para>
+    Each expression can be followed by an optional <literal>ASC</literal> or
+    <literal>DESC</literal> keyword to set the sort direction to ascending or
+    descending.  <literal>ASC</literal> order is the default.
+    Ascending order puts smaller values first, where
+    <quote>smaller</quote> is defined in terms of the
+    <literal>&lt;</literal> operator.  Similarly, descending order is
+    determined with the <literal>&gt;</literal> operator.
+     <footnote>
+      <para>
+       Actually, <productname>PostgreSQL</productname> uses the <firstterm>default B-tree
+       operator class</firstterm> for the expression's data type to determine the sort
+       ordering for <literal>ASC</literal> and <literal>DESC</literal>.  Conventionally,
+       data types will be set up so that the <literal>&lt;</literal> and
+       <literal>&gt;</literal> operators correspond to this sort ordering,
+       but a user-defined data type's designer could choose to do something
+       different.
+      </para>
+     </footnote>
+   </para>
+
+   <para>
+    The <literal>NULLS FIRST</literal> and <literal>NULLS LAST</literal> options can be
+    used to determine whether nulls appear before or after non-null values
+    in the sort ordering.  By default, null values sort as if larger than any
+    non-null value; that is, <literal>NULLS FIRST</literal> is the default for
+    <literal>DESC</literal> order, and <literal>NULLS LAST</literal> otherwise.
+   </para>
+
+   <para>
+    Note that the ordering options are considered independently for each
+    sort column.  For example <literal>ORDER BY x, y DESC</literal> means
+    <literal>ORDER BY x ASC, y DESC</literal>, which is not the same as
+    <literal>ORDER BY x DESC, y DESC</literal>.
+   </para>
+  </sect2>
+
+  <sect2 id="queries-order-set-operations">
+   <title><literal>ORDER BY</literal> and Set Operations</title>
+   <para>
+    <literal>ORDER BY</literal> can be applied to the result of a
+    <literal>UNION</literal>, <literal>INTERSECT</literal>, or <literal>EXCEPT</literal>
+    combination.  In this case it is only permitted to sort by output column
+    names or numbers, not by expressions.
+   </para>
+  </sect2>
  </sect1>
 
 
-- 
2.44.0