Changes to pg_dump/psql following collation "C" in the catalog

Started by Daniel Veriteabout 7 years ago9 messages
#1Daniel Verite
daniel@manitou-mail.org

Hi,

One consequence of using the "C" collation in the catalog versus
the db collation is that pg_dump -t with a regexp may not find
the same tables as before. It happens when these conditions are
all met:
- the collation of the database is not "C"
- the regexp has locale-dependant parts
- the names to match include characters that are sensitive to
locale-dependant matching

For instance a table named "rapport_journée_12" in an fr_FR.UTF-8 db
used to be found by pg_dump -t 'rapport_\w+_\d+', and is now missed
in the devel version.

It seems that to fix that, we could qualify the references to columns such
as "relname" and "schema_name" with COLLATE "default" clauses in the
queries that use pattern-matching in client-side tools, AFAICS
pg_dump and psql.

Before going any further with this idea, is there agreement that it's an
issue to address and does this look like the best way to do that?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#1)
Re: Changes to pg_dump/psql following collation "C" in the catalog

"Daniel Verite" <daniel@manitou-mail.org> writes:

One consequence of using the "C" collation in the catalog versus
the db collation is that pg_dump -t with a regexp may not find
the same tables as before. It happens when these conditions are
all met:
- the collation of the database is not "C"
- the regexp has locale-dependant parts
- the names to match include characters that are sensitive to
locale-dependant matching

Hm, interesting.

It seems that to fix that, we could qualify the references to columns such
as "relname" and "schema_name" with COLLATE "default" clauses in the
queries that use pattern-matching in client-side tools, AFAICS
pg_dump and psql.

Seems reasonable. I was initially worried that this might interfere with
query optimization, but some experimentation says that the planner
successfully derives prefix index clauses anyway (which is correct,
because matching a fixed regex prefix doesn't depend on locale).

It might be better to attach the COLLATE clause to the pattern constant
instead of the column name; that'd be less likely to break if sent to
an older server.

Before going any further with this idea, is there agreement that it's an
issue to address and does this look like the best way to do that?

That is a question worth asking. We're going to be forcing people to get
used to this when working directly in SQL, so I don't know if masking it
in a subset of tools is really a win or not.

regards, tom lane

#3Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Changes to pg_dump/psql following collation "C" in the catalog

Tom Lane wrote:

"Daniel Verite" <daniel@manitou-mail.org> writes:

One consequence of using the "C" collation in the catalog versus
the db collation is that pg_dump -t with a regexp may not find
the same tables as before. It happens when these conditions are
all met:
- the collation of the database is not "C"
- the regexp has locale-dependant parts
- the names to match include characters that are sensitive to
locale-dependant matching

Hm, interesting.

It seems that to fix that, we could qualify the references to columns such
as "relname" and "schema_name" with COLLATE "default" clauses in the
queries that use pattern-matching in client-side tools, AFAICS
pg_dump and psql.

Seems reasonable. I was initially worried that this might interfere with
query optimization, but some experimentation says that the planner
successfully derives prefix index clauses anyway (which is correct,
because matching a fixed regex prefix doesn't depend on locale).

It might be better to attach the COLLATE clause to the pattern constant
instead of the column name; that'd be less likely to break if sent to
an older server.

Before going any further with this idea, is there agreement that it's an
issue to address and does this look like the best way to do that?

That is a question worth asking. We're going to be forcing people to get
used to this when working directly in SQL, so I don't know if masking it
in a subset of tools is really a win or not.

I think psql and pg_dump need to adjust, just like the 3rd party tools
will, at least those that support collation-aware search in the catalog.
PFA a patch that implements the slight changes needed.

I'll add an entry for it in the next CF.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachments:

processSQLNamePattern-using-db-collation.difftext/x-patch; name=processSQLNamePattern-using-db-collation.diffDownload
diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index 5c1732a..69ac6f9 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -808,6 +808,8 @@ appendReloptionsArray(PQExpBuffer buffer, const char *reloptions,
  * to limit the set of objects returned.  The WHERE clauses are appended
  * to the already-partially-constructed query in buf.  Returns whether
  * any clause was added.
+ * The pattern matching uses the collation of the database through explicit
+ * COLLATE "default" clauses.
  *
  * conn: connection query will be sent to (consulted for escaping rules).
  * buf: output parameter.
@@ -971,17 +973,18 @@ processSQLNamePattern(PGconn *conn, PQExpBuffer buf, const char *pattern,
 				appendPQExpBuffer(buf,
 								  "(%s OPERATOR(pg_catalog.~) ", namevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
+				appendPQExpBufferStr(buf, " COLLATE \"default\"");
 				appendPQExpBuffer(buf,
 								  "\n        OR %s OPERATOR(pg_catalog.~) ",
 								  altnamevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
-				appendPQExpBufferStr(buf, ")\n");
+				appendPQExpBufferStr(buf, " COLLATE \"default\" )\n");
 			}
 			else
 			{
 				appendPQExpBuffer(buf, "%s OPERATOR(pg_catalog.~) ", namevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
-				appendPQExpBufferChar(buf, '\n');
+				appendPQExpBufferStr(buf, " COLLATE \"default\"\n");
 			}
 		}
 	}
@@ -997,7 +1000,7 @@ processSQLNamePattern(PGconn *conn, PQExpBuffer buf, const char *pattern,
 			WHEREAND();
 			appendPQExpBuffer(buf, "%s OPERATOR(pg_catalog.~) ", schemavar);
 			appendStringLiteralConn(buf, schemabuf.data, conn);
-			appendPQExpBufferChar(buf, '\n');
+			appendPQExpBufferStr(buf, " COLLATE \"default\"\n");
 		}
 	}
 	else
#4Chapman Flack
chap@anastigmatix.net
In reply to: Daniel Verite (#3)
Re: Changes to pg_dump/psql following collation "C" in the catalog

"Daniel Verite" <daniel@manitou-mail.org> writes:

One consequence of using the "C" collation in the catalog versus
the db collation

As an intrigued Person Following At Home, I was happy when I found
this little three-message thread had more context in [1]/messages/by-id/15938.1544377821@sss.pgh.pa.us and [2]/messages/by-id/5978.1544030694@sss.pgh.pa.us. :)

-Chap

[1]: /messages/by-id/15938.1544377821@sss.pgh.pa.us
[2]: /messages/by-id/5978.1544030694@sss.pgh.pa.us

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#3)
Re: Changes to pg_dump/psql following collation "C" in the catalog

"Daniel Verite" <daniel@manitou-mail.org> writes:

I think psql and pg_dump need to adjust, just like the 3rd party tools
will, at least those that support collation-aware search in the catalog.
PFA a patch that implements the slight changes needed.
I'll add an entry for it in the next CF.

Hm, if that's as much as we have to touch, I think there's a good
argument for squeezing it into v12 rather than waiting. The point
here is mostly to avoid a behavior change from pre-v12, but if we
allow v12 to have a different behavior, it's questionable whether
we'd want v13 to change it back.

Just looking at the patch, I wonder whether it doesn't need some
server-version checks. At the very least this would break with
pre-9.1 servers, which lack COLLATE altogether.

regards, tom lane

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Chapman Flack (#4)
Re: Changes to pg_dump/psql following collation "C" in the catalog

Chapman Flack wrote:

"Daniel Verite" <daniel@manitou-mail.org> writes:

One consequence of using the "C" collation in the catalog versus
the db collation

As an intrigued Person Following At Home, I was happy when I found
this little three-message thread had more context in [1] and [2]. :)

-Chap

[1] /messages/by-id/15938.1544377821@sss.pgh.pa.us
[2] /messages/by-id/5978.1544030694@sss.pgh.pa.us

Yes. The concrete issue that the patch addresses can be illustrated
with this example:

psql (12devel)
Type "help" for help.

postgres=# show lc_collate ;
lc_collate
-------------
fr_FR.UTF-8
(1 row)

postgres=# create table année_2018();
CREATE TABLE

postgres=# \dt '\\w+_\\d+'
psql: error: Did not find any relation named "\w+_\d+".

In previous versions it would have found the table with the accent
in the name. With 12devel it doesn't, because the match is done with
the collation of the column, now "C", which does not consider the
accented character to be a letter.
This also affects pg_dump with the -t and -n switches that
accept patterns and I think pretty much all \d* commands
that accept patterns too.

The purpose of the fix is for the patterns to give the same results as
before. It's done by simply adding explicit collate clauses to use the
collation of the database for these comparisons.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#7Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#5)
1 attachment(s)
Re: Changes to pg_dump/psql following collation "C" in the catalog

Tom Lane wrote:

Hm, if that's as much as we have to touch, I think there's a good
argument for squeezing it into v12 rather than waiting. The point
here is mostly to avoid a behavior change from pre-v12

Yes. I was mentioning the next CF because ISTM that nowadays
non-committers are expected to file patches in there, committers
picking up patches both in the current and next CF based on
their evaluation of priorities.
But if you plan to process this one shortly, a CF entry is probably
superfluous.

Just looking at the patch, I wonder whether it doesn't need some
server-version checks. At the very least this would break with
pre-9.1 servers, which lack COLLATE altogether.

PFA a new version adding the clause for only 12 and up, since the
previous versions are not concerned, and as you mention, really old
versions would fail otherwise.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachments:

processSQLNamePattern-using-db-collation-v2.difftext/x-patch; name=processSQLNamePattern-using-db-collation-v2.diffDownload
diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index 5c1732a..fe88d73 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -808,6 +808,8 @@ appendReloptionsArray(PQExpBuffer buffer, const char *reloptions,
  * to limit the set of objects returned.  The WHERE clauses are appended
  * to the already-partially-constructed query in buf.  Returns whether
  * any clause was added.
+ * The pattern matching uses the collation of the database through explicit
+ * COLLATE "default" clauses when needed (server version 12 and above).
  *
  * conn: connection query will be sent to (consulted for escaping rules).
  * buf: output parameter.
@@ -971,16 +973,22 @@ processSQLNamePattern(PGconn *conn, PQExpBuffer buf, const char *pattern,
 				appendPQExpBuffer(buf,
 								  "(%s OPERATOR(pg_catalog.~) ", namevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
+				if (PQserverVersion(conn) >= 120000)
+					appendPQExpBufferStr(buf, " COLLATE \"default\"");
 				appendPQExpBuffer(buf,
 								  "\n        OR %s OPERATOR(pg_catalog.~) ",
 								  altnamevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
+				if (PQserverVersion(conn) >= 120000)
+					appendPQExpBufferStr(buf, " COLLATE \"default\"");
 				appendPQExpBufferStr(buf, ")\n");
 			}
 			else
 			{
 				appendPQExpBuffer(buf, "%s OPERATOR(pg_catalog.~) ", namevar);
 				appendStringLiteralConn(buf, namebuf.data, conn);
+				if (PQserverVersion(conn) >= 120000)
+					appendPQExpBufferStr(buf, " COLLATE \"default\"");
 				appendPQExpBufferChar(buf, '\n');
 			}
 		}
@@ -997,6 +1005,8 @@ processSQLNamePattern(PGconn *conn, PQExpBuffer buf, const char *pattern,
 			WHEREAND();
 			appendPQExpBuffer(buf, "%s OPERATOR(pg_catalog.~) ", schemavar);
 			appendStringLiteralConn(buf, schemabuf.data, conn);
+			if (PQserverVersion(conn) >= 120000)
+				appendPQExpBufferStr(buf, " COLLATE \"default\"");
 			appendPQExpBufferChar(buf, '\n');
 		}
 	}
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#7)
Re: Changes to pg_dump/psql following collation "C" in the catalog

"Daniel Verite" <daniel@manitou-mail.org> writes:

Hm, if that's as much as we have to touch, I think there's a good
argument for squeezing it into v12 rather than waiting. The point
here is mostly to avoid a behavior change from pre-v12

Yes. I was mentioning the next CF because ISTM that nowadays
non-committers are expected to file patches in there, committers
picking up patches both in the current and next CF based on
their evaluation of priorities.

Yeah, it's good practice to make a CF entry to ensure the patch doesn't
slip through the cracks. There's an awful lot of traffic on pgsql-hackers
these days...

Just looking at the patch, I wonder whether it doesn't need some
server-version checks. At the very least this would break with
pre-9.1 servers, which lack COLLATE altogether.

PFA a new version adding the clause for only 12 and up, since the
previous versions are not concerned, and as you mention, really old
versions would fail otherwise.

Pushed with some fiddling with the comments, and changing the collation
names to be schema-qualified for paranoia's sake.

regards, tom lane

#9Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#8)
Re: Changes to pg_dump/psql following collation "C" in the catalog

Tom Lane wrote:

PFA a new version adding the clause for only 12 and up, since the
previous versions are not concerned, and as you mention, really old
versions would fail otherwise.

Pushed with some fiddling with the comments, and changing the collation
names to be schema-qualified for paranoia's sake.

Thanks !

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite