Giving the shared catalogues a defined encoding

Started by Thomas Munro about 1 year ago, 5 messages

thomas.munro@gmail.com

about 1 year ago

1 attachment(s)

Hello hackers,

Here's a WIP patch that started on a bugs thread[1].

Problem #1: You can have two databases with different encodings, and
they both pretend that pg_database, pg_authid, pg_db_role_setting etc
are in the local database encoding. That doesn't work too well:
non-ASCII text can be reinterpreted in the wrong encoding.
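
To make the failure mode concrete, here is a minimal sketch (not part of the patch) of what happens to bytes when LATIN1 text is re-encoded as UTF8. The function name latin1_to_utf8 is made up for illustration. ASCII bytes come through unchanged, but any byte with the high bit set becomes a different two-byte sequence, so a database that reads the raw catalogue bytes in the other encoding sees something else, or something invalid.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only (hypothetical helper, not from the patch): re-encode
 * LATIN1 bytes as UTF-8.  Returns the number of bytes written to dst, which
 * is not NUL-terminated and must have room for 2 * len bytes.
 */
size_t
latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
	size_t		n = 0;

	assert(src != NULL && dst != NULL);

	for (size_t i = 0; i < len; i++)
	{
		unsigned char c = src[i];

		if (c < 0x80)
		{
			/* 7-bit ASCII: the same single byte in both encodings. */
			dst[n++] = c;
		}
		else
		{
			/* U+0080..U+00FF: two bytes in UTF-8, one byte in LATIN1. */
			dst[n++] = (unsigned char) (0xC0 | (c >> 6));
			dst[n++] = (unsigned char) (0x80 | (c & 0x3F));
		}
	}
	return n;
}
```

For example, the LATIN1 byte 0xE9 (e with acute accent) becomes the pair 0xC3 0xA9, while "caf" is byte-for-byte identical on both sides.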

There's no problem if you only use one encoding everywhere (probably
UTF8). There's also no problem if you use multiple database
encodings, but put only ASCII in the shared catalogues (because ASCII
is a subset of every supported server encoding). This patch is about
formalising and enforcing those two working arrangements, hopefully
invisibly to most users. There's still an escape hatch mode if you
need it, e.g. for a non-conforming pg_upgrade'd system.

The patch invents a new setting CLUSTER CATALOG ENCODING, which can be
inspected with SHOW and changed with ALTER SYSTEM. It has three
possible values:

DATABASE: The shared catalogs use the same encoding as this database,
and all databases in the cluster have to use the default encoding
configured at initdb time. Database names and role names are free to
use any characters you like in that one single encoding. This is the
default.

ASCII: The shared catalogs are restricted to 7-bit ASCII, but in
exchange, databases with different encodings are allowed to co-exist.

UNDEFINED: The old behavior, no restrictions.

There's some documentation in the patch to explain that again in more
words, and a regression transcript showing the behaviour, i.e. things
you can and can't do in each mode, and how the transitions between
modes can be blocked until you make certain changes.
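
The transition rules can be sketched like this, following the documentation in the patch: switching to ASCII requires that the shared catalogues hold no non-ASCII text, switching to DATABASE requires that every existing database uses that encoding, and switching to UNDEFINED is always allowed. This is a simplification with invented names; the patch scans the actual catalogues under AccessExclusiveLock rather than taking arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the patch stores an encoding number instead. */
typedef enum
{
	TARGET_DATABASE,
	TARGET_ASCII,
	TARGET_UNDEFINED
} TargetMode;

static bool
only_ascii(const char *s)
{
	for (; *s != '\0'; s++)
	{
		if ((unsigned char) *s & 0x80)
			return false;
	}
	return true;
}

/*
 * Decide whether a mode change may proceed.  shared_strings stands in for
 * all text held in shared catalogues, db_encodings for the encoding number
 * of each existing database.
 */
bool
mode_change_allowed(TargetMode target,
					const char *const *shared_strings, size_t nstrings,
					const int *db_encodings, size_t ndbs,
					int current_db_encoding)
{
	switch (target)
	{
		case TARGET_UNDEFINED:
			/* No restrictions, though not recommended. */
			return true;

		case TARGET_ASCII:
			/* Blocked until every shared string is 7-bit ASCII. */
			for (size_t i = 0; i < nstrings; i++)
			{
				if (!only_ascii(shared_strings[i]))
					return false;
			}
			return true;

		case TARGET_DATABASE:
			/* Blocked until every database uses the same encoding. */
			for (size_t i = 0; i < ndbs; i++)
			{
				if (db_encodings[i] != current_db_encoding)
					return false;
			}
			return true;
	}
	return false;
}
```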

Problem #2: When dealing with new connections, we currently have
trouble with non-ASCII database and role names because the encoding is
undefined for both the catalogue and the network message. With this
patch, at least the catalogue encoding is defined (unless UNDEFINED),
so there's a pathway to straighten that out.

I am open to better terminology, models, etc. The command seems
verbose, but I hope you'd almost never need to run it, so being clear
seemed better than being brief. I had just CATALOG ENCODING in the
previous version, but then it's not clear that it only affects a few
special catalogues (pg_class et al are always in database encoding, as
they're not shared). I tried SHARED CATALOG ENCODING, but that's
not really a SQL word or concept. CLUSTER is, so here I'm trying
that. On the other hand CLUSTER is a bit overloaded. I had explicit
encoding names e.g. SET ... TO UTF8 in the previous version, but it
seems easier to call it DATABASE encoding given it had to match the
database anyway if not ASCII/UNDEFINED... There could be other ways
to express all this, though. It does still store the real encoding in
the control file, UTF8 -> 6 or whatever, in case that is useful. When
I was using real encoding names in the syntax, I also had SQL_ASCII
for ASCII mode, but that was quite confusing because SQL_ASCII is well
documented as accepting anything at all, whereas here we need 7-bit
ASCII.

Feedback welcome.

[1]: /messages/by-id/CA+hUKGKKNAc599Vp7kFAnLE1=V=ceYujz_YQoSNrvNFGaJ6i7w@mail.gmail.com

Attachments:

v2-0001-Formalize-the-encoding-of-the-shared-catalogs.patch (application/octet-stream)

From 5a0c410452f54b22a7a4361c90d7bb4a93c83d01 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 30 Nov 2024 20:00:19 +1300
Subject: [PATCH v2] Formalize the encoding of the shared catalogs.

The encoding of shared catalogs was previously undefined.  Each database
just naively used its own encoding, and early connection phases worked
with raw bytes and hoped for the best.  Database names, role names, and
more could be corrupted in various multi-encoding scenarios.

This commit introduces a new setting CLUSTER CATALOG ENCODING to define
and enforce the encoding.  It can be specified at initdb time, and
changed later with ALTER SYSTEM if the conditions for the new setting
are met.  Three settings are available:

DATABASE:  The shared catalogs use the same encoding as all databases,
           and all databases must use the same encoding.  Database names
           and role names are free to include non-ASCII characters.  This
           is the default.

ASCII:     The shared catalogs are restricted to 7-bit ASCII.  Databases
           with different encodings are allowed to co-exist, because
           ASCII is a subset of all supported encodings.

UNDEFINED: The old behavior.  Not recommended, and perhaps one day we'll
           consider removing it.

Work in progress!

Discussion: https://postgr.es/m/CA%2BhUKGKKNAc599Vp7kFAnLE1%3DV%3DceYujz_YQoSNrvNFGaJ6i7w%40mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   2 +
 doc/src/sgml/charset.sgml                     |  66 ++-
 doc/src/sgml/ref/alter_system.sgml            |  19 +-
 doc/src/sgml/ref/initdb.sgml                  |  21 +
 src/backend/access/rmgrdesc/xlogdesc.c        |  10 +
 src/backend/access/transam/xlog.c             |  42 +-
 src/backend/bootstrap/bootstrap.c             |   9 +-
 src/backend/catalog/Makefile                  |   1 +
 src/backend/catalog/encoding.c                | 432 ++++++++++++++++++
 src/backend/catalog/meson.build               |   1 +
 src/backend/catalog/pg_db_role_setting.c      |   5 +
 src/backend/commands/alter.c                  |  18 +
 src/backend/commands/comment.c                |   3 +
 src/backend/commands/dbcommands.c             |  28 ++
 src/backend/commands/seclabel.c               |   3 +
 src/backend/commands/subscriptioncmds.c       |   6 +
 src/backend/commands/tablespace.c             |   4 +
 src/backend/commands/user.c                   |   4 +
 src/backend/parser/gram.y                     |  14 +
 src/backend/replication/logical/origin.c      |   2 +
 src/backend/tcop/utility.c                    |  11 +-
 src/backend/utils/misc/guc_tables.c           |  12 +
 src/bin/initdb/initdb.c                       |  22 +-
 src/bin/pg_controldata/pg_controldata.c       |   3 +
 src/bin/pg_resetwal/pg_resetwal.c             |   2 +
 src/bin/pg_upgrade/controldata.c              |  16 +
 src/bin/pg_upgrade/pg_upgrade.c               |  17 +
 src/bin/pg_upgrade/pg_upgrade.h               |   1 +
 src/bin/psql/tab-complete.in.c                |  12 +-
 src/bin/scripts/t/020_createdb.pl             |   4 +-
 src/include/access/xlog.h                     |   5 +-
 src/include/access/xlog_internal.h            |   6 +
 src/include/catalog/catalog.h                 |   3 +
 src/include/catalog/pg_control.h              |   5 +-
 src/include/commands/alter.h                  |   2 +
 src/include/nodes/parsenodes.h                |   1 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/encoding/Makefile            |  17 +
 .../expected/cluster_catalog_encoding.out     | 208 +++++++++
 src/test/modules/encoding/meson.build         |  14 +
 .../encoding/sql/cluster_catalog_encoding.sql | 137 ++++++
 src/test/modules/meson.build                  |   1 +
 src/test/regress/expected/database.out        |  16 +-
 src/test/regress/pg_regress.c                 |   2 +
 src/test/regress/sql/database.sql             |  16 +-
 src/tools/pgindent/typedefs.list              |   1 +
 46 files changed, 1197 insertions(+), 28 deletions(-)
 create mode 100644 src/backend/catalog/encoding.c
 create mode 100644 src/test/modules/encoding/Makefile
 create mode 100644 src/test/modules/encoding/expected/cluster_catalog_encoding.out
 create mode 100644 src/test/modules/encoding/meson.build
 create mode 100644 src/test/modules/encoding/sql/cluster_catalog_encoding.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index bf3cee08a93..1a7c0b9565e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -36,6 +36,8 @@
    database creation and are thereafter database-specific. A few
    catalogs are physically shared across all databases in a cluster;
    these are noted in the descriptions of the individual catalogs.
+   This has implications for their character set; see
+   <xref linkend="cluster-catalog-encoding"/> for details.
   </para>
 
   <table id="catalog-table">
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 00e1986849a..bbfba46eb0c 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -2222,7 +2222,10 @@ initdb -E EUC_JP
 
     <para>
      You can specify a non-default encoding at database creation time,
-     provided that the encoding is compatible with the selected locale:
+     provided that the encoding is compatible with the selected locale,
+     and <literal>CLUSTER CATALOG ENCODING</literal> is set to
+     <literal>ASCII</literal> (or <literal>UNDEFINED</literal>, not
+     recommended).
 
 <screen>
 createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
@@ -2401,6 +2404,67 @@ RESET client_encoding;
     </para>
    </sect2>
 
+   <sect2 id="cluster-catalog-encoding">
+    <title>Character Sets and Catalogs Shared by the Whole Cluster</title>
+    <para>
+     The names of databases, roles and a small number of other object types
+     are
+     stored in <link linkend="catalogs-overview">shared catalogs</link>,
+     and are visible from all the databases in a cluster.
+     Since databases can use different encodings, a trade-off is required to
+     make sure that they use compatible character sets.
+     Two main configurations are available, controlled by the system setting
+     <literal>CLUSTER CATALOG ENCODING</literal>:
+
+     <itemizedlist>
+      <listitem>
+       <para>
+        <literal>DATABASE</literal>:  To allow non-ASCII characters to be used in
+        database and role names, all databases must use the same encoding.
+        This is the default setting used by <command>initdb</command>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        <literal>ASCII</literal>: To allow databases with different
+        encodings to co-exist, the shared catalogs must use only ASCII
+        characters.  It is a subset of all supported server encodings, so
+        conforming strings are also valid in every database's encoding without
+        conversion (or risk of conversion failure).
+       </para>
+      </listitem>
+     </itemizedlist>
+    </para>
+    <para>
+     A third setting <literal>UNDEFINED</literal> allows
+     databases with different encodings to co-exist while also allowing
+     non-ASCII characters in database and role names.
+     Corruption can occur when in this mode.  It is provided to
+     support upgrading from older PostgreSQL releases, but is not
+     recommended for new deployments.
+    </para>
+    <para>
+     <literal>CLUSTER CATALOG ENCODING</literal> can be set with the
+     <xref linkend="app-initdb-option-cluster-catalog-encoding"/> option to <command>initdb</command>, or
+     changed later using the <xref linkend="sql-altersystem"/>
+     command.  To change to
+     <literal>ASCII</literal>, shared catalogs must have no existing non-ASCII
+     characters.
+     To change to <literal>DATABASE</literal> encoding, all existing databases must be using
+     that encoding.
+     It is always possible to change to <literal>UNDEFINED</literal> without restriction,
+     but not recommended.
+    </para>
+    <para>
+     Note that when <literal>CLUSTER CATALOG ENCODING</literal> is set to
+     <literal>ASCII</literal>,
+     it only restricts the names and properties of a small number of kinds of database
+     objects that have cluster-wide visibility.  Most objects such as tables,
+     indexes and functions are stored in per-database catalogs and can always
+     use the full character set of the database's encoding.
+    </para>
+   </sect2>
+
    <sect2 id="multibyte-conversions-supported">
     <title>Available Character Set Conversions</title>
 
diff --git a/doc/src/sgml/ref/alter_system.sgml b/doc/src/sgml/ref/alter_system.sgml
index 1bde66d6ad2..52b7100a52e 100644
--- a/doc/src/sgml/ref/alter_system.sgml
+++ b/doc/src/sgml/ref/alter_system.sgml
@@ -25,6 +25,8 @@ ALTER SYSTEM SET <replaceable class="parameter">configuration_parameter</replace
 
 ALTER SYSTEM RESET <replaceable class="parameter">configuration_parameter</replaceable>
 ALTER SYSTEM RESET ALL
+
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO { DATABASE | ASCII | UNDEFINED }
 </synopsis>
  </refsynopsisdiv>
 
@@ -32,7 +34,7 @@ ALTER SYSTEM RESET ALL
   <title>Description</title>
 
   <para>
-   <command>ALTER SYSTEM</command> is used for changing server configuration
+   <command>ALTER SYSTEM { SET | RESET }</command> is used for changing server configuration
    parameters across the entire database cluster.  It can be more convenient
    than the traditional method of manually editing
    the <filename>postgresql.conf</filename> file.
@@ -54,6 +56,20 @@ ALTER SYSTEM RESET ALL
    or sending a <systemitem>SIGHUP</systemitem> signal to the main server process.
   </para>
 
+  <para>
+   <command>ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO</command> selects the
+   character set used for role names, database names and the properties of
+   certain other database objects that are shared
+   by all databases in the cluster.  <literal>DATABASE</literal> means that database
+   encoding is used, and all databases are required to use the same encoding.
+   <literal>ASCII</literal> means that only the ASCII character set can be used,
+   but databases can use any encoding.
+   <literal>UNDEFINED</literal> disables enforcement of character set restrictions and is
+   not recommended.  See <xref linkend="cluster-catalog-encoding"/> for details.
+   Unlike other <command>ALTER SYSTEM SET</command> commands, this change takes
+   effect immediately, if the required conditions are met.
+  </para>
+
   <para>
    Only superusers and users granted <literal>ALTER SYSTEM</literal> privilege
    on a parameter can change it using <command>ALTER SYSTEM</command>.  Also, since
@@ -96,6 +112,7 @@ ALTER SYSTEM RESET ALL
      </para>
     </listitem>
    </varlistentry>
+
   </variablelist>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0c32114cf70..d3bead500ba 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -132,6 +132,10 @@ PostgreSQL documentation
 
   <para>
    To alter the default encoding, use the <option>--encoding</option>.
+   To specify a different encoding for the shared catalogs, use
+   <option>--cluster-catalog-encoding</option>; this affects the character
+   set available for database and role names, and the ability to create
+   databases with different encodings.
    More details can be found in <xref linkend="multibyte"/>.
   </para>
 
@@ -227,6 +231,23 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-cluster-catalog-encoding">
+      <term><option>-C <replaceable class="parameter">encoding</replaceable></option></term>
+      <term><option>--cluster-catalog-encoding=<replaceable class="parameter">encoding</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the initial <literal>CLUSTER CATALOG ENCODING</literal> setting.
+       </para>
+       <para>
+        By default, shared catalog encoding is set to <literal>DATABASE</literal>.
+        Valid options are <literal>DATABASE</literal>, <literal>ASCII</literal>
+        and <literal>UNDEFINED</literal>.  The value can be changed later with
+        the <literal>ALTER SYSTEM</literal> command.
+        See <xref linkend="cluster-catalog-encoding"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-allow-group-access">
       <term><option>-g</option></term>
       <term><option>--allow-group-access</option></term>
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 363294d6234..c65bd741dae 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -134,6 +134,13 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec.wal_log_hints ? "on" : "off",
 						 xlrec.track_commit_timestamp ? "on" : "off");
 	}
+	else if (info == XLOG_CLUSTER_CATALOG_ENCODING_CHANGE)
+	{
+		xl_cluster_catalog_encoding_change xlrec;
+
+		memcpy(&xlrec, rec, sizeof(xl_cluster_catalog_encoding_change));
+		appendStringInfo(buf, "encoding=%d", xlrec.encoding);
+	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
@@ -197,6 +204,9 @@ xlog_identify(uint8 info)
 		case XLOG_PARAMETER_CHANGE:
 			id = "PARAMETER_CHANGE";
 			break;
+		case XLOG_CLUSTER_CATALOG_ENCODING_CHANGE:
+			id = "CLUSTER_CATALOG_ENCODING_CHANGE";
+			break;
 		case XLOG_RESTORE_POINT:
 			id = "RESTORE_POINT";
 			break;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..52c675dbe8a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -70,6 +70,7 @@
 #include "common/file_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
+#include "mb/pg_wchar.h"
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -5030,7 +5031,8 @@ XLOGShmemInit(void)
  * and the initial XLOG segment.
  */
 void
-BootStrapXLOG(uint32 data_checksum_version)
+BootStrapXLOG(int cluster_catalog_encoding,
+			  uint32 data_checksum_version)
 {
 	CheckPoint	checkPoint;
 	char	   *buffer;
@@ -5173,6 +5175,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier, data_checksum_version);
+	ControlFile->cluster_catalog_encoding = cluster_catalog_encoding;
 	ControlFile->time = checkPoint.time;
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
@@ -8557,6 +8560,13 @@ xlog_redo(XLogReaderState *record)
 		/* Check to see if any parameter change gives a problem on recovery */
 		CheckRequiredParameterValues();
 	}
+	else if (info == XLOG_CLUSTER_CATALOG_ENCODING_CHANGE)
+	{
+		xl_cluster_catalog_encoding_change xlrec;
+
+		memcpy(&xlrec, XLogRecGetData(record), sizeof(xlrec));
+		ControlFile->cluster_catalog_encoding = xlrec.encoding;
+	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
@@ -9510,3 +9520,33 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Get the shared catalog encoding.  This should only be called when a lock is
+ * held on one of the shared catalog tables.
+ */
+int
+GetClusterCatalogEncoding(void)
+{
+	return ControlFile->cluster_catalog_encoding;
+}
+
+/*
+ * Set CLUSTER CATALOG ENCODING.  This should only be called with
+ * AccessExclusiveLock held on *all* shared catalog tables that contain text.
+ */
+void
+SetClusterCatalogEncoding(int encoding)
+{
+	xl_cluster_catalog_encoding_change xlrec;
+
+	START_CRIT_SECTION();
+	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+	xlrec.encoding = encoding;
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+	XLogFlush(XLogInsert(RM_XLOG_ID, XLOG_CLUSTER_CATALOG_ENCODING_CHANGE));
+	ControlFile->cluster_catalog_encoding = encoding;
+	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index d31a67599c9..cc1bfd2ce04 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -202,6 +202,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	int			flag;
 	char	   *userDoption = NULL;
 	uint32		bootstrap_data_checksum_version = 0;	/* No checksum */
+	int			cluster_catalog_encoding = -1;
 
 	Assert(!IsUnderPostmaster);
 
@@ -217,7 +218,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:C:d:D:Fkr:X:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -266,6 +267,9 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 					pfree(debugstr);
 				}
 				break;
+			case 'C':
+				cluster_catalog_encoding = atoi(optarg);
+				break;
 			case 'F':
 				SetConfigOption("fsync", "false", PGC_POSTMASTER, PGC_S_ARGV);
 				break;
@@ -341,7 +345,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	BaseInit();
 
 	bootstrap_signals();
-	BootStrapXLOG(bootstrap_data_checksum_version);
+	BootStrapXLOG(cluster_catalog_encoding,
+				  bootstrap_data_checksum_version);
 
 	/*
 	 * To ensure that src/common/link-canary.c is linked into the backend, we
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 1589a75fd53..ed460a95e62 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -17,6 +17,7 @@ OBJS = \
 	aclchk.o \
 	catalog.o \
 	dependency.o \
+	encoding.o \
 	heap.o \
 	index.o \
 	indexing.o \
diff --git a/src/backend/catalog/encoding.c b/src/backend/catalog/encoding.c
new file mode 100644
index 00000000000..8f1d69a39de
--- /dev/null
+++ b/src/backend/catalog/encoding.c
@@ -0,0 +1,432 @@
+/*-------------------------------------------------------------------------
+ *
+ * encoding.c
+ *		Shared catalog encoding management.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/catalog/encoding.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/table.h"
+#include "access/xlog.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_authid.h"
+#include "catalog/pg_database.h"
+#include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_namespace.h"
+#include "catalog/pg_parameter_acl.h"
+#include "catalog/pg_replication_origin.h"
+#include "catalog/pg_shdescription.h"
+#include "catalog/pg_shseclabel.h"
+#include "catalog/pg_subscription.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/dbcommands.h"
+#include "common/string.h"
+#include "mb/pg_wchar.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+
+/*
+ * Check if a NULL-terminated string can be inserted into a shared catalog.
+ * The caller must hold a lock on the shared catalog table, to block
+ * AlterSystemSetClusterCatalogEncoding().
+ */
+void
+ValidateClusterCatalogString(Relation rel, const char *s)
+{
+	/*
+	 * The main reason for taking the rel argument is to make sure that caller
+	 * remembered to lock the catalog before validating strings to be
+	 * inserted.  But we might as well check it's a shared relation since we
+	 * have it.
+	 */
+	Assert(rel->rd_rel->relisshared);
+
+	/*
+	 * If using SQL_ASCII, then we have to make sure this string is clean
+	 * 7-bit ASCII, so that it is valid in every supported encoding.
+	 */
+	if (GetClusterCatalogEncoding() != PG_SQL_ASCII)
+	{
+		/*
+		 * Otherwise, either we're in UNDEFINED mode where anything goes, or all
+		 * databases are using the same encoding, which matches the shared
+		 * catalog encoding.  We don't have to validate anything.
+		 */
+		Assert(GetClusterCatalogEncoding() == -1 ||
+			   GetClusterCatalogEncoding() == GetDatabaseEncoding());
+		return;
+	}
+
+	if (!pg_is_ascii(s))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("the string \"%s\" contains invalid characters", s),
+				 errdetail("CLUSTER CATALOG ENCODING is set to ASCII."),
+				 errhint("Consider ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE.")));
+}
+
+/*
+ * Try to change the cluster catalog encoding, if all the conditions are met.
+ */
+void
+AlterSystemSetClusterCatalogEncoding(const char *encoding_name)
+{
+	Relation	rel;
+	SysScanDesc scan;
+	HeapTuple	tup;
+	int			encoding;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied")));
+
+	/* Decode the name. */
+	if (pg_strcasecmp(encoding_name, "DATABASE") == 0)
+		encoding = GetDatabaseEncoding();
+	else if (pg_strcasecmp(encoding_name, "ASCII") == 0)
+		encoding = PG_SQL_ASCII;
+	else if (pg_strcasecmp(encoding_name, "UNDEFINED") == 0)
+		encoding = -1;
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid shared catalog encoding: %s",
+						encoding_name)));
+
+	/*
+	 * Lock all of the shared catalog tables containing name or text values,
+	 * to prevent concurrent updates.  This is the set of shared catalogs that
+	 * contain text.  If new shared catalogs are invented that hold text they
+	 * will need to be handled here too.  For every validation that we perform
+	 * below, there must also be corresponding calls to
+	 * ValidateClusterCatalogString() in the commands that CREATE or ALTER
+	 * these database objects.
+	 */
+	LockRelationOid(AuthIdRelationId, AccessExclusiveLock);
+	LockRelationOid(DatabaseRelationId, AccessExclusiveLock);
+	LockRelationOid(DbRoleSettingRelationId, AccessExclusiveLock);
+	LockRelationOid(ParameterAclRelationId, AccessExclusiveLock);
+	LockRelationOid(ReplicationOriginRelationId, AccessExclusiveLock);
+	LockRelationOid(SharedDescriptionRelationId, AccessExclusiveLock);
+	LockRelationOid(SharedSecLabelRelationId, AccessExclusiveLock);
+	LockRelationOid(SubscriptionRelationId, AccessExclusiveLock);
+	LockRelationOid(TableSpaceRelationId, AccessExclusiveLock);
+
+	/* No change? */
+	if (GetClusterCatalogEncoding() == encoding)
+		return;
+
+	if (encoding == -1)
+	{
+		/* There are no encoding restrictions for UNDEFINED.  Good luck. */
+	}
+	else if (encoding == PG_SQL_ASCII)
+	{
+		/* Make sure all shared catalogs contain only pure 7-bit ASCII. */
+
+		/* pg_authid */
+		rel = table_open(AuthIdRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			Form_pg_authid authid = (Form_pg_authid) GETSTRUCT(tup);
+
+			if (!pg_is_ascii(NameStr(authid->rolname)))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing role name \"%s\" contains invalid characters",
+								NameStr(authid->rolname)),
+						 errhint("Consider ALTER ROLE ... RENAME TO ... using ASCII characters.")));
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_database */
+		rel = table_open(DatabaseRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			Form_pg_database db = (Form_pg_database) GETSTRUCT(tup);
+
+			if (!pg_is_ascii(NameStr(db->datname)))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing database name \"%s\" contains invalid characters",
+								NameStr(db->datname)),
+						 errhint("Consider ALTER DATABASE ... RENAME TO ... using ASCII characters.")));
+
+			/*
+			 * Locale-related text fields requiring heap tuple deforming
+			 * should already have been validated as pure ASCII, so we don't
+			 * have to work harder here.
+			 *
+			 * XXX That's only true for the LC_ stuff; what about ICU, should
+			 * it get the same treatment, or be checked here?
+			 */
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_db_role_setting */
+		rel = table_open(DbRoleSettingRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			Datum		setconfig;
+
+			setconfig = heap_getattr(tup, Anum_pg_db_role_setting_setconfig,
+									 RelationGetDescr(rel), &isnull);
+			if (!isnull)
+			{
+				List	   *gucNames;
+				List	   *gucValues;
+				ListCell   *lc1;
+				ListCell   *lc2;
+
+				TransformGUCArray(DatumGetArrayTypeP(setconfig), &gucNames, &gucValues);
+				forboth(lc1, gucNames, lc2, gucValues)
+				{
+					char	   *name = lfirst(lc1);
+					char	   *value = lfirst(lc2);
+
+					if (!pg_is_ascii(name) || !pg_is_ascii(value))
+					{
+						Datum		db_id;
+						Datum		role_id;
+
+						db_id = heap_getattr(tup, Anum_pg_db_role_setting_setdatabase,
+											 RelationGetDescr(rel), &isnull);
+						role_id = heap_getattr(tup, Anum_pg_db_role_setting_setrole,
+											   RelationGetDescr(rel), &isnull);
+
+						if (DatumGetObjectId(db_id) == InvalidOid)
+							ereport(ERROR,
+									(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+									 errmsg("role \"%s\" has setting \"%s\" with value \"%s\" that contains invalid characters",
+											GetUserNameFromId(DatumGetObjectId(role_id), false),
+											name,
+											value),
+									 errhint("Consider ALTER ROLE ... SET ... TO ... using ASCII characters.")));
+						else
+							ereport(ERROR,
+									(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+									 errmsg("role \"%s\" has setting \"%s\" with value \"%s\" in database \"%s\" that contains invalid characters",
+											GetUserNameFromId(DatumGetObjectId(role_id), false),
+											name,
+											value,
+											get_database_name(DatumGetObjectId(db_id))),
+									 errhint("Consider ALTER ROLE ... IN DATABASE ... SET ... TO ... using ASCII characters.")));
+					}
+					pfree(name);
+					pfree(value);
+				}
+				list_free(gucNames);
+				list_free(gucValues);
+			}
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_parameter_acl */
+		rel = table_open(ParameterAclRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			char	   *parname;
+
+			parname = TextDatumGetCString(heap_getattr(tup, Anum_pg_parameter_acl_parname,
+													   RelationGetDescr(rel), &isnull));
+
+			/*
+			 * This probably shouldn't happen as they are GUC names, so it's
+			 * hard to suggest a useful hint.
+			 */
+			if (!pg_is_ascii(parname))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing ACL parameter name \"%s\" contains invalid characters",
+								parname)));
+			pfree(parname);
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_replication_origin */
+		rel = table_open(ReplicationOriginRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			char	   *s;
+
+			s = TextDatumGetCString(heap_getattr(tup, Anum_pg_replication_origin_roname,
+												 RelationGetDescr(rel), &isnull));
+			if (!pg_is_ascii(s))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("replication origin \"%s\" contains invalid characters",
+								s),
+						 errhint("Consider recreating the replication origin using ASCII characters.")));
+			pfree(s);
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_shdescription */
+		rel = table_open(SharedDescriptionRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			char	   *s;
+
+			s = TextDatumGetCString(heap_getattr(tup, Anum_pg_shdescription_description,
+												 RelationGetDescr(rel), &isnull));
+			if (!pg_is_ascii(s))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("comment \"%s\" on a shared database object contains invalid characters",
+								s),
+						 errhint("Consider COMMENT ON ... IS ... using ASCII characters.")));
+			pfree(s);
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_shseclabel */
+		rel = table_open(SharedSecLabelRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			char	   *s;
+
+			s = TextDatumGetCString(heap_getattr(tup, Anum_pg_shseclabel_provider,
+												 RelationGetDescr(rel), &isnull));
+			if (!pg_is_ascii(s))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("security label provider name \"%s\" contains invalid characters",
+								s),
+						 errhint("This security label provider cannot be used with CLUSTER CATALOG ENCODING set to ASCII.")));
+			pfree(s);
+
+			s = TextDatumGetCString(heap_getattr(tup, Anum_pg_shseclabel_label,
+												 RelationGetDescr(rel), &isnull));
+			if (!pg_is_ascii(s))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("a security label on a shared database object contains invalid characters"),
+						 errhint("Security labels applied to shared database objects must be representable in the CLUSTER CATALOG ENCODING.")));
+			pfree(s);
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_subscription */
+		rel = table_open(SubscriptionRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			bool		isnull;
+			char	   *name;
+			char	   *s;
+
+			name = NameStr(*DatumGetName(heap_getattr(tup, Anum_pg_subscription_subname,
+													  RelationGetDescr(rel), &isnull)));
+			if (!pg_is_ascii(name))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing subscription name \"%s\" contains invalid characters",
+								name),
+						 errhint("Consider ALTER SUBSCRIPTION ... RENAME TO ... using ASCII characters.")));
+
+			s = TextDatumGetCString(heap_getattr(tup, Anum_pg_subscription_subconninfo,
+												 RelationGetDescr(rel), &isnull));
+			if (!pg_is_ascii(s))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing subscription \"%s\" has connection string \"%s\" containing invalid characters",
+								name, s),
+						 errhint("Consider ALTER SUBSCRIPTION ... CONNECTION ... using ASCII characters.")));
+			pfree(s);
+
+			/*
+			 * subsynccommit, subslotname and suborigin have their own
+			 * validation that requires ASCII, so no check for now.
+			 */
+
+			/* XXX TODO check subpublications, a text[] */
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+
+		/* pg_tablespace */
+		rel = table_open(TableSpaceRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			Form_pg_tablespace ts = (Form_pg_tablespace) GETSTRUCT(tup);
+
+			if (!pg_is_ascii(NameStr(ts->spcname)))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("existing tablespace name \"%s\" contains invalid characters",
+								NameStr(ts->spcname)),
+						 errhint("Consider ALTER TABLESPACE ... RENAME TO ... using ASCII characters.")));
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+	}
+	else
+	{
+		/* Make sure all databases are using this encoding. */
+		rel = table_open(DatabaseRelationId, NoLock);
+		scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+		while ((tup = systable_getnext(scan)))
+		{
+			Form_pg_database db = (Form_pg_database) GETSTRUCT(tup);
+
+			if (db->encoding != encoding)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("database \"%s\" has incompatible encoding %s",
+								NameStr(db->datname),
+								pg_encoding_to_char(db->encoding))));
+		}
+		systable_endscan(scan);
+		table_close(rel, NoLock);
+	}
+
+	/* If we made it this far, we are allowed to change it. */
+	SetClusterCatalogEncoding(encoding);
+}
+
+const char *
+show_cluster_catalog_encoding(void)
+{
+	int			encoding;
+
+	encoding = GetClusterCatalogEncoding();
+	if (encoding == -1)
+		return "UNDEFINED";
+	else if (encoding == PG_SQL_ASCII)
+		return "ASCII";
+	else
+		return "DATABASE";
+}
diff --git a/src/backend/catalog/meson.build b/src/backend/catalog/meson.build
index 2f3ded8a0e7..ba7ac5ce35f 100644
--- a/src/backend/catalog/meson.build
+++ b/src/backend/catalog/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
   'aclchk.c',
   'catalog.c',
   'dependency.c',
+  'encoding.c',
   'heap.c',
   'index.c',
   'indexing.c',
diff --git a/src/backend/catalog/pg_db_role_setting.c b/src/backend/catalog/pg_db_role_setting.c
index 8c20f519fc0..d45bd2053c7 100644
--- a/src/backend/catalog/pg_db_role_setting.c
+++ b/src/backend/catalog/pg_db_role_setting.c
@@ -46,6 +46,11 @@ AlterSetting(Oid databaseid, Oid roleid, VariableSetStmt *setstmt)
 							  NULL, 2, scankey);
 	tuple = systable_getnext(scan);
 
+	if (setstmt->name)
+		ValidateClusterCatalogString(rel, setstmt->name);
+	if (valuestr)
+		ValidateClusterCatalogString(rel, valuestr);
+
 	/*
 	 * There are three cases:
 	 *
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index a45f3bb6b83..09cfc8775f7 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -1038,3 +1038,21 @@ AlterObjectOwner_internal(Oid classId, Oid objectId, Oid new_ownerId)
 
 	table_close(rel, RowExclusiveLock);
 }
+
+/*
+ * Main entry point for ALTER SYSTEM command.
+ */
+void
+AlterSystem(AlterSystemStmt *stmt)
+{
+	if (stmt->setstmt)
+	{
+		/* ALTER SYSTEM [RE]SET ... */
+		AlterSystemSetConfigFile(stmt);
+	}
+	else if (stmt->encoding_name)
+	{
+		/* ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ... */
+		AlterSystemSetClusterCatalogEncoding(stmt->encoding_name);
+	}
+}
diff --git a/src/backend/commands/comment.c b/src/backend/commands/comment.c
index e9d50fc7d87..248208490d1 100644
--- a/src/backend/commands/comment.c
+++ b/src/backend/commands/comment.c
@@ -277,6 +277,9 @@ CreateSharedComments(Oid oid, Oid classoid, const char *comment)
 
 	shdescription = table_open(SharedDescriptionRelationId, RowExclusiveLock);
 
+	if (comment)
+		ValidateClusterCatalogString(shdescription, comment);
+
 	sd = systable_beginscan(shdescription, SharedDescriptionObjIndexId, true,
 							NULL, 2, skey);
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index aa91a396967..f24dbb64c25 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1429,6 +1429,32 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	Assert((dblocprovider != COLLPROVIDER_LIBC && dblocale) ||
 		   (dblocprovider == COLLPROVIDER_LIBC && !dblocale));
 
+	/*
+	 * Check encoding of strings going into shared catalog.  datcollate and
+	 * datctype were already verified as ASCII by check_locale(), so skip them.
+	 */
+	ValidateClusterCatalogString(pg_database_rel, dbname);
+	if (dblocale)
+		ValidateClusterCatalogString(pg_database_rel, dblocale);
+	if (dbicurules)
+		ValidateClusterCatalogString(pg_database_rel, dbicurules);
+	if (dbcollversion)
+		ValidateClusterCatalogString(pg_database_rel, dbcollversion);
+
+	/*
+	 * Check that the new database's encoding is compatible with the shared
+	 * catalogs.
+	 */
+	if (GetClusterCatalogEncoding() != -1 &&
+		GetClusterCatalogEncoding() != PG_SQL_ASCII &&
+		GetClusterCatalogEncoding() != encoding)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("encoding \"%s\" is not compatible with CLUSTER CATALOG ENCODING \"%s\"",
+						pg_encoding_to_char(encoding),
+						pg_encoding_to_char(GetClusterCatalogEncoding())),
+				 errhint("Consider ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII.")));
+
 	/* Form tuple */
 	new_record[Anum_pg_database_oid - 1] = ObjectIdGetDatum(dboid);
 	new_record[Anum_pg_database_datname - 1] =
@@ -1889,6 +1915,8 @@ RenameDatabase(const char *oldname, const char *newname)
 	 */
 	rel = table_open(DatabaseRelationId, RowExclusiveLock);
 
+	ValidateClusterCatalogString(rel, newname);
+
 	if (!get_db_info(oldname, AccessExclusiveLock, &db_id, NULL, NULL, NULL,
 					 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL))
 		ereport(ERROR,
diff --git a/src/backend/commands/seclabel.c b/src/backend/commands/seclabel.c
index 5607273bf9f..81856578806 100644
--- a/src/backend/commands/seclabel.c
+++ b/src/backend/commands/seclabel.c
@@ -363,6 +363,9 @@ SetSharedSecurityLabel(const ObjectAddress *object,
 
 	pg_shseclabel = table_open(SharedSecLabelRelationId, RowExclusiveLock);
 
+	ValidateClusterCatalogString(pg_shseclabel, provider);
+	ValidateClusterCatalogString(pg_shseclabel, label);
+
 	scan = systable_beginscan(pg_shseclabel, SharedSecLabelObjectIndexId, true,
 							  NULL, 3, keys);
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 03e97730e73..2012c73e233 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -552,6 +552,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	bits32		supported_opts;
 	SubOpts		opts = {0};
 	AclResult	aclresult;
+	ListCell   *l;
 
 	/*
 	 * Parse and check options.
@@ -619,6 +620,11 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 
 	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 
+	ValidateClusterCatalogString(rel, stmt->subname);
+	ValidateClusterCatalogString(rel, stmt->conninfo);
+	foreach(l, stmt->publication)
+		ValidateClusterCatalogString(rel, strVal(lfirst(l)));
+
 	/* Check if name is used */
 	subid = GetSysCacheOid2(SUBSCRIPTIONNAME, Anum_pg_subscription_oid,
 							MyDatabaseId, CStringGetDatum(stmt->subname));
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ebbd935b0c..ef9856574f0 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -311,6 +311,8 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	 */
 	rel = table_open(TableSpaceRelationId, RowExclusiveLock);
 
+	ValidateClusterCatalogString(rel, stmt->tablespacename);
+
 	if (IsBinaryUpgrade)
 	{
 		/* Use binary-upgrade override for tablespace oid */
@@ -941,6 +943,8 @@ RenameTableSpace(const char *oldname, const char *newname)
 	/* Search pg_tablespace */
 	rel = table_open(TableSpaceRelationId, RowExclusiveLock);
 
+	ValidateClusterCatalogString(rel, newname);
+
 	ScanKeyInit(&entry[0],
 				Anum_pg_tablespace_spcname,
 				BTEqualStrategyNumber, F_NAMEEQ,
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index e7ade898a47..e91cf231625 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -371,6 +371,8 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
 	pg_authid_rel = table_open(AuthIdRelationId, RowExclusiveLock);
 	pg_authid_dsc = RelationGetDescr(pg_authid_rel);
 
+	ValidateClusterCatalogString(pg_authid_rel, stmt->role);
+
 	if (OidIsValid(get_role_oid(stmt->role, true)))
 		ereport(ERROR,
 				(errcode(ERRCODE_DUPLICATE_OBJECT),
@@ -1350,6 +1352,8 @@ RenameRole(const char *oldname, const char *newname)
 	rel = table_open(AuthIdRelationId, RowExclusiveLock);
 	dsc = RelationGetDescr(rel);
 
+	ValidateClusterCatalogString(rel, newname);
+
 	oldtuple = SearchSysCache1(AUTHNAME, CStringGetDatum(oldname));
 	if (!HeapTupleIsValid(oldtuple))
 		ereport(ERROR,
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 67eb96396af..57495a10d24 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1985,6 +1985,13 @@ VariableShowStmt:
 					n->name = "session_authorization";
 					$$ = (Node *) n;
 				}
+			| SHOW CLUSTER CATALOG_P ENCODING
+				{
+					VariableShowStmt *n = makeNode(VariableShowStmt);
+
+					n->name = "cluster_catalog_encoding";
+					$$ = (Node *) n;
+				}
 			| SHOW ALL
 				{
 					VariableShowStmt *n = makeNode(VariableShowStmt);
@@ -11540,6 +11547,13 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *) n;
 				}
+			| ALTER SYSTEM_P SET CLUSTER CATALOG_P ENCODING TO name
+				{
+					AlterSystemStmt *n = makeNode(AlterSystemStmt);
+
+					n->encoding_name = $8;
+					$$ = (Node *) n;
+				}
 		;
 
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index baf696d8e68..dc4866cff1b 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -286,6 +286,8 @@ replorigin_create(const char *roname)
 
 	rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
 
+	ValidateClusterCatalogString(rel, roname);
+
 	for (roident = InvalidOid + 1; roident < PG_UINT16_MAX; roident++)
 	{
 		bool		nulls[Natts_pg_replication_origin];
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f28bf371059..e6e6e6ab824 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -218,6 +218,8 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 
 		case T_AlterSystemStmt:
 			{
+				AlterSystemStmt *stmt = (AlterSystemStmt *) parsetree;
+
 				/*
 				 * Surprisingly, ALTER SYSTEM meets all our definitions of
 				 * read-only: it changes nothing that affects the output of
@@ -227,8 +229,13 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				 *
 				 * So, despite the fact that it writes to a file, it's read
 				 * only!
+				 *
+				 * XXX The reasoning above applies only to ALTER SYSTEM [RE]SET.
 				 */
-				return COMMAND_IS_STRICTLY_READ_ONLY;
+				if (stmt->setstmt)
+					return COMMAND_IS_STRICTLY_READ_ONLY;
+				else
+					return COMMAND_IS_NOT_READ_ONLY;
 			}
 
 		case T_CallStmt:
@@ -868,7 +875,7 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 
 		case T_AlterSystemStmt:
 			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
-			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
+			AlterSystem((AlterSystemStmt *) parsetree);
 			break;
 
 		case T_VariableSetStmt:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad20..31b012e2926 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -621,6 +621,7 @@ char	   *role_string;
 /* should be static, but guc.c needs to get at this */
 bool		in_hot_standby_guc;
 
+static char *dummy = "";
 
 /*
  * Displayable names for context types (enum GucContext)
@@ -4813,6 +4814,17 @@ struct config_string ConfigureNamesString[] =
 		check_restrict_nonsystem_relation_kind, assign_restrict_nonsystem_relation_kind, NULL
 	},
 
+	{
+		{"cluster_catalog_encoding", PGC_INTERNAL, PRESET_OPTIONS,
+			gettext_noop("The encoding of text in system catalogs that are shared by all databases in the cluster."),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE | GUC_RUNTIME_COMPUTED
+		},
+		&dummy,
+		"",
+		NULL, NULL, show_cluster_catalog_encoding
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783e..1e0e0342727 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -137,6 +137,7 @@ static char *share_path = NULL;
 /* values to be obtained from arguments */
 static char *pg_data = NULL;
 static char *encoding = NULL;
+static char *cluster_catalog_encoding = NULL;
 static char *locale = NULL;
 static char *lc_collate = NULL;
 static char *lc_ctype = NULL;
@@ -173,6 +174,7 @@ static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 /* internal vars */
 static const char *progname;
 static int	encodingid;
+static int	cluster_catalog_encodingid;
 static char *bki_file;
 static char *hba_file;
 static char *ident_file;
@@ -1593,6 +1595,7 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -C %d", cluster_catalog_encodingid);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
@@ -2748,6 +2751,19 @@ setup_locale_encoding(void)
 	else
 		encodingid = get_encoding_id(encoding);
 
+	/* Choose initial value of CLUSTER CATALOG ENCODING. */
+	if (cluster_catalog_encoding == NULL ||
+		pg_strcasecmp(cluster_catalog_encoding, "DATABASE") == 0)
+		cluster_catalog_encodingid = encodingid;
+	else if (pg_strcasecmp(cluster_catalog_encoding, "ASCII") == 0)
+		cluster_catalog_encodingid = PG_SQL_ASCII;
+	else if (pg_strcasecmp(cluster_catalog_encoding, "UNDEFINED") == 0)
+		cluster_catalog_encodingid = -1;
+	printf(_("The initial cluster catalog encoding has been set to \"%s\".\n"),
+		   cluster_catalog_encodingid == -1 ? "UNDEFINED" :
+		   cluster_catalog_encodingid == PG_SQL_ASCII ? "ASCII" :
+		   pg_encoding_to_char(cluster_catalog_encodingid));
+
 	if (!check_locale_encoding(lc_ctype, encodingid) ||
 		!check_locale_encoding(lc_collate, encodingid))
 		exit(1);				/* check_locale_encoding printed the error */
@@ -3146,6 +3162,7 @@ main(int argc, char *argv[])
 	static struct option long_options[] = {
 		{"pgdata", required_argument, NULL, 'D'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"cluster-catalog-encoding", required_argument, NULL, 'C'},
 		{"locale", required_argument, NULL, 1},
 		{"lc-collate", required_argument, NULL, 2},
 		{"lc-ctype", required_argument, NULL, 3},
@@ -3224,7 +3241,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:c:dD:E:gkL:nNsST:U:WX:",
+	while ((c = getopt_long(argc, argv, "A:c:C:dD:E:gkL:nNsST:U:WX:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -3272,6 +3289,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'C':
+				cluster_catalog_encoding = pg_strdup(optarg);
+				break;
 			case 'W':
 				pwprompt = true;
 				break;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca7..91ae8a920ce 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -27,6 +27,7 @@
 #include "common/controldata_utils.h"
 #include "common/logging.h"
 #include "getopt_long.h"
+#include "mb/pg_wchar.h"
 #include "pg_getopt.h"
 
 static void
@@ -325,6 +326,8 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Cluster catalog encoding:             %d\n"),
+		   ControlFile->cluster_catalog_encoding);
 	printf(_("Mock authentication nonce:            %s\n"),
 		   mock_auth_nonce_str);
 	return 0;
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d89..c7d02cdb37d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -779,6 +779,8 @@ PrintControlValues(bool guessed)
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Cluster catalog encoding:             %d\n"),
+		   ControlFile.cluster_catalog_encoding);
 }
 
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 854c6887a23..7b32848707d 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -61,6 +61,7 @@ get_control_data(ClusterInfo *cluster)
 	bool		got_large_object = false;
 	bool		got_date_is_int = false;
 	bool		got_data_checksum_version = false;
+	bool		got_cluster_catalog_encoding = false;
 	bool		got_cluster_state = false;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
@@ -501,6 +502,16 @@ get_control_data(ClusterInfo *cluster)
 			cluster->controldata.data_checksum_version = str2uint(p);
 			got_data_checksum_version = true;
 		}
+		else if ((p = strstr(bufin, "Cluster catalog encoding")) != NULL)
+		{
+			p = strchr(p, ':');
+
+			if (p == NULL || strlen(p) <= 1)
+				pg_fatal("%d: controldata retrieval problem", __LINE__);
+			p++;
+			cluster->controldata.cluster_catalog_encoding = atoi(p);
+			got_cluster_catalog_encoding = true;
+		}
 	}
 
 	rc = pclose(output);
@@ -561,6 +572,11 @@ get_control_data(ClusterInfo *cluster)
 		}
 	}
 
+	if (GET_MAJOR_VERSION(cluster->major_version) < 1800)
+		cluster->controldata.cluster_catalog_encoding = -1; /* UNDEFINED */
+	else if (!got_cluster_catalog_encoding)
+		pg_fatal("Expected cluster catalog encoding from version 18+");
+
 	/* verify that we got all the mandatory pg_control data */
 	if (!got_xid || !got_oid ||
 		!got_multi || !got_oldestxid ||
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f8..d79f7c83802 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -467,6 +467,23 @@ set_locale_and_encoding(void)
 	PQfreemem(datctype_literal);
 	PQfreemem(datlocale_literal);
 
+	/*
+	 * Copy the CLUSTER CATALOG ENCODING.  This will be UNDEFINED if the
+	 * source database is from before v18.
+	 */
+	if (old_cluster.controldata.cluster_catalog_encoding == -1)
+		PQclear(executeQueryOrDie(conn_new_template1,
+								  "ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO UNDEFINED"));
+	else if (old_cluster.controldata.cluster_catalog_encoding == 0)
+		PQclear(executeQueryOrDie(conn_new_template1,
+								  "ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII"));
+	else if (old_cluster.controldata.cluster_catalog_encoding != locale->db_encoding)
+		pg_fatal("Source database has template0 encoding %d, but cluster_catalog_encoding %d",
+				 locale->db_encoding, old_cluster.controldata.cluster_catalog_encoding);
+	else
+		PQclear(executeQueryOrDie(conn_new_template1,
+								  "ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE"));
+
 	PQfinish(conn_new_template1);
 
 	check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4b..e5baf647051 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -245,6 +245,7 @@ typedef struct
 	bool		date_is_int;
 	bool		float8_pass_by_value;
 	uint32		data_checksum_version;
+	int			cluster_catalog_encoding;
 } ControlData;
 
 /*
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index bbd08770c3d..ad70f7f8340 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -2551,11 +2551,18 @@ match_previous_words(int pattern_id,
 	/* ALTER SYSTEM SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
 		COMPLETE_WITH("SET", "RESET");
-	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
+	else if (Matches("ALTER", "SYSTEM", "SET"))
+		COMPLETE_WITH_QUERY_VERBATIM_PLUS(Query_for_list_of_alter_system_set_vars,
+										  "CLUSTER CATALOG ENCODING TO");
+	else if (Matches("ALTER", "SYSTEM", "RESET"))
 		COMPLETE_WITH_QUERY_VERBATIM_PLUS(Query_for_list_of_alter_system_set_vars,
 										  "ALL");
+	else if (Matches("ALTER", "SYSTEM", "SET", "CLUSTER"))
+		COMPLETE_WITH("TO", "CATALOG ENCODING TO");
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
 		COMPLETE_WITH("TO");
+	else if (Matches("ALTER", "SYSTEM", "SET", "CLUSTER", "CATALOG", "ENCODING", "TO"))
+		COMPLETE_WITH("DATABASE", "ASCII", "UNDEFINED");
 	/* ALTER VIEW <name> */
 	else if (Matches("ALTER", "VIEW", MatchAny))
 		COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME", "RESET", "SET");
@@ -4882,10 +4889,13 @@ match_previous_words(int pattern_id,
 										  "ALL");
 	else if (Matches("SHOW"))
 		COMPLETE_WITH_QUERY_VERBATIM_PLUS(Query_for_list_of_show_vars,
+										  "CLUSTER CATALOG ENCODING",
 										  "SESSION AUTHORIZATION",
 										  "ALL");
 	else if (Matches("SHOW", "SESSION"))
 		COMPLETE_WITH("AUTHORIZATION");
+	else if (Matches("SHOW", "CLUSTER"))
+		COMPLETE_WITH("CATALOG ENCODING");
 	/* Complete "SET TRANSACTION" */
 	else if (Matches("SET", "TRANSACTION"))
 		COMPLETE_WITH("SNAPSHOT", "ISOLATION LEVEL", "READ", "DEFERRABLE", "NOT DEFERRABLE");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 4a0e2c883a1..a2e0d0944f7 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -12,8 +12,10 @@ program_help_ok('createdb');
 program_version_ok('createdb');
 program_options_handling_ok('createdb');
 
+# Because we're using different encodings in the same cluster, we need shared
+# catalog encoding set to ASCII for this test.
 my $node = PostgreSQL::Test::Cluster->new('main');
-$node->init;
+$node->init(extra => [ '--cluster-catalog-encoding=ASCII' ]);
 $node->start;
 
 $node->issues_sql_like(
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067b..ef686479415 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -234,7 +234,8 @@ extern bool DataChecksumsEnabled(void);
 extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
 extern Size XLOGShmemSize(void);
 extern void XLOGShmemInit(void);
-extern void BootStrapXLOG(uint32 data_checksum_version);
+extern void BootStrapXLOG(int cluster_catalog_encoding,
+						  uint32 data_checksum_version);
 extern void InitializeWalConsistencyChecking(void);
 extern void LocalProcessControlFile(bool reset);
 extern WalLevel GetActiveWalLevelOnStandby(void);
@@ -253,6 +254,8 @@ extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern TimeLineID GetWALInsertionTimeLineIfSet(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern int	GetClusterCatalogEncoding(void);
+extern void SetClusterCatalogEncoding(int encoding);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index d9cf51a0f9f..6bef50c14ec 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -305,6 +305,12 @@ typedef struct xl_end_of_recovery
 	int			wal_level;
 } xl_end_of_recovery;
 
+/* Change of CLUSTER CATALOG ENCODING. */
+typedef struct xl_cluster_catalog_encoding_change
+{
+	int			encoding;
+} xl_cluster_catalog_encoding_change;
+
 /*
  * The functions in xloginsert.c construct a chain of XLogRecData structs
  * to represent the final WAL record.
diff --git a/src/include/catalog/catalog.h b/src/include/catalog/catalog.h
index a8dd304b1ad..5ebd62319c0 100644
--- a/src/include/catalog/catalog.h
+++ b/src/include/catalog/catalog.h
@@ -43,5 +43,8 @@ extern Oid	GetNewOidWithIndex(Relation relation, Oid indexId,
 extern RelFileNumber GetNewRelFileNumber(Oid reltablespace,
 										 Relation pg_class,
 										 char relpersistence);
+extern void ValidateClusterCatalogString(Relation rel, const char *s);
+extern void AlterSystemSetClusterCatalogEncoding(const char *encoding_name);
+extern const char *show_cluster_catalog_encoding(void);
 
 #endif							/* CATALOG_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e4140..96ad1007a65 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -77,7 +77,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY			0x90
 #define XLOG_FPI_FOR_HINT				0xA0
 #define XLOG_FPI						0xB0
-/* 0xC0 is used in Postgres 9.5-11 */
+#define XLOG_CLUSTER_CATALOG_ENCODING_CHANGE	0xC0
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
 #define XLOG_CHECKPOINT_REDO			0xE0
 
@@ -221,6 +221,9 @@ typedef struct ControlFileData
 	/* Are data pages protected by checksums? Zero if no checksum version */
 	uint32		data_checksum_version;
 
+	/* A pg_enc value, or -1 for UNDEFINED. */
+	int			cluster_catalog_encoding;
+
 	/*
 	 * Random nonce, used in authentication requests that need to proceed
 	 * based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/commands/alter.h b/src/include/commands/alter.h
index f00af75beff..295d6726ebc 100644
--- a/src/include/commands/alter.h
+++ b/src/include/commands/alter.h
@@ -31,4 +31,6 @@ extern ObjectAddress ExecAlterOwnerStmt(AlterOwnerStmt *stmt);
 extern void AlterObjectOwner_internal(Oid classId, Oid objectId,
 									  Oid new_ownerId);
 
+extern void AlterSystem(AlterSystemStmt *stmt);
+
 #endif							/* ALTER_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0f9462493e3..5698783571f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3841,6 +3841,7 @@ typedef struct AlterSystemStmt
 {
 	NodeTag		type;
 	VariableSetStmt *setstmt;	/* SET subcommand */
+	const char *encoding_name;	/* CLUSTER CATALOG ENCODING subcommand */
 } AlterSystemStmt;
 
 /* ----------------------
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c0d3cf0e14b..5e8cf098185 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
+		  encoding \
 		  libpq_pipeline \
 		  plsample \
 		  spgist_name_ops \
diff --git a/src/test/modules/encoding/Makefile b/src/test/modules/encoding/Makefile
new file mode 100644
index 00000000000..2e4d04a7bea
--- /dev/null
+++ b/src/test/modules/encoding/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/encoding/Makefile
+
+REGRESS = cluster_catalog_encoding
+
+NO_INSTALLCHECK = 1
+ENCODING = UTF8
+NO_LOCALE = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/encoding
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+endif
diff --git a/src/test/modules/encoding/expected/cluster_catalog_encoding.out b/src/test/modules/encoding/expected/cluster_catalog_encoding.out
new file mode 100644
index 00000000000..c575241c896
--- /dev/null
+++ b/src/test/modules/encoding/expected/cluster_catalog_encoding.out
@@ -0,0 +1,208 @@
+-- Exercise the ValidateClusterCatalogString() calls that should cover all
+-- entry points (a few cases have ASCII-only validation of their own and give
+-- slightly different error messages).
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+SHOW CLUSTER CATALOG ENCODING;
+ cluster_catalog_encoding 
+--------------------------
+ ASCII
+(1 row)
+
+-- pg_authid
+CREATE USER regress_astérix;
+ERROR:  the string "regress_astérix" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE USER regress_fred;
+ALTER USER regress_fred RENAME TO regress_astérix;
+ERROR:  the string "regress_astérix" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+DROP USER regress_fred;
+-- pg_database
+CREATE DATABASE regression_café;
+ERROR:  the string "regression_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+ALTER DATABASE template1 RENAME TO regression_café;
+ERROR:  the string "regression_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE DATABASE regression_ok TEMPLATE template0 LOCALE 'français';
+WARNING:  locale name "français" contains non-ASCII characters
+ERROR:  invalid LC_COLLATE locale name: "français"
+HINT:  If the locale name is specific to ICU, use ICU_LOCALE.
+-- pg_db_role_setting
+CREATE USER regress_fred;
+ALTER ROLE regress_fred SET application_name TO 'café';
+ERROR:  the string "café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+DROP USER regress_fred;
+-- pg_parameter_acl
+-- XXX
+-- pg_replication_origin
+SELECT pg_replication_origin_create('regress_café');
+ERROR:  the string "regress_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+-- pg_shdescription
+COMMENT ON DATABASE template0 IS 'café';
+ERROR:  the string "café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+-- non-shared objects are OK, because non-shared catalog
+COMMENT ON TABLE pg_catalog.pg_class IS 'café';
+COMMENT ON TABLE pg_catalog.pg_class IS NULL;
+-- pg_shseclabel
+-- XXX
+-- pg_subscription
+CREATE SUBSCRIPTION regress_café CONNECTION 'dbname=crême' PUBLICATION brûlée;
+ERROR:  the string "regress_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=crême' PUBLICATION brûlée;
+ERROR:  the string "dbname=crême" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION brûlée;
+ERROR:  the string "brûlée" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (slot_name = 'café');
+ERROR:  replication slot name "café" contains invalid character
+HINT:  Replication slot names may only contain lower case letters, numbers, and the underscore character.
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (synchronous_commit = 'café');
+ERROR:  invalid value for parameter "synchronous_commit": "café"
+HINT:  Available values: local, remote_write, remote_apply, on, off.
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (origin = 'café');
+ERROR:  unrecognized origin value: "café"
+-- pg_tablespace
+SET allow_in_place_tablespaces = 'on';
+CREATE TABLESPACE regress_café LOCATION '';
+ERROR:  the string "regress_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+CREATE TABLESPACE regress_ok LOCATION '';
+ALTER TABLESPACE regress_ok RENAME TO regress_café;
+ERROR:  the string "regress_café" contains invalid characters
+DETAIL:  CLUSTER CATALOG ENCODING is set to ASCII.
+HINT:  Consider ALTER SYSTEM SET CATALOG ENCODING TO DATABASE.
+DROP TABLESPACE regress_ok;
+-- Check that we can create a new database with a different encoding,
+-- while the shared catalog encoding is ASCII
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+-- Check that we can't change the shared catalog encoding to UTF8, because that
+-- LATIN1 database is in the way, then drop it so we can.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+ERROR:  database "regression_latin1" has incompatible encoding LATIN1
+DROP DATABASE regression_latin1;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+-- Test that we can now do each of those things that failed before, and that
+-- those things block us from going back to ASCII.
+-- pg_authid
+CREATE USER regress_astérix;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  existing role name "regress_astérix" contains invalid characters
+HINT:  Consider ALTER ROLE ... RENAME TO ... using ASCII characters.
+DROP USER regress_astérix;
+-- pg_database
+CREATE DATABASE regression_café;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  existing database name "regression_café" contains invalid characters
+HINT:  Consider ALTER DATABASE ... RENAME TO ... using ASCII characters.
+DROP DATABASE regression_café;
+-- but we can't make a LATIN1 database while we have UTF8 catalogs
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+ERROR:  encoding "LATIN1" is not compatible with CLUSTER CATALOG ENCODING "UTF8"
+HINT:  Consider ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII.
+-- pg_db_role_setting
+CREATE USER regress_fred;
+ALTER ROLE regress_fred SET application_name TO 'café';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  role "regress_fred" has setting "application_name" with value "café" that contains invalid characters
+HINT:  Consider ALTER ROLE ... SET ... TO ... using ASCII characters.
+DROP USER regress_fred;
+-- pg_parameter_acl
+-- XXX
+-- pg_replication_origin
+SELECT pg_replication_origin_create('regress_café');
+ pg_replication_origin_create 
+------------------------------
+                            1
+(1 row)
+
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  replication origin "regress_café" contains invalid characters
+HINT:  Consider recreating the replication origin using ASCII characters.
+SELECT pg_replication_origin_drop('regress_café');
+ pg_replication_origin_drop 
+----------------------------
+ 
+(1 row)
+
+-- pg_shdescription
+COMMENT ON DATABASE template0 IS 'café';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  comment "café" on a shared database object contains invalid characters
+HINT:  Consider COMMENT ON ... IS ... using ASCII characters.
+COMMENT ON DATABASE template0 IS 'unmodifiable empty database';
+-- pg_shseclabel
+-- XXX
+-- pg_subscription
+-- XXX
+-- pg_tablespace
+CREATE TABLESPACE regress_café LOCATION '';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  existing tablespace name "regress_café" contains invalid characters
+HINT:  Consider ALTER TABLESPACE ... RENAME TO ... using ASCII characters.
+DROP TABLESPACE regress_café;
+-- We dropped everything that was in the way, so we should be able to go back
+-- to ASCII now.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+-- Try out UNDEFINED mode, which is the only way to have a non-ASCII database
+-- name and mutiple encodings at the same time
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO UNDEFINED;
+SHOW CLUSTER CATALOG ENCODING;
+ cluster_catalog_encoding 
+--------------------------
+ UNDEFINED
+(1 row)
+
+CREATE DATABASE regression_café ENCODING UTF8;
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+-- We can't switch to ASCII from this state
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  existing database name "regression_café" contains invalid characters
+HINT:  Consider ALTER DATABASE ... RENAME TO ... using ASCII characters.
+SHOW CLUSTER CATALOG ENCODING;
+ cluster_catalog_encoding 
+--------------------------
+ UNDEFINED
+(1 row)
+
+-- We also can't switch to UTF8 from this state
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+ERROR:  database "regression_latin1" has incompatible encoding LATIN1
+SHOW CLUSTER CATALOG ENCODING;
+ cluster_catalog_encoding 
+--------------------------
+ UNDEFINED
+(1 row)
+
+-- If we get rid of the LATIN1 database, we can go to UTF8
+DROP DATABASE regression_latin1;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+SHOW CLUSTER CATALOG ENCODING;
+ cluster_catalog_encoding 
+--------------------------
+ DATABASE
+(1 row)
+
+-- We still can't go back to ASCII unless we also get rid of the non-ASCII
+-- database name.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+ERROR:  existing database name "regression_café" contains invalid characters
+HINT:  Consider ALTER DATABASE ... RENAME TO ... using ASCII characters.
+DROP DATABASE regression_café;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
diff --git a/src/test/modules/encoding/meson.build b/src/test/modules/encoding/meson.build
new file mode 100644
index 00000000000..5f2254c87d8
--- /dev/null
+++ b/src/test/modules/encoding/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'encoding',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'cluster_catalog_encoding',
+    ],
+    'regress_args': ['--no-locale', '--encoding=UTF8'],
+    'runningcheck': false,
+  },
+}
diff --git a/src/test/modules/encoding/sql/cluster_catalog_encoding.sql b/src/test/modules/encoding/sql/cluster_catalog_encoding.sql
new file mode 100644
index 00000000000..983c2fdfd27
--- /dev/null
+++ b/src/test/modules/encoding/sql/cluster_catalog_encoding.sql
@@ -0,0 +1,137 @@
+-- Exercise the ValidateClusterCatalogString() calls that should cover all
+-- entry points (a few cases have ASCII-only validation of their own and give
+-- slightly different error messages).
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+SHOW CLUSTER CATALOG ENCODING;
+
+-- pg_authid
+CREATE USER regress_astérix;
+CREATE USER regress_fred;
+ALTER USER regress_fred RENAME TO regress_astérix;
+DROP USER regress_fred;
+
+-- pg_database
+CREATE DATABASE regression_café;
+ALTER DATABASE template1 RENAME TO regression_café;
+CREATE DATABASE regression_ok TEMPLATE template0 LOCALE 'français';
+
+-- pg_db_role_setting
+CREATE USER regress_fred;
+ALTER ROLE regress_fred SET application_name TO 'café';
+DROP USER regress_fred;
+
+-- pg_parameter_acl
+-- XXX
+
+-- pg_replication_origin
+SELECT pg_replication_origin_create('regress_café');
+
+-- pg_shdescription
+COMMENT ON DATABASE template0 IS 'café';
+-- non-shared objects are OK, because non-shared catalog
+COMMENT ON TABLE pg_catalog.pg_class IS 'café';
+COMMENT ON TABLE pg_catalog.pg_class IS NULL;
+
+-- pg_shseclabel
+-- XXX
+
+-- pg_subscription
+CREATE SUBSCRIPTION regress_café CONNECTION 'dbname=crême' PUBLICATION brûlée;
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=crême' PUBLICATION brûlée;
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION brûlée;
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (slot_name = 'café');
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (synchronous_commit = 'café');
+CREATE SUBSCRIPTION regress_ok   CONNECTION 'dbname=ok'    PUBLICATION ok      WITH (origin = 'café');
+
+-- pg_tablespace
+SET allow_in_place_tablespaces = 'on';
+CREATE TABLESPACE regress_café LOCATION '';
+CREATE TABLESPACE regress_ok LOCATION '';
+ALTER TABLESPACE regress_ok RENAME TO regress_café;
+DROP TABLESPACE regress_ok;
+
+-- Check that we can create a new database with a different encoding,
+-- while the shared catalog encoding is ASCII
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+
+-- Check that we can't change the shared catalog encoding to UTF8, because that
+-- LATIN1 database is in the way, then drop it so we can.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+DROP DATABASE regression_latin1;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+
+-- Test that we can now do each of those things that failed before, and that
+-- those things block us from going back to ASCII.
+
+-- pg_authid
+CREATE USER regress_astérix;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+DROP USER regress_astérix;
+
+-- pg_database
+CREATE DATABASE regression_café;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+DROP DATABASE regression_café;
+-- but we can't make a LATIN1 database while we have UTF8 catalogs
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+
+-- pg_db_role_setting
+CREATE USER regress_fred;
+ALTER ROLE regress_fred SET application_name TO 'café';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+DROP USER regress_fred;
+
+-- pg_parameter_acl
+-- XXX
+
+-- pg_replication_origin
+SELECT pg_replication_origin_create('regress_café');
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+SELECT pg_replication_origin_drop('regress_café');
+
+-- pg_shdescription
+COMMENT ON DATABASE template0 IS 'café';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+COMMENT ON DATABASE template0 IS 'unmodifiable empty database';
+
+-- pg_shseclabel
+-- XXX
+
+-- pg_subscription
+-- XXX
+
+-- pg_tablespace
+CREATE TABLESPACE regress_café LOCATION '';
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+DROP TABLESPACE regress_café;
+
+-- We dropped everything that was in the way, so we should be able to go back
+-- to ASCII now.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+
+-- Try out UNDEFINED mode, which is the only way to have a non-ASCII database
+-- name and mutiple encodings at the same time
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO UNDEFINED;
+SHOW CLUSTER CATALOG ENCODING;
+CREATE DATABASE regression_café ENCODING UTF8;
+CREATE DATABASE regression_latin1 TEMPLATE template0 LOCALE 'C' ENCODING LATIN1;
+
+-- We can't switch to ASCII from this state
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+SHOW CLUSTER CATALOG ENCODING;
+
+-- We also can't switch to UTF8 from this state
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+SHOW CLUSTER CATALOG ENCODING;
+
+-- If we get rid of the LATIN1 database, we can go to UTF8
+DROP DATABASE regression_latin1;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO DATABASE;
+SHOW CLUSTER CATALOG ENCODING;
+
+-- We still can't go back to ASCII unless we also get rid of the non-ASCII
+-- database name.
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+DROP DATABASE regression_café;
+ALTER SYSTEM SET CLUSTER CATALOG ENCODING TO ASCII;
+
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b619530..1e4dc8b3bfd 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -5,6 +5,7 @@ subdir('commit_ts')
 subdir('delay_execution')
 subdir('dummy_index_am')
 subdir('dummy_seclabel')
+subdir('encoding')
 subdir('gin')
 subdir('injection_points')
 subdir('ldap_password_func')
diff --git a/src/test/regress/expected/database.out b/src/test/regress/expected/database.out
index 454db91ec09..04279a2870c 100644
--- a/src/test/regress/expected/database.out
+++ b/src/test/regress/expected/database.out
@@ -1,15 +1,15 @@
 CREATE DATABASE regression_tbd
-	ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
-ALTER DATABASE regression_tbd RENAME TO regression_utf8;
-ALTER DATABASE regression_utf8 SET TABLESPACE regress_tblspace;
-ALTER DATABASE regression_utf8 RESET TABLESPACE;
-ALTER DATABASE regression_utf8 CONNECTION_LIMIT 123;
+	LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
+ALTER DATABASE regression_tbd RENAME TO regression_xyz;
+ALTER DATABASE regression_xyz SET TABLESPACE regress_tblspace;
+ALTER DATABASE regression_xyz RESET TABLESPACE;
+ALTER DATABASE regression_xyz CONNECTION_LIMIT 123;
 -- Test PgDatabaseToastTable.  Doing this with GRANT would be slow.
 BEGIN;
 UPDATE pg_database
 SET datacl = array_fill(makeaclitem(10, 10, 'USAGE', false), ARRAY[5e5::int])
-WHERE datname = 'regression_utf8';
+WHERE datname = 'regression_xyz';
 -- load catcache entry, if nothing else does
-ALTER DATABASE regression_utf8 RESET TABLESPACE;
+ALTER DATABASE regression_xyz RESET TABLESPACE;
 ROLLBACK;
-DROP DATABASE regression_utf8;
+DROP DATABASE regression_xyz;
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 0e40ed32a21..8a405f27cbc 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2341,6 +2341,8 @@ regression_main(int argc, char *argv[],
 				appendStringInfoString(&cmd, " --debug");
 			if (nolocale)
 				appendStringInfoString(&cmd, " --no-locale");
+			if (encoding)
+				appendStringInfo(&cmd, " --encoding %s", encoding);
 			if (initdb_extra_opts_env)
 				appendStringInfo(&cmd, " %s", initdb_extra_opts_env);
 			appendStringInfo(&cmd, " > \"%s/log/initdb.log\" 2>&1", outputdir);
diff --git a/src/test/regress/sql/database.sql b/src/test/regress/sql/database.sql
index 0367c0e37ab..dd84f866e4a 100644
--- a/src/test/regress/sql/database.sql
+++ b/src/test/regress/sql/database.sql
@@ -1,17 +1,17 @@
 CREATE DATABASE regression_tbd
-	ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
-ALTER DATABASE regression_tbd RENAME TO regression_utf8;
-ALTER DATABASE regression_utf8 SET TABLESPACE regress_tblspace;
-ALTER DATABASE regression_utf8 RESET TABLESPACE;
-ALTER DATABASE regression_utf8 CONNECTION_LIMIT 123;
+	LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
+ALTER DATABASE regression_tbd RENAME TO regression_xyz;
+ALTER DATABASE regression_xyz SET TABLESPACE regress_tblspace;
+ALTER DATABASE regression_xyz RESET TABLESPACE;
+ALTER DATABASE regression_xyz CONNECTION_LIMIT 123;
 
 -- Test PgDatabaseToastTable.  Doing this with GRANT would be slow.
 BEGIN;
 UPDATE pg_database
 SET datacl = array_fill(makeaclitem(10, 10, 'USAGE', false), ARRAY[5e5::int])
-WHERE datname = 'regression_utf8';
+WHERE datname = 'regression_xyz';
 -- load catcache entry, if nothing else does
-ALTER DATABASE regression_utf8 RESET TABLESPACE;
+ALTER DATABASE regression_xyz RESET TABLESPACE;
 ROLLBACK;
 
-DROP DATABASE regression_utf8;
+DROP DATABASE regression_xyz;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2d4c870423a..f2984dcb876 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4087,6 +4087,7 @@ xl_btree_unlink_page
 xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
+xl_cluster_catalog_encoding_change
 xl_commit_ts_truncate
 xl_dbase_create_file_copy_rec
 xl_dbase_create_wal_log_rec
-- 
2.39.5 (Apple Git-154)

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Thomas Munro (#1)

Re: Giving the shared catalogues a defined encoding

Thomas Munro <thomas.munro@gmail.com> writes:

Problem #1: You can have two databases with different encodings, and
they both pretend that pg_database, pg_authid, pg_db_role_setting etc
are in the local database encoding. That doesn't work too well:
non-ASCII text can be reinterpreted in the wrong encoding.

There's no problem if you only use one encoding everywhere (probably
UTF8). There's also no problem if you use multiple database
encodings, but put only ASCII in the shared catalogues (because ASCII
is a subset of every supported server encoding). This patch is about
formalising and enforcing those two working arrangements, hopefully
invisibly to most users. There's still an escape hatch mode if you
need it, e.g. for a non-conforming pg_upgrade'd system.

Over in the discussion of bug #18735, I've come to the realization
that these problems apply equally to the filesystem path names that
the server deals with: not only the data directory path, but the
path to the installation files [1]. Can we apply the same sort of
restrictions to those? I'm envisioning that initdb would check
either encoding-validity or all-ASCII-ness of those path names
depending on which mode it's setting the server up in.
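
To make the two checks concrete, here is a minimal C sketch, assuming the cluster is either in ASCII mode or uses UTF8 as its single encoding. The helper names path_is_all_ascii() and path_is_valid_utf8() are hypothetical and do not come from the patch; a real implementation would presumably go through the server's existing per-encoding verification routines rather than hand-rolling UTF-8 validation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the path contains only 7-bit ASCII bytes. */
static bool
path_is_all_ascii(const char *path)
{
	const unsigned char *p;

	for (p = (const unsigned char *) path; *p != '\0'; p++)
	{
		if (*p > 0x7F)
			return false;
	}
	return true;
}

/*
 * Hypothetical helper: true if the path is well-formed UTF-8.  Rejects
 * truncated sequences, overlong forms, surrogates and values above U+10FFFF.
 */
static bool
path_is_valid_utf8(const char *path)
{
	/* smallest code point that legitimately needs 1, 2, 3 or 4 bytes */
	static const unsigned int min_cp[] = {0x0, 0x80, 0x800, 0x10000};
	const unsigned char *p = (const unsigned char *) path;

	while (*p != '\0')
	{
		unsigned int cp;
		int			extra;
		int			i;

		if (*p < 0x80)
		{
			p++;				/* plain ASCII byte */
			continue;
		}
		else if ((*p & 0xE0) == 0xC0)
		{
			extra = 1;
			cp = *p & 0x1F;
		}
		else if ((*p & 0xF0) == 0xE0)
		{
			extra = 2;
			cp = *p & 0x0F;
		}
		else if ((*p & 0xF8) == 0xF0)
		{
			extra = 3;
			cp = *p & 0x07;
		}
		else
			return false;		/* stray continuation byte or 0xF8..0xFF */

		p++;
		for (i = 0; i < extra; i++)
		{
			if ((*p & 0xC0) != 0x80)
				return false;	/* also catches the terminating NUL */
			cp = (cp << 6) | (*p & 0x3F);
			p++;
		}

		if (cp < min_cp[extra] || cp > 0x10FFFF ||
			(cp >= 0xD800 && cp <= 0xDFFF))
			return false;
	}
	return true;
}
```

Under these assumptions, a path ending in the bytes 63 61 66 C3 A9 ("café" in UTF-8) passes the UTF-8 check but fails the ASCII check, while the LATIN1 spelling 63 61 66 E9 fails both.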

The patch invents a new setting CLUSTER CATALOG ENCODING, which can be
inspected with SHOW and changed with ALTER SYSTEM.

Changing the catalog encoding would also have to re-verify the
suitability of the paths. Of course this isn't 100% bulletproof
since someone could rename those directories later. But I think
that's in "if you break it you get to keep both pieces" territory.

regards, tom lane

[1]: /messages/by-id/2840430.1733510664@sss.pgh.pa.us

Thomas Munro

thomas.munro@gmail.com

about 1 year ago

In reply to: Tom Lane (#2)

Re: Giving the shared catalogues a defined encoding

On Sat, Dec 7, 2024 at 7:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over in the discussion of bug #18735, I've come to the realization
that these problems apply equally to the filesystem path names that
the server deals with: not only the data directory path, but the
path to the installation files [1]. Can we apply the same sort of
restrictions to those? I'm envisioning that initdb would check
either encoding-validity or all-ASCII-ness of those path names
depending on which mode it's setting the server up in.

Rabbit hole engaged. I am working on a generalisation, renamed to
just CLUSTER ENCODING, covering all text that is not in a database but
might be visible through database glasses (and probably more). Looking
into pathnames and GUCs now. More soon.

Thomas Munro

thomas.munro@gmail.com

about 1 year ago

In reply to: Tom Lane (#2)

Re: Giving the shared catalogues a defined encoding

On Sat, Dec 7, 2024 at 7:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over in the discussion of bug #18735, I've come to the realization
that these problems apply equally to the filesystem path names that
the server deals with: not only the data directory path, but the
path to the installation files [1]. Can we apply the same sort of
restrictions to those? I'm envisioning that initdb would check
either encoding-validity or all-ASCII-ness of those path names
depending on which mode it's setting the server up in.

Here are some things I have learned about pathname encoding:

* Some systems enforce an encoding: macOS always requires UTF-8, ext4
does too if you turn on case insensitivity, zfs has a utf8only option,
and a few less interesting-to-us ones have relevant mount options. On
the first three at least: open("cafe\xe9", ...) -> EILSEQ, independent
of user space notions like locales.

* Implications of such a system with non-ASCII data directory:
  * there is only one valid configuration at initdb time:
    --encoding=UTF8 --cluster-encoding=DATABASE, which is probably the
    default anyway, so doesn't need to be spelled out
  * --cluster-encoding=ASCII would fail with the attached patch
  * --encoding=EUCXXX/MULE --cluster-encoding=DATABASE might fail if
    those encodings have picky verification
  * --encoding=LATIN1 --cluster-encoding=DATABASE would be wrong but
    bogusly pass verification (LATIN1 and other single-byte encodings
    have no invalid sequences), and SHOW data_directory would expose the
    familiar UTF8-through-LATIN1 glasses distortion "café"→"cafÃ©".

All of that is perfectly reasonable I think, I just want to highlight
the cascading effect of the new constraint: Apple's file system
restricts your *database* encoding, with this design, unless you stick
to plain ASCII pathnames. It is an interesting case to compare with
when untangling the Windows mess, see below...
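
The "café"→"cafÃ©" distortion can be reproduced in a few lines of C. This is only an illustrative sketch, not code from the patch: latin1_to_utf8() is a hypothetical helper that treats its input bytes as LATIN1 and emits UTF-8, which is roughly what happens when UTF-8 path bytes are wrongly assumed to be LATIN1 and then converted for a UTF-8 client.

```c
#include <stddef.h>

/*
 * Hypothetical helper: interpret src as LATIN1 and write its UTF-8 form to
 * dst.  Returns the number of bytes written (excluding the NUL terminator),
 * or (size_t) -1 if dst is too small.
 */
static size_t
latin1_to_utf8(const char *src, char *dst, size_t dstlen)
{
	const unsigned char *s = (const unsigned char *) src;
	size_t		n = 0;

	if (dstlen == 0)
		return (size_t) -1;

	for (; *s != '\0'; s++)
	{
		/* LATIN1 bytes 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes */
		size_t		need = (*s < 0x80) ? 1 : 2;

		/* keep one byte in reserve for the NUL terminator */
		if (n + need + 1 > dstlen)
			return (size_t) -1;

		if (*s < 0x80)
			dst[n++] = (char) *s;
		else
		{
			dst[n++] = (char) (0xC0 | (*s >> 6));
			dst[n++] = (char) (0x80 | (*s & 0x3F));
		}
	}
	dst[n] = '\0';
	return n;
}
```

Feeding it the UTF-8 bytes of "café" (63 61 66 C3 A9) yields 63 61 66 C3 83 C2 A9, which a UTF-8 terminal displays as "cafÃ©". No step in that chain ever sees an invalid sequence, which is why verification alone cannot catch the mismatch.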

* Traditional Unix filesystems eg ext4/xfs/ufs/... just don't care:
beyond '/' being special, the encoding is in the eye of the beholder.
(According to POSIX 2024 you shouldn't have <newline> in a path
component either, see Austin Group issue #251 and others about
attempts to screw things down a bit and document EILSEQ in various
interfaces). That's cool, just make sure --encoding matches what
you're actually using, accept default --cluster-encoding=DATABASE, and
everything should be OK.

* Windows has a completely different model. Pathnames are really
UTF-16 in the kernel and on disk. All char strings exchanged with the
system have a defined encoding, but it was non-obvious to this humble
Unix hacker what it is in each case. I don't have Windows, so I spent
the afternoon firing test code at CI[1][2] to figure some of it out.
Some cases relevant to initdb: environ[] and argv[] are in ACP
encoding, even if the parent process used a different encoding by
calling setlocale() before putenv() or system() etc, or used the
UTF-16 variants _wputenv() or _wsystem(). You can also access them as
UTF-16 if you want. So that's how the data directory pathname arrives
into initdb/postgres. Then to use it, the filesystem functions seem
to be in two classes: the POSIXoid ones in the C runtime with
lowercase names like mkdir() are affected by setlocale() and use its
encoding, while the NT native ones with CamelCase like CreateFile()
don't care about locales and keep using the ACP. That sounded like a
problem because we mix them: our open() is really a wrapper on
CreateFile() and yet elsewhere we also use unlink() etc, but
fortunately we keep the server locked in "C" locale and then the
lowercase ones appear to use the ACP in that case anyway, so the
difference is not revealed (it might upset frontend programs though?).

* Consequence: It is senseless to check if getenv("PGDATA") or
argv[]-supplied paths can be validated with the cluster encoding, on
Windows. The thing to do instead would be to check if they can be
converted from ACP to the cluster encoding, and then store the
converted version for display as SHOW data_directory, but keep the ACP
versions we need for filesystem APIs, and probably likewise for lots
of other things, and then plug all the bugs and edge cases that fall
out of that, for the rest of time... or adopt wchar_t interfaces
everywhere perhaps via wrappers that use database encoding... or other
variations which all sound completely out of scope...

* What I'm wondering is whether we can instead achieve coherence along
the lines of the Apple UTF-8 case I described above, but with an extra
step: if you want to use non-ASCII paths *you have to make your ACP
match the database and cluster encoding*. So either you go all-in on
your favourite 80s encoding like WIN1252 that matches your ACP
(impossible for 932 AKA SJIS), or you switch your system's ACP to
UTF-8. Alternatively, I believe postgres.exe could even be built in a
way that makes its ACP always UTF-8[3] (I guess the loader has to help
with that as it affects the way it sets up environ[] and argv[] before
main() runs). I don't know all the consequences though. And I don't
know what exact rules would be best, but something like that would be
in keeping with the general philosophy of this project: just figure
out how to block the combinations that don't work correctly.

Changing the catalog encoding would also have to re-verify the
suitability of the paths.

Yeah, my current development version does pick that up, but ...

postgres=# alter system set cluster encoding to ascii;
ERROR: configuration parameter "hba_file" has invalid value
"/home/tmunro/projects/postgresql/build/café/pg_hba.conf"
DETAIL: Configuration parameter values defined at scopes affecting
all sessions must be valid in CLUSTER ENCODING

... I should probably check it explicitly so I can make a better error
message. That one is arbitrarily picking on a computed GUC it
happened to hit first in hash table order, and not data_directory
itself. (HBA content is also an interesting topic.)

Of course this isn't 100% bulletproof
since someone could rename those directories later. But I think
that's in "if you break it you get to keep both pieces" territory.

If you rename the data directory, my in-development patch still
notices though, and warns at startup (ERROR seems unsuitable here).
One fun aspect is that you have to finish recovery first, to have a
consistent CLUSTER ENCODING value before validation.

[424912]: LOG: database system is ready to accept connections
invalid value "/home/tmunro/projects/postgresql/build/café"
[424912]: LOG: database system is ready to accept connections
affecting all sessions must be valid in CLUSTER ENCODING
[424912]: LOG: database system is ready to accept connections

The GUC validation is strongly enforced with ERROR in other contexts.
For plain SET, I'm trying out a scheme for marking the GUCs that are
shareable like application_name, and being liberal with the rest for
interactive values that can't escape from the database.

[1]: https://github.com/macdice/hello-windows/blob/env/test.c
[2]: https://cirrus-ci.com/task/5463497838952448
[3]: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

Nico Williams

nico@cryptonector.com

9 months ago

In reply to: Thomas Munro (#4)

Re: Giving the shared catalogues a defined encoding

On Tue, Dec 10, 2024 at 02:29:09AM +1300, Thomas Munro wrote:

Here are some things I have learned about pathname encoding:

* Some systems enforce an encoding: macOS always requires UTF-8, ext4
does too if you turn on case insensitivity, zfs has a utf8only option,
and a few less interesting-to-us ones have relevant mount options. On
the first three at least: open("cafe\xe9", ...) -> EILSEQ, independent
of user space notions like locales.

Watch out: OS X normalizes to NFD on create. I.e., it doesn't preserve
form on disk. ZFS on OS X follows the when-in-Rome principle and does
this too. NFD was a poor choice because all input methods tend to
produce forms closer to NFC, including OS X's input methods. This means
that a `memcmp()`-like string comparison of a user input and an
_equivalent_ filename obtained from a directory listing may not match.
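
A small C sketch of that mismatch, for illustration only. The two byte strings are the standard UTF-8 encodings of "café" in NFC and NFD; names_equal_bytewise() is a hypothetical stand-in for any `memcmp()`-style comparison.

```c
#include <stdbool.h>
#include <string.h>

/* "café" with precomposed U+00E9 (NFC): 5 bytes of UTF-8 */
static const char cafe_nfc[] = "caf\xc3\xa9";

/* "café" as "e" followed by combining acute U+0301 (NFD): 6 bytes */
static const char cafe_nfd[] = "cafe\xcc\x81";

/* Hypothetical helper: byte-for-byte comparison, with no normalization. */
static bool
names_equal_bytewise(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

Here names_equal_bytewise(cafe_nfc, cafe_nfd) is false even though both render as the same name, so a name typed by a user (typically NFC) would not match the NFD form returned by a directory listing unless one side is normalized first.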

Elsewhere ZFS is form-preserving, with form-insensitive matching. This
interops well because input methods tend to produce the same forms for
the same strings. (Normalizing to NFC on create would probably have
been good enough, but at the time I insisted on form-preserving /
form-insensitive, and I still think that was the best option.)

If the ZFS utf8only option is off, it still does the form-preserving /
form-insensitive thing for path components that are not invalid UTF-8.

[...]
All of that is perfectly reasonable I think, I just want to highlight
the cascading effect of the new constraint: Apple's file system
restricts your *database* encoding, with this design, unless you stick
to plain ASCII pathnames. It is an interesting case to compare with
when untangling the Windows mess, see below...

* Traditional Unix filesystems eg ext4/xfs/ufs/... just don't care:
beyond '/' being special, the encoding is in the eye of the beholder.

Correct, though there's another ASCII codepoint that all Unix
filesystems always treat specially: NUL :) (And as you point out newline
in filenames can be a problem and is discouraged but generally not
forbidden or treated specially by the filesystem system calls nor the
filesystems themselves.)

I call this "just-use-8" behavior.

No Unix C library bothers to implement a UTF-8 convention for paths by
doing codeset conversions when running in non-UTF-8 locales. So the
only reasonable way to do I18N interop on Unix is to stick *strictly* to
UTF-8 locales only.

* Windows has a completely different model. Pathnames are really
UTF-16 in the kernel and on disk. All char strings exchanged with the
system have a defined encoding, but it was non-obvious to this humble
Unix hacker what it is in each case. I don't have Windows, so I spent
the afternoon firing test code at CI[1][2] to figure some of it out.

Historically on Windows NT and up the filesystem system calls and the
filesystems themselves are "just-use-16", with the convention that the
applications and the C runtime will be using UTF-16.

Since nowadays Unix systems strongly prefer UTF-8 locales, the Windows
convention is not that different from the Unix one in practice.

* What I'm wondering is whether we can instead achieve coherence along
the lines of the Apple UTF-8 case I described above, but with an extra
step: if you want to use non-ASCII paths *you have to make your ACP
match the database and cluster encoding*. So either you go all-in on
your favourite 80s encoding like WIN1252 that matches your ACP
(impossible for 932 AKA SJIS), or you switch your system's ACP to
UTF-8. Alternatively, I believe postgres.exe could even be built in a
way that makes its ACP always UTF-8[3] (I guess the loader has to help
with that as it affects the way it sets up environ[] and argv[] before
main() runs). I don't know all the consequences though. And I don't
know what exact rules would be best, but something like that would be
in keeping with the general philosophy of this project: just figure
out how to block the combinations that don't work correctly.

Can you insist that for new DBs `initdb`/`createdb`/`postgres` run only
in Unicode locales on Windows, and with the UTF-8 codepage? Do you have
to support older Windows releases?

(HBA content is also an interesting topic.)

I bet.

Nico
--