BUG #18545: \dt breaks transaction, calling error when executed in SET SESSION AUTHORIZATION
The following bug has been logged on the website:
Bug reference: 18545
Logged by: Andrey Rachitskiy
Email address: therealgofman@mail.ru
PostgreSQL version: 16.3
Operating system: Debian 12
Description:
\dt breaks transaction, calling error when executed in SET SESSION
AUTHORIZATION. This happens in two cases, with different SET.
Case one:
postgres@debian-test:~$ psql -U postgres
psql (16.3)
Type "help" for help.
postgres=# \set VERBOSITY verbose
postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> \dt+;
Did not find any relations.
postgres=*> SET LOCAL debug_parallel_query = 1;
SET
postgres=*> \dt+;
ERROR: 22023: role "regress_priv_user8" does not exist
CONTEXT: while setting parameter "session_authorization" to
"regress_priv_user8"
parallel worker
LOCATION: call_string_check_hook, guc.c:6734
postgres=!#
\q
Case two:
postgres@debian-test:~$ psql -U postgres
psql (16.3)
Type "help" for help.
postgres=# \set VERBOSITY verbose
postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> \dt+
Did not find any relations.
postgres=*> set local parallel_setup_cost = 0;
SET
postgres=*> set local min_parallel_table_scan_size = 0;
SET
postgres=*> \dt+
ERROR: 22023: role "regress_priv_user8" does not exist
CONTEXT: while setting parameter "session_authorization" to
"regress_priv_user8"
parallel worker
LOCATION: call_string_check_hook, guc.c:6734
postgres=!#
\q
PG Bug reporting form <noreply@postgresql.org> writes:
postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> SET LOCAL debug_parallel_query = 1;
SET
postgres=*> \dt+;
ERROR: 22023: role "regress_priv_user8" does not exist
CONTEXT: while setting parameter "session_authorization" to "regress_priv_user8"
parallel worker
So this has exactly nothing to do with \dt+; any parallel query
will hit it. The problem is that parallel workers do
RestoreGUCState() before they've restored the leader's snapshot.
Thus, in this example where session_authorization refers to an
uncommitted pg_authid entry, the workers don't see that entry.
It seems likely that similar failures are possible with other
GUCs that perform catalog lookups.
I experimented with two different ways to fix this:
1. Run RestoreGUCState() outside a transaction, thus preventing
catalog lookups. Assume that individual GUC check hooks that
would wish to do a catalog lookup will cope. Unfortunately,
some of them don't and would need fixed; check_role and
check_session_authorization for two.
2. Delay RestoreGUCState() into the parallel worker's main
transaction, after we've restored the leader's snapshot.
This turns out to break a different set of check hooks, notably
check_transaction_deferrable.
I think that the blast radius of option 2 is probably smaller than
option 1's, because it should only matter to check hooks that think
they should run before the transaction has set a snapshot, and there
are few of those. check_transaction_read_only already had a guard,
but I added similar ones to check_transaction_isolation and
check_transaction_deferrable.
The attached draft patch also contains changes to prevent
check_session_authorization from doing anything during parallel
worker startup. That's left over from experimenting with option 1,
and is not strictly necessary with option 2. I left it in anyway
because it's saving some unnecessary work. (For some reason,
check_role seems not to fail if you modify the test case to use
SET ROLE. I did not figure out why not. I kind of want to modify
check_role to be a no-op too when InitializingParallelWorker,
but did not touch that here pending more investigation.)
Another thing I'm wondering about is whether to postpone
RestoreLibraryState similarly. Its current placement is said
to be "before restoring GUC values", so it looks a little out
of place now. Moving it into the main transaction would save
one StartTransactionCommand/CommitTransactionCommand pair
during parallel worker start, which is worth something.
But I think the real argument for it is that if any loaded
libraries try to do catalog lookups during load, we'd rather
that they see the same catalog state the leader does.
As against that, it feels like there's a nonzero risk of
breaking some third-party code if we move that call.
Thoughts?
regards, tom lane
Attachments:
fix-bug-18545-wip.patchtext/x-diff; charset=us-ascii; name=fix-bug-18545-wip.patchDownload
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb5..06331d906a 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1410,10 +1410,6 @@ ParallelWorkerMain(Datum main_arg)
libraryspace = shm_toc_lookup(toc, PARALLEL_KEY_LIBRARY, false);
StartTransactionCommand();
RestoreLibraryState(libraryspace);
-
- /* Restore GUC values from launching backend. */
- gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
- RestoreGUCState(gucspace);
CommitTransactionCommand();
/* Crank up a transaction state appropriate to a parallel worker. */
@@ -1455,6 +1451,14 @@ ParallelWorkerMain(Datum main_arg)
*/
InvalidateSystemCaches();
+ /*
+ * Restore GUC values from launching backend. We can't do this earlier,
+ * because GUCs that do catalog lookups (such as session_authorization)
+ * need to see the same database state as the leader.
+ */
+ gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
+ RestoreGUCState(gucspace);
+
/*
* Restore current role id. Skip verifying whether session user is
* allowed to become this role and blindly restore the leader's state for
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 9345131711..b2a00ae0f0 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -577,14 +577,16 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
* We allow idempotent changes at any time, but otherwise this can only be
* changed in a toplevel transaction that has not yet taken a snapshot.
*
- * As in check_transaction_read_only, allow it if not inside a transaction.
+ * As in check_transaction_read_only, allow it if not inside a transaction,
+ * or if restoring state in a parallel worker.
*/
bool
check_transaction_isolation(int *newval, void **extra, GucSource source)
{
int newXactIsoLevel = *newval;
- if (newXactIsoLevel != XactIsoLevel && IsTransactionState())
+ if (newXactIsoLevel != XactIsoLevel &&
+ IsTransactionState() && !InitializingParallelWorker)
{
if (FirstSnapshotSet)
{
@@ -619,6 +621,10 @@ check_transaction_isolation(int *newval, void **extra, GucSource source)
bool
check_transaction_deferrable(bool *newval, void **extra, GucSource source)
{
+ /* Just accept the value when restoring state in a parallel worker */
+ if (InitializingParallelWorker)
+ return true;
+
if (IsSubTransaction())
{
GUC_check_errcode(ERRCODE_ACTIVE_SQL_TRANSACTION);
@@ -811,6 +817,15 @@ check_session_authorization(char **newval, void **extra, GucSource source)
if (*newval == NULL)
return true;
+ /*
+ * Do nothing if initializing a parallel worker: ParallelWorkerMain is
+ * responsible for setting the active role. We just need to absorb the
+ * leader's value of session_authorization in case some user code looks at
+ * it.
+ */
+ if (InitializingParallelWorker)
+ return true;
+
if (!IsTransactionState())
{
/*
@@ -886,7 +901,7 @@ assign_session_authorization(const char *newval, void *extra)
{
role_auth_extra *myextra = (role_auth_extra *) extra;
- /* Do nothing for the boot_val default of NULL */
+ /* Do nothing for the boot_val default of NULL, and in parallel workers */
if (!myextra)
return;
Hi, Tom! Thank you for work on the subject. After applying patch, problem is no longer reproducible.
---
Best regards,
Andrey Rachitskiy
Postgres Professional: http://postgrespro.com
Суббота, 20 июля 2024, 0:04 +05:00 от Tom Lane <tgl@sss.pgh.pa.us>:
PG Bug reporting form < noreply@postgresql.org > writes:postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> SET LOCAL debug_parallel_query = 1;
SET
postgres=*> \dt+;
ERROR: 22023: role "regress_priv_user8" does not exist
CONTEXT: while setting parameter "session_authorization" to "regress_priv_user8"
parallel workerSo this has exactly nothing to do with \dt+; any parallel query
will hit it. The problem is that parallel workers do
RestoreGUCState() before they've restored the leader's snapshot.
Thus, in this example where session_authorization refers to an
uncommitted pg_authid entry, the workers don't see that entry.
It seems likely that similar failures are possible with other
GUCs that perform catalog lookups.I experimented with two different ways to fix this:
1. Run RestoreGUCState() outside a transaction, thus preventing
catalog lookups. Assume that individual GUC check hooks that
would wish to do a catalog lookup will cope. Unfortunately,
some of them don't and would need fixed; check_role and
check_session_authorization for two.2. Delay RestoreGUCState() into the parallel worker's main
transaction, after we've restored the leader's snapshot.
This turns out to break a different set of check hooks, notably
check_transaction_deferrable.I think that the blast radius of option 2 is probably smaller than
option 1's, because it should only matter to check hooks that think
they should run before the transaction has set a snapshot, and there
are few of those. check_transaction_read_only already had a guard,
but I added similar ones to check_transaction_isolation and
check_transaction_deferrable.The attached draft patch also contains changes to prevent
check_session_authorization from doing anything during parallel
worker startup. That's left over from experimenting with option 1,
and is not strictly necessary with option 2. I left it in anyway
because it's saving some unnecessary work. (For some reason,
check_role seems not to fail if you modify the test case to use
SET ROLE. I did not figure out why not. I kind of want to modify
check_role to be a no-op too when InitializingParallelWorker,
but did not touch that here pending more investigation.)Another thing I'm wondering about is whether to postpone
RestoreLibraryState similarly. Its current placement is said
to be "before restoring GUC values", so it looks a little out
of place now. Moving it into the main transaction would save
one StartTransactionCommand/CommitTransactionCommand pair
during parallel worker start, which is worth something.
But I think the real argument for it is that if any loaded
libraries try to do catalog lookups during load, we'd rather
that they see the same catalog state the leader does.
As against that, it feels like there's a nonzero risk of
breaking some third-party code if we move that call.Thoughts?
regards, tom lane
=?UTF-8?B?0JDQvdC00YDQtdC5INCg0LDRh9C40YbQutC40Lk=?= <therealgofman@mail.ru> writes:
Hi, Tom! Thank you for work on the subject. After applying patch, problem is no longer reproducible.
Thanks for checking. I realized that the idea of making
check_session_authorization a no-op was wrong as presented:
if we don't set up an "extra" struct then guc.c would be
unable to restore the setting later, in case say a function
that's run inside the parallel query has a SET
session_authorization clause. It's probably possible to
revive that idea with more work, but it's not essential to the
bug fix and we're getting close to the August minor releases.
So I pushed the core bug fix, and I'll take another look at
that part later.
regards, tom lane
I wrote:
So I pushed the core bug fix, and I'll take another look at
that part later.
... or not; the buildfarm didn't like that much. It'll
have to wait till after these releases, because I'm
overdue to get to work on the release notes.
regards, tom lane
I wrote:
So I pushed the core bug fix, and I'll take another look at
that part later.
... or not; the buildfarm didn't like that much. It'll
have to wait till after these releases, because I'm
overdue to get to work on the release notes.
The reason the buildfarm found something I'd missed in testing
is that I didn't run check-world with debug_parallel_query set,
which was a bad idea for a patch messing with parallel query
mechanics :-(. Mea culpa.
However, what the farm found is that assign_client_encoding
is flat-out broken. It's ignoring the first commandment
for GUC hooks, which is "Thy assign hooks shalt not fail".
(If we didn't need that, there wouldn't be a separation
between check hooks and assign hooks in the first place.)
Because it's throwing an error at the wrong time, it spits
up if it sees an (irrelevant) rollback of client_encoding
during cleanup of a failed parallel worker. We didn't see
this before because GUCRestoreState was being run in a separate
mini-transaction that (usually at least) doesn't fail.
The attached correction basically just moves that test into
check_client_encoding where it should have been to begin with.
After applying this, I can un-revert f5f30c22e and everything
passes.
However, this episode definitely gives me pause about back-patching
f5f30c22e as I did before. It seems not impossible that there
are extensions with similarly mis-coded assign hooks, and if
so those are going to need to be fixed. (I did check that none
of the other core GUCs have this problem. assign_recovery_target
and friends would, except that they are for PGC_POSTMASTER variables
that won't be getting changed by GUCRestoreState. Anyway they're a
known kluge, and this patch isn't making it worse.) So we'd better
treat this as a minor API change, which means we probably shouldn't
put it in stable branches.
What I'm currently thinking is to apply in HEAD and perhaps v17,
but not further back. Given that this bug has existed since
the beginning of parallel query yet wasn't reported till now,
it's not sufficiently problematic to take any risk for in
stable branches.
Any opinions about whether it's too late to do this in v17?
Post-beta3 is pretty late, for sure, but maybe we could get
away with it. And we are fixing a bug here.
regards, tom lane
Attachments:
fix-assign-client-encoding-breakage.patchtext/x-diff; charset=us-ascii; name=fix-assign-client-encoding-breakage.patchDownload
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index f44d942aa4..6ba6d08b24 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -690,6 +690,24 @@ check_client_encoding(char **newval, void **extra, GucSource source)
/* Get the canonical name (no aliases, uniform case) */
canonical_name = pg_encoding_to_char(encoding);
+ /*
+ * Parallel workers send data to the leader, not the client. They always
+ * send data using the database encoding; therefore, we should never
+ * actually change the client encoding in a parallel worker. However,
+ * during parallel worker startup, we want to accept the leader's
+ * client_encoding setting so that anyone who looks at the value in the
+ * worker sees the same value that they would see in the leader. A change
+ * other than during startup, for example due to a SET clause attached to
+ * a function definition, should be rejected, as there is nothing we can
+ * do inside the worker to make it take effect.
+ */
+ if (IsParallelWorker() && !InitializingParallelWorker)
+ {
+ GUC_check_errcode(ERRCODE_INVALID_TRANSACTION_STATE);
+ GUC_check_errdetail("Cannot change \"client_encoding\" during a parallel operation.");
+ return false;
+ }
+
/*
* If we are not within a transaction then PrepareClientEncoding will not
* be able to look up the necessary conversion procs. If we are still
@@ -700,11 +718,15 @@ check_client_encoding(char **newval, void **extra, GucSource source)
* It seems like a bad idea for client_encoding to change that way anyhow,
* so we don't go out of our way to support it.
*
+ * In a parallel worker, we might as well skip PrepareClientEncoding since
+ * we're not going to use its results.
+ *
* Note: in the postmaster, or any other process that never calls
* InitializeClientEncoding, PrepareClientEncoding will always succeed,
* and so will SetClientEncoding; but they won't do anything, which is OK.
*/
- if (PrepareClientEncoding(encoding) < 0)
+ if (!IsParallelWorker() &&
+ PrepareClientEncoding(encoding) < 0)
{
if (IsTransactionState())
{
@@ -758,28 +780,11 @@ assign_client_encoding(const char *newval, void *extra)
int encoding = *((int *) extra);
/*
- * Parallel workers send data to the leader, not the client. They always
- * send data using the database encoding.
+ * In a parallel worker, we never override the client encoding that was
+ * set by ParallelWorkerMain().
*/
if (IsParallelWorker())
- {
- /*
- * During parallel worker startup, we want to accept the leader's
- * client_encoding setting so that anyone who looks at the value in
- * the worker sees the same value that they would see in the leader.
- */
- if (InitializingParallelWorker)
- return;
-
- /*
- * A change other than during startup, for example due to a SET clause
- * attached to a function definition, should be rejected, as there is
- * nothing we can do inside the worker to make it take effect.
- */
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
- errmsg("cannot change \"client_encoding\" during a parallel operation")));
- }
+ return;
/* We do not expect an error if PrepareClientEncoding succeeded */
if (SetClientEncoding(encoding) < 0)
On Sun, Aug 4, 2024 at 3:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Any opinions about whether it's too late to do this in v17?
Post-beta3 is pretty late, for sure, but maybe we could get
away with it. And we are fixing a bug here.
If this isn't going to appear in the beta3 build I'd say it's probably too
late given the target audience for waiting on this is extension authors.
If it is going into beta3 then I'd vote to allow it.
Feels like this dynamic should be covered as part of our recent attempt to
better communicate our policies to our extension authoring community.
https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-GUIDANCE-ABI-MNINOR-VERSIONS
Something like:
Namely, beta releases do not constitute a minor release under our policies
and updates to our API/ABIs can happen during beta at any point - whether
for features newly added in the under- development major version or not.
Extension authors are thus encouraged to test their extensions against the
RC build at minimum should they wish for their extension to be ready when
the initial release comes out. Tests against beta versions are very
helpful to all interested parties but there is no guarantee that tests that
pass any given beta release will pass when performed against the release
candidate. For the release candidate we will use the same patching policy
as for a normal minor release. Any exceptions will necessitate a second
release candidate.
The above wording allows us to put this patch into beta3, which I'd be fine
with. But I'd also be fine with adding wording like: "Changes introduced
after the final beta is released for testing will [generally?] be limited
to fixing items conforming to the Open Item policy." Probably favor the
latter too by a small margin.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
On Sun, Aug 4, 2024 at 3:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Any opinions about whether it's too late to do this in v17?
Post-beta3 is pretty late, for sure, but maybe we could get
away with it. And we are fixing a bug here.
If this isn't going to appear in the beta3 build I'd say it's probably too
late given the target audience for waiting on this is extension authors.
If it is going into beta3 then I'd vote to allow it.
Nope, it's definitely not going into beta3; it's about two days
too late for that.
I agree fixing it in HEAD only is the more conservative course.
To do otherwise, we'd have to rank the #18545 bug as fairly
important, and I'm not sure I buy that given how long it took to
notice it. But I was curious to see if anyone felt differently.
regards, tom lane
On Sun, Aug 4, 2024 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, what the farm found is that assign_client_encoding
is flat-out broken. It's ignoring the first commandment
for GUC hooks, which is "Thy assign hooks shalt not fail".
(If we didn't need that, there wouldn't be a separation
between check hooks and assign hooks in the first place.)
Interesting. Looks like my mistake, dating to
10c0558ffefcd12bf1d3dc35587eba41d1ce4571. I'm honestly kind of
surprised that nobody discovered this problem for 8 years. I would
have expected it to cause more problems.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Aug 4, 2024 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, what the farm found is that assign_client_encoding
is flat-out broken. It's ignoring the first commandment
for GUC hooks, which is "Thy assign hooks shalt not fail".
Interesting. Looks like my mistake, dating to
10c0558ffefcd12bf1d3dc35587eba41d1ce4571. I'm honestly kind of
surprised that nobody discovered this problem for 8 years. I would
have expected it to cause more problems.
Yeah, it's a bit accidental that that's not reachable up to now.
Or I think it's not reachable, anyway. If we find out differently
we can back-patch 0ae5b763e, but for now I refrained.
regards, tom lane
This commit seems to trigger elog(), not reproducible in the
parent commit.
6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.
postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR: pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
postgres=# \errverbose
ERROR: XX000: pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
LOCATION: RelationBuildTupleDesc, relcache.c:658
This is not completely deterministic:
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
CLUSTER
postgres=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR: pg_attribute catalog is missing 26 attribute(s) for relation OID 70391
But I think this will be reproducible in any database with a nontrivial
number of attributes.
Justin Pryzby <pryzby@telsasoft.com> writes:
This commit seems to trigger elog(), not reproducible in the
parent commit.
Yeah, I can reproduce that. Will take a look tomorrow.
regards, tom lane
Justin Pryzby <pryzby@telsasoft.com> writes:
This commit seems to trigger elog(), not reproducible in the
parent commit.
6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.
postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR: pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
I've been poking at this all day, and I still have little idea what's
going on. I've added a bunch of throwaway instrumentation, and have
managed to convince myself that the problem is that parallel heap
scan is broken. The scans done to rebuild pg_attribute's indexes
seem to sometimes miss heap pages or visit pages twice (in different
workers). I have no idea why this is, and even less idea how
6e086fa2e is provoking it. As you say, the behavior isn't entirely
reproducible, but I couldn't make it happen at all after reverting
6e086fa2e's changes in transam/parallel.c, so apparently there is
some connection.
Another possibly useful data point is that for me it reproduces
fairly well (more than one time in two) on x86_64 Linux, but
I could not make it happen on macOS ARM64. If it's a race
condition, which smells plausible, that's perhaps not hugely
surprising.
regards, tom lane
I wrote:
Justin Pryzby <pryzby@telsasoft.com> writes:
This commit seems to trigger elog(), not reproducible in the
parent commit.
6e086fa2e77 Allow parallel workers to cope with a newly-created session user ID.
postgres=# SET min_parallel_table_scan_size=0; CLUSTER pg_attribute USING pg_attribute_relid_attnum_index;
ERROR: pg_attribute catalog is missing 26 attribute(s) for relation OID 70321
I've been poking at this all day, and I still have little idea what's
going on.
Got it, after a good deal more head-scratching. Here's the relevant
parts of ParallelWorkerMain:
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
*/
InvalidateSystemCaches();
/*
* Restore GUC values from launching backend. We can't do this earlier,
* because GUC check hooks that do catalog lookups need to see the same
* database state as the leader.
*/
gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
RestoreGUCState(gucspace);
...
/* Restore relmapper state. */
relmapperspace = shm_toc_lookup(toc, PARALLEL_KEY_RELMAPPER_STATE, false);
RestoreRelationMap(relmapperspace);
InvalidateSystemCaches blows away the worker's relcache. Then
RestoreGUCState causes some catalog lookups (tracing shows that
restoring default_text_search_config is what triggers this on my
setup), and in particular pg_attribute's relcache entry will get
constructed to support that. Then we wheel in a new set of
relation map entries *without doing anything about what that
might invalidate*.
In the given test case, the globally-visible relmap says that
pg_attribute's relfilenode is, say, XXXX. But we are busy rewriting
it, so the parent process has an "active" relmap entry that says
pg_attribute's relfilenode is YYYY. Given the above, the worker
process will have built a pg_attribute relcache entry that contains
XXXX, and even though it now knows YYYY is the value it should be
using, that information never makes it to the worker's relcache.
The upshot of this is that when the parallel heap scan machinery
doles out some block numbers for the parent process to read, and
some other block numbers for the worker to read, the worker is
reading those block numbers from the pre-clustering copy of
pg_attribute, which most likely doesn't match the post-clustering
image. This accounts for the missing and duplicate tuples I was
seeing in the scan output.
Of course, the reason 6e086fa2e made this visible is that before
that, any catalog reads triggered by RestoreGUCState were done
in an earlier transaction, and then we would blow away the ensuing
relcache entries in InvalidateSystemCaches. So there was no bug
as long as you assume that the "..." code doesn't cause any
catalog reads. I'm not too sure of that though --- it's certainly
not very comfortable to assume that functions like SetCurrentRoleId
and SetTempNamespaceState will never attempt a catalog lookup.
The code has another hazard too, which is that this all implies
that the GUC-related catalog lookups will be done against the
globally-visible relmap state not whatever is active in the parent
process. I have not tried to construct a POC showing that that
can give incorrect answers (that is, different from what the
parent thinks), but it seems plausible that it could.
So the fix seems clear to me: RestoreRelationMap needs to happen
before anything that could result in catalog lookups. I'm kind
of inclined to move up the adjacent restores of non-transactional
low-level stuff too, particularly RestoreReindexState which has
direct impact on how catalog lookups are done.
Independently of that, it's annoying that the parallel heap scan
machinery failed to notice that it was handing out block numbers
for two different relfilenodes. I'm inclined to see if we can
put some Asserts in there that would detect that. This particular
bug would have been far easier to diagnose that way, and it hardly
seems unlikely that "worker is reading the wrong relation" could
happen with other mistakes in future.
regards, tom lane
On Thu, Sep 19, 2024 at 05:35:33PM -0400, Tom Lane wrote:
So the fix seems clear to me: RestoreRelationMap needs to happen
before anything that could result in catalog lookups. I'm kind
of inclined to move up the adjacent restores of non-transactional
low-level stuff too, particularly RestoreReindexState which has
direct impact on how catalog lookups are done.
Thanks for debugging that. RestorePendingSyncs() also changes what
RelationInitPhysicalAddr() puts in the relcache entry, so it needs to stay
with RestoreRelationMap(). I'm attaching the fix I have in mind. Since the
above (commit 126ec0b), the following has been getting an
AssertPendingSyncConsistency() trap:
make check-tests TESTS=reindex_catalog PG_TEST_INITDB_EXTRA_OPTS='-cwal_level=minimal -cmax_wal_senders=0'
In commit range [6e086fa,126ec0b^], that failed differently, reflecting the
outdated relmapper. (6e086fa is mostly innocent here, though.)
I looked for ways we'd regret not back-patching commit 126ec0b and this, but I
didn't find a concrete problem. After InvalidateSystemCaches(), the
back-branch parallel worker relcache contains the nailed relations. Each
entry then has:
- rd_locator possibly outdated
- rd_firstRelfilelocatorSubid==0 (correct for rd_locator)
- rd_isvalid==false, from RelationReloadNailed()
From code comments, I gather RelationCacheInitializePhase2() is the latest we
use an rd_locator without first making the relcache entry rd_isvalid=true. If
that's right, so long as no catalogs get rd_isvalid=true between
InvalidateSystemCaches() and RestorePendingSyncs(), we're okay without a
back-patch.
I was going to add check-world coverage of this case, but I stopped in light
of the tricky deadlocks that led to commit fe4d022 disabling the
reindex_catalog test. check-world does reach it, in inplace.spec, if one runs
with both wal_level=minimal and debug_parallel_query=regress. (inplace.spec
may eventually fail with the same deadlocks. I've not heard of it deadlocking
so far, and being a NO_INSTALLCHECK test helps.)
Attachments:
RestorePendingSyncs-at-RestoreRelationMap-v1.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Fix parallel worker tracking of new catalog relfilenumbers.
Reunite RestorePendingSyncs() with RestoreRelationMap(). If
RelationInitPhysicalAddr() ran after RestoreRelationMap() but before
RestorePendingSyncs(), the relcache entry could cause RelationNeedsWAL()
to return true erroneously. Trouble required commands of the current
transaction to include REINDEX or CLUSTER of a system catalog. The
parallel leader correctly derived RelationNeedsWAL()==false from the new
relfilenumber, but the worker saw RelationNeedsWAL()==true. Worker
MarkBufferDirtyHint() then wrote unwanted WAL. Recovery of that
unwanted WAL could lose tuples like the system could before commit
c6b92041d38512a4176ed76ad06f713d2e6c01a8 introduced this tracking.
RestorePendingSyncs() and RestoreRelationMap() were adjacent till commit
126ec0bc76d044d3a9eb86538b61242bf7da6db4, so no back-patch for now.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index d4e84aa..4a2e352 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1421,17 +1421,18 @@ ParallelWorkerMain(Datum main_arg)
StartParallelWorkerTransaction(tstatespace);
/*
- * Restore relmapper and reindex state early, since these affect catalog
- * access. Ideally we'd do this even before calling InitPostgres, but
- * that has order-of-initialization problems, and also the relmapper would
- * get confused during the CommitTransactionCommand call above.
+ * Restore state that affects catalog access. Ideally we'd do this even
+ * before calling InitPostgres, but that has order-of-initialization
+ * problems, and also the relmapper would get confused during the
+ * CommitTransactionCommand call above.
*/
+ pendingsyncsspace = shm_toc_lookup(toc, PARALLEL_KEY_PENDING_SYNCS,
+ false);
+ RestorePendingSyncs(pendingsyncsspace);
relmapperspace = shm_toc_lookup(toc, PARALLEL_KEY_RELMAPPER_STATE, false);
RestoreRelationMap(relmapperspace);
reindexspace = shm_toc_lookup(toc, PARALLEL_KEY_REINDEX_STATE, false);
RestoreReindexState(reindexspace);
-
- /* Restore combo CID state. */
combocidspace = shm_toc_lookup(toc, PARALLEL_KEY_COMBO_CID, false);
RestoreComboCIDState(combocidspace);
@@ -1488,11 +1489,6 @@ ParallelWorkerMain(Datum main_arg)
SetTempNamespaceState(fps->temp_namespace_id,
fps->temp_toast_namespace_id);
- /* Restore pending syncs. */
- pendingsyncsspace = shm_toc_lookup(toc, PARALLEL_KEY_PENDING_SYNCS,
- false);
- RestorePendingSyncs(pendingsyncsspace);
-
/* Restore uncommitted enums. */
uncommittedenumsspace = shm_toc_lookup(toc, PARALLEL_KEY_UNCOMMITTEDENUMS,
false);
Noah Misch <noah@leadboat.com> writes:
Thanks for debugging that. RestorePendingSyncs() also changes what
RelationInitPhysicalAddr() puts in the relcache entry, so it needs to stay
with RestoreRelationMap(). I'm attaching the fix I have in mind.
Ah. No objections here.
regards, tom lane