Subscription tests fail under CLOBBER_CACHE_ALWAYS
I discovered $SUBJECT after wondering why hyrax hadn't reported
in recently, and trying to run check-world under CCA to see if
anything got stuck. Indeed it did --- although this doesn't
explain the radio silence from hyrax, because that animal doesn't
run any TAP tests. (Neither does avocet, which I think is the
only other active CCA critter. So this could have been broken
for a very long time.)
I count three distinct bugs that were exposed by this attempt:
1. In the part of 013_partition.pl that tests firing AFTER
triggers on partitioned tables, we have a case of continuing
to access a relcache entry that's already been closed.
(I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
hasn't exposed this.) It looks to me like instead we had
a relcache reference leak before f3b141c48, but now, the
only relcache reference count on a partition child table
is dropped by ExecCleanupTupleRouting, which logical/worker.c
invokes before it fires triggers on that table. Kaboom.
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.
2. Said bug causes a segfault in the apply worker process.
This causes the parent postmaster to give up and die.
I don't understand why we don't treat that like a crash
in a regular backend, considering that an apply worker
is running largely user-defined code.
3. Once the subscriber1 postmaster has exited, the TAP
test will eventually time out, and then this happens:
timed out waiting for catchup at t/013_partition.pl line 219.
### Stopping node "publisher" using mode immediate
# Running: pg_ctl -D /Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_publisher_data/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "publisher"
### Stopping node "subscriber1" using mode immediate
# Running: pg_ctl -D /Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata -m immediate stop
pg_ctl: PID file "/Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata/postmaster.pid" does not exist
Is server running?
Bail out! system pg_ctl failed
That is, because we failed to shut down subscriber1, the
test script neglects to shut down subscriber2, and now
things just sit indefinitely. So that's a robustness
problem in the TAP infrastructure, rather than a bug in
PG proper; but I still say it's a bug that needs fixing.
regards, tom lane
On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
I count three distinct bugs that were exposed by this attempt:
1. In the part of 013_partition.pl that tests firing AFTER
triggers on partitioned tables, we have a case of continuing
to access a relcache entry that's already been closed.
(I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
hasn't exposed this.) It looks to me like instead we had
a relcache reference leak before f3b141c48, but now, the
only relcache reference count on a partition child table
is dropped by ExecCleanupTupleRouting, which logical/worker.c
invokes before it fires triggers on that table. Kaboom.
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.
The tuple routing has made the whole worker logic messier by a larger
degree compared to when this stuff was only able to apply DML changes
on the partition leaves. I know that it is not that great to be more
creative here, but we need to make sure that AfterTriggerEndQuery() is
moved before ExecCleanupTupleRouting(). We could either keep the
ExecCleanupTupleRouting() calls as they are now, and call
AfterTriggerEndQuery() in more code paths. Or we could have one
PartitionTupleRouting and one ModifyTableState created beforehand
in create_estate_for_relation() if applying the change on a
partitioned table but that means manipulating more structures across
more layers of this code. Something like the attached fixes the
problem for me, but honestly it does not help in clarifying this code
more. I was not patient enough to wait for CLOBBER_CACHE_ALWAYS to
initialize the nodes in the TAP tests, so I have tested that with a
setup initialized with a non-clobber build, and reproduced the problem
with CLOBBER_CACHE_ALWAYS builds on those same nodes.
You are right that this is not a problem of 14~. I can reproduce the
problem on 13 as well, and we have no coverage for tuple routing with
triggers on this branch, so this would never have been stressed in the
buildfarm. There is a good argument to be made here in cherry-picking
2ecfeda3 to REL_13_STABLE.
2. Said bug causes a segfault in the apply worker process.
This causes the parent postmaster to give up and die.
I don't understand why we don't treat that like a crash
in a regular backend, considering that an apply worker
is running largely user-defined code.
CleanupBackgroundWorker() and CleanupBackend() have a lot of common
points. Are you referring to an inconsistent behavior with
restart_after_crash that gets ignored for bgworkers? We disable it by
default in the TAP tests.
3. Once the subscriber1 postmaster has exited, the TAP
test will eventually time out, and then this happens:[.. logs ..]
That is, because we failed to shut down subscriber1, the
test script neglects to shut down subscriber2, and now
things just sit indefinitely. So that's a robustness
problem in the TAP infrastructure, rather than a bug in
PG proper; but I still say it's a bug that needs fixing.
This one comes down to teardown_node() that uses system_or_bail(),
leaving things unfinished. I guess that we could be more aggressive
and ignore failures when we get a non-zero exit code and not all the
tests have passed, within the END block of PostgresNode.pm.
--
Michael
Attachments:
logirep-partition-clobber.patch (text/x-diff, +21/-3)
On Wed, May 19, 2021 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
I count three distinct bugs that were exposed by this attempt:
1. In the part of 013_partition.pl that tests firing AFTER
triggers on partitioned tables, we have a case of continuing
to access a relcache entry that's already been closed.
(I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
hasn't exposed this.) It looks to me like instead we had
a relcache reference leak before f3b141c48, but now, the
only relcache reference count on a partition child table
is dropped by ExecCleanupTupleRouting, which logical/worker.c
invokes before it fires triggers on that table. Kaboom.
Oops.
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.

The tuple routing has made the whole worker logic messier by a larger
degree compared to when this stuff was only able to apply DML changes
on the partition leaves. I know that it is not that great to be more
creative here, but we need to make sure that AfterTriggerEndQuery() is
moved before ExecCleanupTupleRouting(). We could either keep the
ExecCleanupTupleRouting() calls as they are now, and call
AfterTriggerEndQuery() in more code paths.
Yeah, that's what I thought to propose doing too. Your patch looks
enough in that regard. Thanks for writing it.
Or we could have one
PartitionTupleRouting and one ModifyTableState created beforehand
in create_estate_for_relation() if applying the change on a
partitioned table but that means manipulating more structures across
more layers of this code.
Yeah, that seems like too much change to me too.
Something like the attached fixes the
problem for me, but honestly it does not help in clarifying this code
more. I was not patient enough to wait for CLOBBER_CACHE_ALWAYS to
initialize the nodes in the TAP tests, so I have tested that with a
setup initialized with a non-clobber build, and reproduced the problem
with CLOBBER_CACHE_ALWAYS builds on those same nodes.
I have checked the fix works with a CLOBBER_CACHE_ALWAYS build.
You are right that this is not a problem of 14~. I can reproduce the
problem on 13 as well, and we have no coverage for tuple routing with
triggers on this branch, so this would never have been stressed in the
buildfarm. There is a good argument to be made here in cherry-picking
2ecfeda3 to REL_13_STABLE.
+1
--
Amit Langote
EDB: http://www.enterprisedb.com
Michael Paquier <michael@paquier.xyz> writes:
On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.
The tuple routing has made the whole worker logic messier by a larger
degree compared to when this stuff was only able to apply DML changes
on the partition leaves. I know that it is not that great to be more
creative here, but we need to make sure that AfterTriggerEndQuery() is
moved before ExecCleanupTupleRouting(). We could either keep the
ExecCleanupTupleRouting() calls as they are now, and call
AfterTriggerEndQuery() in more code paths. Or we could have one
PartitionTupleRouting and one ModifyTableState created beforehand
in create_estate_for_relation() if applying the change on a
partitioned table but that means manipulating more structures across
more layers of this code. Something like the attached fixes the
problem for me, but honestly it does not help in clarifying this code
more.
I was wondering if we could move the ExecCleanupTupleRouting call
into finish_estate. copyfrom.c, for example, does that during
its shutdown function. Compare also the worker.c changes proposed
in
/messages/by-id/3362608.1621367104@sss.pgh.pa.us
which are because I discovered it's unsafe to pop the snapshot
before running AfterTriggerEndQuery.
regards, tom lane
On Tue, May 18, 2021 at 11:46:25PM -0400, Tom Lane wrote:
I was wondering if we could move the ExecCleanupTupleRouting call
into finish_estate. copyfrom.c, for example, does that during
its shutdown function. Compare also the worker.c changes proposed
in
Yeah, the first patch I wrote for this thread was pushing out
PopActiveSnapshot() into the finish() routine, but I really found the
creation of the ModifyTableState stuff needed for a partitioned table
done in create_estate_for_relation() to make the code more confusing,
as that's only a piece needed for the tuple routing path. Saying
that, minimizing calls to PopActiveSnapshot() and PushActiveSnapshot()
is an improvement. That's why I settled into more calls to
AfterTriggerEndQuery() in the 4 code paths of any apply (tuple routing
+ 3 DMLs).
/messages/by-id/3362608.1621367104@sss.pgh.pa.us
which are because I discovered it's unsafe to pop the snapshot
before running AfterTriggerEndQuery.
Didn't remember this one. This reminds me of something similar I did
a couple of weeks ago for the worker code. Moving the snapshot push
to be earlier, as your other patch is doing, brings a bit more sanity
when it comes to opening the indexes of the relation on which a change
is applied, as we need an active snapshot for predicates and
expressions (aka ExecOpenIndices and ExecCloseIndices).
--
Michael
On Wed, May 19, 2021 at 9:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, May 18, 2021 at 11:46:25PM -0400, Tom Lane wrote:
I was wondering if we could move the ExecCleanupTupleRouting call
into finish_estate. copyfrom.c, for example, does that during
its shutdown function. Compare also the worker.c changes proposed
in

Yeah, the first patch I wrote for this thread was pushing out
PopActiveSnapshot() into the finish() routine, but I really found the
creation of the ModifyTableState stuff needed for a partitioned table
done in create_estate_for_relation() to make the code more confusing,
as that's only a piece needed for the tuple routing path.
How about moving AfterTriggerEndQuery() to apply_handle_*_internal
calls? That way, we might not even need to change Push/Pop calls.
--
With Regards,
Amit Kapila.
On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
How about moving AfterTriggerEndQuery() to apply_handle_*_internal
calls? That way, we might not even need to change Push/Pop calls.
Isn't that going to be a problem when a tuple is moved to a new
partition in the tuple routing? This does a DELETE followed by an
INSERT, but the operation is an UPDATE.
--
Michael
On Wed, May 19, 2021 at 10:35 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
How about moving AfterTriggerEndQuery() to apply_handle_*_internal
calls? That way, we might not even need to change Push/Pop calls.

Isn't that going to be a problem when a tuple is moved to a new
partition in the tuple routing?
Right, it won't work.
--
With Regards,
Amit Kapila.
On Wed, May 19, 2021 at 2:05 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
How about moving AfterTriggerEndQuery() to apply_handle_*_internal
calls? That way, we might not even need to change Push/Pop calls.

Isn't that going to be a problem when a tuple is moved to a new
partition in the tuple routing? This does a DELETE followed by an
INSERT, but the operation is an UPDATE.
That indeed doesn't work. Once AfterTriggerEndQuery() would get
called for DELETE from apply_handle_delete_internal(), after triggers
of the subsequent INSERT can't be processed, instead causing:
ERROR: AfterTriggerSaveEvent() called outside of query
IOW, the patch you posted earlier seems like the way to go.
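The failure mode can be sketched with a toy model of the AFTER-trigger query nesting (illustrative stand-ins, not the real afterTriggers machinery in trigger.c): once the query level is closed after the internal DELETE, the INSERT half of the cross-partition UPDATE has nowhere to queue its trigger events.

```c
/*
 * Toy model of the AFTER-trigger query nesting; illustrative stand-ins,
 * not the real afterTriggers machinery.
 */
#include <stdbool.h>

static int  query_depth = 0;    /* stand-in for afterTriggers.query_depth */

static void fake_begin_query(void) { query_depth++; }
static void fake_end_query(void)   { if (query_depth > 0) query_depth--; }

/* stand-in for AfterTriggerSaveEvent: fails outside of a query */
static bool
fake_save_event(void)
{
    return query_depth > 0;     /* else: "called outside of query" */
}

/*
 * A cross-partition UPDATE is applied as DELETE + INSERT inside one
 * query.  end_query_inside_delete mimics moving AfterTriggerEndQuery()
 * into apply_handle_delete_internal(), as was suggested.
 */
static bool
apply_update_as_move(bool end_query_inside_delete)
{
    bool    ok;

    fake_begin_query();
    ok = fake_save_event();             /* DELETE's trigger event */
    if (end_query_inside_delete)
        fake_end_query();
    ok = ok && fake_save_event();       /* INSERT's trigger event */
    if (!end_query_inside_delete)
        fake_end_query();
    return ok;
}
```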
--
Amit Langote
EDB: http://www.enterprisedb.com
On 5/19/21 1:42 AM, Tom Lane wrote:
I discovered $SUBJECT after wondering why hyrax hadn't reported
in recently, and trying to run check-world under CCA to see if
anything got stuck. Indeed it did --- although this doesn't
explain the radio silence from hyrax, because that animal doesn't
run any TAP tests. (Neither does avocet, which I think is the
only other active CCA critter. So this could have been broken
for a very long time.)
There are three CCA animals on the same box (avocet, husky and
trilobite) with different compilers, running in a round-robin manner.
One cycle took about 14 days, but about a month ago the machine got
stuck, requiring a hard reboot about a week ago (no idea why it got
stuck). It has more CPU power now (8 cores instead of 2), so it should
take less time to run one test cycle.
avocet already ran all the tests, husky is running HEAD at the moment
and then it'll be trilobite's turn ... AFAICS none of those runs seems
to have failed or got stuck so far.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/18/21 11:03 PM, Michael Paquier wrote:
3. Once the subscriber1 postmaster has exited, the TAP
test will eventually time out, and then this happens:[.. logs ..]
That is, because we failed to shut down subscriber1, the
test script neglects to shut down subscriber2, and now
things just sit indefinitely. So that's a robustness
problem in the TAP infrastructure, rather than a bug in
PG proper; but I still say it's a bug that needs fixing.

This one comes down to teardown_node() that uses system_or_bail(),
leaving things unfinished. I guess that we could be more aggressive
and ignore failures if we have a non-zero error code and that not all
the tests have passed within the END block of PostgresNode.pm.
Yeah, this area needs substantial improvement. I have seen similar sorts
of nasty hangs, where the script is waiting forever for some process
that hasn't got the shutdown message. At least we probably need some way
of making sure the END handler doesn't abort early. Maybe
PostgresNode::stop() needs a mode that handles failure more gracefully.
Maybe it needs to try shutting down all the nodes and only calling
BAIL_OUT after trying all of them and getting a failure. But that might
still leave us work to do on failures occurring pre-END.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes:
IOW, the patch you posted earlier seems like the way to go.
I really dislike that patch. I think it's doubling down on the messy,
unstructured coding patterns that got us into this situation to begin
with. I'd prefer to expend a little effort on refactoring so that
the ExecCleanupTupleRouting call can be moved to the cleanup function
where it belongs.
So, I propose the attached, which invents a new struct to carry
the stuff we've discovered to be necessary. This makes the APIs
noticeably cleaner IMHO.
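For illustration, the carrier struct might look roughly like this; the field names and member layout below are guesses, not necessarily what the attached patch uses:

```c
/*
 * Rough sketch of the proposed carrier struct; field names here are
 * guesses for illustration, not the actual patch.
 */
#include <stddef.h>

typedef struct EState EState;
typedef struct ResultRelInfo ResultRelInfo;
typedef struct ModifyTableState ModifyTableState;
typedef struct PartitionTupleRouting PartitionTupleRouting;

typedef struct ApplyExecutionData
{
    EState         *estate;         /* always set up at creation time */
    ResultRelInfo  *resultRelInfo;  /* target of the apply operation */

    /* set only when the target is a partitioned table: */
    ModifyTableState       *mtstate;
    PartitionTupleRouting  *proute;
} ApplyExecutionData;

/* cleanup can then check one place to know whether routing was used */
static int
used_tuple_routing(const ApplyExecutionData *edata)
{
    return edata->proute != NULL;
}
```

With everything the shutdown path needs travelling in one struct, the cleanup function can decide for itself whether an ExecCleanupTupleRouting call is required.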
I did not touch the APIs of the apply_XXX_internal functions,
as it didn't really seem to offer any notational advantage.
We can't simply collapse them to take an "edata" as I did for
apply_handle_tuple_routing, because the ResultRelInfo they're
supposed to operate on could be different from the original one.
I considered a couple of alternatives:
* Replace their estate arguments with edata, but keep the separate
ResultRelInfo arguments. This might be worth doing in future, if we
add more fields to ApplyExecutionData. Right now it'd save nothing,
and it'd create a risk of confusion about when to use the
ResultRelInfo argument vs. edata->resultRelInfo.
* Allow apply_handle_tuple_routing to overwrite edata->resultRelInfo
with the partition child's RRI, then simplify the apply_XXX_internal
functions to take just edata instead of separate estate and
resultRelInfo args. I think this would work right now, but it seems
grotty, and it might cause problems in future.
* Replace the edata->resultRelInfo field with two fields, one for
the original parent and one for the actual/current target. Perhaps
this is worth doing, not sure.
Thoughts?
regards, tom lane
Attachments:
postpone-tuple-routing-cleanup.patch (text/x-diff, +60/-39)
On Wed, May 19, 2021 at 04:23:55PM -0400, Tom Lane wrote:
I really dislike that patch. I think it's doubling down on the messy,
unstructured coding patterns that got us into this situation to begin
with. I'd prefer to expend a little effort on refactoring so that
the ExecCleanupTupleRouting call can be moved to the cleanup function
where it belongs.
Okay.
I did not touch the APIs of the apply_XXX_internal functions,
as it didn't really seem to offer any notational advantage.
We can't simply collapse them to take an "edata" as I did for
apply_handle_tuple_routing, because the ResultRelInfo they're
supposed to operate on could be different from the original one.
I considered a couple of alternatives:

* Replace their estate arguments with edata, but keep the separate
ResultRelInfo arguments. This might be worth doing in future, if we
add more fields to ApplyExecutionData. Right now it'd save nothing,
and it'd create a risk of confusion about when to use the
ResultRelInfo argument vs. edata->resultRelInfo.
Not sure about this one. It may be better to wait until this gets
more expanded, if it gets expanded.
* Allow apply_handle_tuple_routing to overwrite edata->resultRelInfo
with the partition child's RRI, then simplify the apply_XXX_internal
functions to take just edata instead of separate estate and
resultRelInfo args. I think this would work right now, but it seems
grotty, and it might cause problems in future.
Agreed that it does not seem like a good idea to blindly overwrite
resultRelInfo with the partition targeted for the apply.
* Replace the edata->resultRelInfo field with two fields, one for
the original parent and one for the actual/current target. Perhaps
this is worth doing, not sure.
This one sounds more natural to me, though.
Thoughts?
May I ask why you are not moving the snapshot pop and push into the
finish() and create() routines for this patch? Also, any thoughts
about adding the trigger tests from 013_partition.pl to REL_13_STABLE?
--
Michael
On Wed, May 19, 2021 at 02:36:03PM -0400, Andrew Dunstan wrote:
Yeah, this area needs substantial improvement. I have seen similar sorts
of nasty hangs, where the script is waiting forever for some process
that hasn't got the shutdown message. At least we probably need some way
of making sure the END handler doesn't abort early. Maybe
PostgresNode::stop() needs a mode that handles failure more gracefully.
Maybe it needs to try shutting down all the nodes and only calling
BAIL_OUT after trying all of them and getting a failure. But that might
still leave us work to do on failures occuring pre-END.
For that, we could just make the END block call run_log() directly
as well, as this catches stderr and an error code. What about making
the shutdown a two-phase logic, by the way? Trigger an immediate stop,
and if it fails, fall back to an extra kill9() to be on the safe side.
Have you seen this being a problem even in cases where the tests all
passed? If yes, it may be worth using the more aggressive flow even
in the case where the tests pass.
--
Michael
On Thu, May 20, 2021 at 5:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Langote <amitlangote09@gmail.com> writes:
IOW, the patch you posted earlier seems like the way to go.
I really dislike that patch. I think it's doubling down on the messy,
unstructured coding patterns that got us into this situation to begin
with. I'd prefer to expend a little effort on refactoring so that
the ExecCleanupTupleRouting call can be moved to the cleanup function
where it belongs.

So, I propose the attached, which invents a new struct to carry
the stuff we've discovered to be necessary. This makes the APIs
noticeably cleaner IMHO.
Larger footprint, but definitely cleaner. Thanks.
I did not touch the APIs of the apply_XXX_internal functions,
as it didn't really seem to offer any notational advantage.
We can't simply collapse them to take an "edata" as I did for
apply_handle_tuple_routing, because the ResultRelInfo they're
supposed to operate on could be different from the original one.
I considered a couple of alternatives:

* Replace their estate arguments with edata, but keep the separate
ResultRelInfo arguments. This might be worth doing in future, if we
add more fields to ApplyExecutionData. Right now it'd save nothing,
and it'd create a risk of confusion about when to use the
ResultRelInfo argument vs. edata->resultRelInfo.

* Allow apply_handle_tuple_routing to overwrite edata->resultRelInfo
with the partition child's RRI, then simplify the apply_XXX_internal
functions to take just edata instead of separate estate and
resultRelInfo args. I think this would work right now, but it seems
grotty, and it might cause problems in future.

* Replace the edata->resultRelInfo field with two fields, one for
the original parent and one for the actual/current target. Perhaps
this is worth doing, not sure.

Thoughts?
IMHO, it would be better to keep the lowest-level
apply_handle_XXX_internal() out of this, because presumably we're only
cleaning up the mess in higher-level callers. Somewhat related, one
of the intentions behind a04daa97a43, which removed
es_result_relation_info in favor of passing the ResultRelInfo
explicitly to the executor's lower-level functions, was to avoid bugs
caused by failing to set/reset that global field correctly in
higher-level callers. Now worker.c is pretty small compared with the
executor, but still it seems like a good idea to follow the same
principle here.
--
Amit Langote
EDB: http://www.enterprisedb.com
Michael Paquier <michael@paquier.xyz> writes:
On Wed, May 19, 2021 at 04:23:55PM -0400, Tom Lane wrote:
* Replace the edata->resultRelInfo field with two fields, one for
the original parent and one for the actual/current target. Perhaps
this is worth doing, not sure.
This one sounds more natural to me, though.
OK, I'll give that a try tomorrow.
May I ask why you are not moving the snapshot pop and push into the
finish() and create() routines for this patch?
That does need to happen, but I figured I'd leave it to the other
patch, since there are other things to change too for that snapshot
problem. Obviously, whichever patch goes in second will need trivial
adjustments, but I think it's logically clearer that way.
Also, any thoughts
about adding the trigger tests from 013_partition.pl to REL_13_STABLE?
Yeah, if this is a pre-existing problem then we should back-port the
tests that revealed it.
regards, tom lane
I wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Wed, May 19, 2021 at 04:23:55PM -0400, Tom Lane wrote:
* Replace the edata->resultRelInfo field with two fields, one for
the original parent and one for the actual/current target. Perhaps
this is worth doing, not sure.
This one sounds more natural to me, though.
OK, I'll give that a try tomorrow.
Here's a version that does it like that. I'm not entirely convinced
whether this is better or not.
regards, tom lane
Attachments:
postpone-tuple-routing-cleanup-2.patch (text/x-diff, +93/-55)
On Thu, May 20, 2021 at 02:57:40PM -0400, Tom Lane wrote:
Here's a version that does it like that. I'm not entirely convinced
whether this is better or not.
Hmm. I think that this is better. This makes the code easier to
follow, and the extra information is useful for debugging.
The change looks good to me.
--
Michael
On Fri, May 21, 2021 at 6:29 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, May 20, 2021 at 02:57:40PM -0400, Tom Lane wrote:
Here's a version that does it like that. I'm not entirely convinced
whether this is better or not.

Hmm. I think that this is better. This makes the code easier to
follow, and the extra information is useful for debugging.

The change looks good to me.
Yeah, the change looks good to me as well but I think we should
consider Amit L's point that maintaining this extra activeRelInfo
might be prone to bugs if the partitioning logic needs to be extended
at other places in worker.c. As the code stands today, it doesn't
seem problematic so we can go with the second patch if both Tom and
you feel that is a better option.
--
With Regards,
Amit Kapila.
Amit Langote <amitlangote09@gmail.com> writes:
IMHO, it would be better to keep the lowest-level
apply_handle_XXX_internal() out of this, because presumably we're only
cleaning up the mess in higher-level callers. Somewhat related, one
of the intentions behind a04daa97a43, which removed
es_result_relation_info in favor of passing the ResultRelInfo
explicitly to the executor's lower-level functions, was to avoid bugs
caused by failing to set/reset that global field correctly in
higher-level callers.
Yeah, that's a fair point, and after some reflection I think that
repeatedly changing the "active" field of the struct is exactly
what was bothering me about the v2 patch. So in the attached v3,
I went back to passing that as an explicit argument. The state
struct now has no fields that need to change after first being set.
I did notice that we could remove some other random arguments
by adding the LogicalRepRelMapEntry* to the state struct,
so this also does that.
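The v3 shape can be illustrated with a stripped-down sketch (simplified stand-in types, not the actual worker.c structures): the shared context never changes after setup, and the relation actually being written to travels as an explicit argument.

```c
/*
 * Stripped-down sketch of the v3 shape; simplified stand-in types, not
 * the actual worker.c structures.
 */
typedef struct ResultRelInfo
{
    int     rel_id;
    int     tuples_applied;
} ResultRelInfo;

/* shared context: no "active" field that must be set/reset per call */
typedef struct ApplyExecutionData
{
    ResultRelInfo *targetRelInfo;   /* the original target, never rewritten */
} ApplyExecutionData;

/* lower-level worker: operates on whichever relinfo it is handed */
static void
apply_insert_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo)
{
    (void) edata;                   /* read-only shared context here */
    relinfo->tuples_applied++;
}

/*
 * Tuple routing hands the partition child's relinfo down explicitly,
 * instead of overwriting a field in edata.
 */
static void
apply_via_routing(ApplyExecutionData *edata, ResultRelInfo *child)
{
    apply_insert_internal(edata, child);
}
```

Because nothing in the struct is rewritten mid-apply, there is no reset step for higher-level callers to forget, which was the class of bug a04daa97a4 was aimed at.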
regards, tom lane