Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Started by Bharath Rupireddyalmost 6 years ago29 messageshackers
Jump to latest
#1Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com

Hi,

I found an issue while executing a backup use case(please see [1]CREATE TABLE t1 (id SERIAL); INSERT INTO t1 (id) SELECT * FROM generate_series(1, 20000000); SELECT * FROM pg_start_backup('label', false, false); /*JIT is enabled in my session and it is being picked by below query*/ EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT COUNT(*) FROM t1; SELECT * FROM pg_stop_backup(false, true); for
queries) on postgres version 12.

Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
on_shmem_exit call back which will
add this function to the before_shmem_exit_list array which is
supposed to be removed on pg_stop_backup
so that we can do the pending cleanup and issue a warning for each
pg_start_backup for which we did not call
the pg_stop backup. Now, I executed a query for which JIT is picked
up, then the the llvm compiler inserts it's
own exit callback i.e. llvm_shutdown at the end of
before_shmem_exit_list array. Now, I executed pg_stop_backup
and call to cancel_before_shmem_exit() is made with the expectation
that the nonexclusive_base_backup_cleanup
callback is removed from before_shmem_exit_list array.

Since the cancel_before_shmem_exit() only checks the last entry in the
before_shmem_exit_list array, which is
llvm compiler's exit callback, so we exit the
cancel_before_shmem_exit() without removing the intended
nonexclusive_base_backup_cleanup callback which remains still the
before_shmem_exit_list and gets executed
during the session exit throwing a warning "aborting backup due to
backend exiting before pg_stop_backup was called",
which is unintended.

Attached is the patch that fixes the above problem by making
cancel_before_shmem_exit() to look for the
given function(and for given args) in the entire
before_shmem_exit_list array, not just the last entry, starting
from the last entry.

Request the community take this patch for review for v12.

Having said that, abovementioned problem for backup use case does not
occur for v13 and latest versions of
postgres (please below description[2]for v13 and latest versions, start_backup first registers do_pg_abort_backup, and then pg_start_backup_callback, performs startup backup operations and unregisters only pg_start_backup_callback from before_shmem_exit_list, retaining do_pg_abort_backup still in the list, which is to be called on session's exit JIT compiler inserts it's own exit call back at the end of before_shmem_exit_list array. stop_backup registers pg_stop_backup_callback, performs stop operations, unregisters pg_stop_backup_callback from before_shmem_exit_list, and sets the sessionBackupState = SESSION_BACKUP_NONE, note that the callback do_pg_abort_backup registered by start_backup command still exists in the before_shmem_exit_list and will not be removed by stop_backup. On session exit, do_pg_abort_backup gets called but returns without performing any operations(not even throwing a warning), by checking sessionBackupState which was set to SESSION_BACKUP_NONE by stop_backup.), but these kinds of issues can
come, if the cancel_before_shmem_exit()
is left to just look at the last array entry while removing a
registered callback.

There's also a comment in cancel_before_shmem_exit() function
description "For simplicity, only the latest entry
can be removed. (We could work harder but there is no need for current uses.)

Since we start to find use cases now, there is a need to enhance
cancel_before_shmem_exit(), so I also propose
to have the same attached patch for v13 and latest versions.

Thoughts?

[1]: CREATE TABLE t1 (id SERIAL); INSERT INTO t1 (id) SELECT * FROM generate_series(1, 20000000); SELECT * FROM pg_start_backup('label', false, false); /*JIT is enabled in my session and it is being picked by below query*/ EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT COUNT(*) FROM t1; SELECT * FROM pg_stop_backup(false, true);
CREATE TABLE t1 (id SERIAL);
INSERT INTO t1 (id) SELECT * FROM generate_series(1, 20000000);
SELECT * FROM pg_start_backup('label', false, false);
/*JIT is enabled in my session and it is being picked by below query*/
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT COUNT(*) FROM t1;
SELECT * FROM pg_stop_backup(false, true);

[2]: for v13 and latest versions, start_backup first registers do_pg_abort_backup, and then pg_start_backup_callback, performs startup backup operations and unregisters only pg_start_backup_callback from before_shmem_exit_list, retaining do_pg_abort_backup still in the list, which is to be called on session's exit JIT compiler inserts it's own exit call back at the end of before_shmem_exit_list array. stop_backup registers pg_stop_backup_callback, performs stop operations, unregisters pg_stop_backup_callback from before_shmem_exit_list, and sets the sessionBackupState = SESSION_BACKUP_NONE, note that the callback do_pg_abort_backup registered by start_backup command still exists in the before_shmem_exit_list and will not be removed by stop_backup. On session exit, do_pg_abort_backup gets called but returns without performing any operations(not even throwing a warning), by checking sessionBackupState which was set to SESSION_BACKUP_NONE by stop_backup.
for v13 and latest versions, start_backup first registers do_pg_abort_backup,
and then pg_start_backup_callback, performs startup backup operations
and unregisters only pg_start_backup_callback from before_shmem_exit_list,
retaining do_pg_abort_backup still in the list, which is to be called
on session's exit
JIT compiler inserts it's own exit call back at the end of
before_shmem_exit_list array.
stop_backup registers pg_stop_backup_callback, performs stop operations,
unregisters pg_stop_backup_callback from before_shmem_exit_list, and sets
the sessionBackupState = SESSION_BACKUP_NONE, note that the callback
do_pg_abort_backup registered by start_backup command still exists in the
before_shmem_exit_list and will not be removed by stop_backup. On session exit,
do_pg_abort_backup gets called but returns without performing any operations(not
even throwing a warning), by checking sessionBackupState which was set to
SESSION_BACKUP_NONE by stop_backup.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Modify-cancel_before_shmem_exit-to-search-all-reg.patchapplication/octet-stream; name=v1-0001-Modify-cancel_before_shmem_exit-to-search-all-reg.patchDownload+53-9
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#1)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
on_shmem_exit call back which will
add this function to the before_shmem_exit_list array which is
supposed to be removed on pg_stop_backup
so that we can do the pending cleanup and issue a warning for each
pg_start_backup for which we did not call
the pg_stop backup. Now, I executed a query for which JIT is picked
up, then the the llvm compiler inserts it's
own exit callback i.e. llvm_shutdown at the end of
before_shmem_exit_list array. Now, I executed pg_stop_backup
and call to cancel_before_shmem_exit() is made with the expectation
that the nonexclusive_base_backup_cleanup
callback is removed from before_shmem_exit_list array.

I'm of the opinion that the JIT code is abusing this mechanism, and the
right thing to do is fix that. The restriction you propose to remove is
not just there out of laziness, it's an expectation about what safe use of
this mechanism would involve. Un-ordered removal of callbacks seems
pretty risky: it would mean that whatever cleanup is needed is not going
to be done in LIFO order.

regards, tom lane

#3Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Hi,

On 2020-07-07 09:44:41 -0400, Tom Lane wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
on_shmem_exit call back which will
add this function to the before_shmem_exit_list array which is
supposed to be removed on pg_stop_backup
so that we can do the pending cleanup and issue a warning for each
pg_start_backup for which we did not call
the pg_stop backup. Now, I executed a query for which JIT is picked
up, then the the llvm compiler inserts it's
own exit callback i.e. llvm_shutdown at the end of
before_shmem_exit_list array. Now, I executed pg_stop_backup
and call to cancel_before_shmem_exit() is made with the expectation
that the nonexclusive_base_backup_cleanup
callback is removed from before_shmem_exit_list array.

I'm of the opinion that the JIT code is abusing this mechanism, and the
right thing to do is fix that.

What are you proposing? For now we could easily enough work around this
by just making it a on_proc_exit() callback, but that doesn't really
change the fundamental issue imo.

The restriction you propose to remove is not just there out of
laziness, it's an expectation about what safe use of this mechanism
would involve. Un-ordered removal of callbacks seems pretty risky: it
would mean that whatever cleanup is needed is not going to be done in
LIFO order.

Maybe I am confused, but isn't it <13's pg_start_backup() that's
violating the protocol much more clearly than the JIT code? Given that
it relies on there not being any callbacks registered between two SQL
function calls? I mean, what it does is basically:

1) before_shmem_exit(nonexclusive_base_backup_cleanup...
2) arbitrary code executed for a long time
3) cancel_before_shmem_exit(nonexclusive_base_backup_cleanup...

which pretty obviously can't at all deal with any other
before_shmem_exit callbacks being registered in 2). Won't this be a
problem for every other before_shmem_exit callback that we register
on-demand? Say Async_UnlistenOnExit, RemoveTempRelationsCallback?

Greetings,

Andres Freund

#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#3)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Tue, Jul 7, 2020 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:

On 2020-07-07 09:44:41 -0400, Tom Lane wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
on_shmem_exit call back which will
add this function to the before_shmem_exit_list array which is
supposed to be removed on pg_stop_backup
so that we can do the pending cleanup and issue a warning for each
pg_start_backup for which we did not call
the pg_stop backup. Now, I executed a query for which JIT is picked
up, then the the llvm compiler inserts it's
own exit callback i.e. llvm_shutdown at the end of
before_shmem_exit_list array. Now, I executed pg_stop_backup
and call to cancel_before_shmem_exit() is made with the expectation
that the nonexclusive_base_backup_cleanup
callback is removed from before_shmem_exit_list array.

I'm of the opinion that the JIT code is abusing this mechanism, and the
right thing to do is fix that.

What are you proposing? For now we could easily enough work around this
by just making it a on_proc_exit() callback, but that doesn't really
change the fundamental issue imo.

The restriction you propose to remove is not just there out of
laziness, it's an expectation about what safe use of this mechanism
would involve. Un-ordered removal of callbacks seems pretty risky: it
would mean that whatever cleanup is needed is not going to be done in
LIFO order.

I quickly searched(in HEAD) what all the callbacks are getting
registered to before_shmem_exit_list, with the intention to see if
they also call corresponding cancel_before_shmem_exit() after their
intended usage is done.

For few of the callbacks there is no cancel_before_shmem_exit(). This
seems expected; those callbacks ought to be executed before shmem
exit. These callbacks are(let say SET 1): ShutdownPostgres,
logicalrep_worker_onexit, llvm_shutdown, Async_UnlistenOnExit,
RemoveTempRelationsCallback, ShutdownAuxiliaryProcess,
do_pg_abort_backup in xlog.c (this callback exist only in v13 or
later), AtProcExit_Twophase.

Which means, once they are into the before_shmem_exit_list array, in
some order, they are never going to be removed from it as they don't
have corresponding cancel_before_shmem_exit() and the relative order
of execution remains the same.

And there are other callbacks that are getting registered to
before_shmem_exit_list array(let say SET 2): apw_detach_shmem,
_bt_end_vacuum_callback, pg_start_backup_callback,
pg_stop_backup_callback, createdb_failure_callback,
movedb_failure_callback, do_pg_abort_backup(in basebackup.c). They all
have corresponding cancel_before_shmem_exit() to unregister/remove the
callbacks from before_shmem_exit_list array.

I think the callbacks that have no cancel_before_shmem_exit()(SET 1)
may have to be executed in the LIFO order: it makes sense to execute
ShutdownPostgres at the end after let's say other callbacks in SET 1.

And the SET 2 callbacks have cancel_before_shmem_exit() with the only
intention that there's no need to call the callbacks on the
before_shmem_exit(), since they are not needed, and try to remove from
the before_shmem_exit_list array and may fail, if any other callback
gets registered in between.

If I'm not wrong with the above points, we must enhance
cancel_before_shmem_exit() or have cancel_before_shmem_exit_v2() (as
mentioned in my below response).

Maybe I am confused, but isn't it <13's pg_start_backup() that's
violating the protocol much more clearly than the JIT code? Given that
it relies on there not being any callbacks registered between two SQL
function calls? I mean, what it does is basically:

1) before_shmem_exit(nonexclusive_base_backup_cleanup...
2) arbitrary code executed for a long time
3) cancel_before_shmem_exit(nonexclusive_base_backup_cleanup...

which pretty obviously can't at all deal with any other
before_shmem_exit callbacks being registered in 2). Won't this be a
problem for every other before_shmem_exit callback that we register
on-demand? Say Async_UnlistenOnExit, RemoveTempRelationsCallback?

Yes, for versions <13's, clearly pg_start_backup causes the problem
and the issue can also be reproduced with Async_UnlistenOnExit,
RemoveTempRelationsCallback coming in between pg_start_backup and
pg_stop_backup.

We can have it fixed in a few ways: 1) enhance
cancel_before_shmem_exit() as attached in the original patch. 2) have
existing cancel_before_shmem_exit(), whenever called for
nonexclusive_base_backup_cleanup(), we can look for the entire array
instead of just the last entry. 3) have a separate function, say,
cancel_before_shmem_exit_v2(), that searches for the entire
before_shmem_exit_list array(the logic proposed in this patch) so that
it will not disturb the existing cancel_before_shmem_exit(). 4) or try
to have the pg_start_backup code that exists in after > 13 versions.

If okay to have cancel_before_shmem_exit_v2() for versions < 13's,
with the searching for the entire array instead of just the last
element to fix the abort issue, maybe we can have this function in
version 13 and latest as well(at least with disable mode, something
like #if 0 ... #endif), so that in case if any of similar issues arise
we could just quickly reuse.

If the before_shmem_exit_list array is to be used in LIFO order, do we
have some comment/usage guideline mentioned in the ipc.c/.h? I didn't
find one.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Tue, Jul 7, 2020 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:

What are you proposing? For now we could easily enough work around this
by just making it a on_proc_exit() callback, but that doesn't really
change the fundamental issue imo.

I think it would be more correct for it to be an on_proc_exit()
callback, because before_shmem_exit() callbacks can and do perform
actions which rely on an awful lot of the system being still in a
working state. RemoveTempRelationsCallback() is a good example: it
thinks it can start and end transactions and make a bunch of catalog
changes. I don't know that any of that could use JIT, but moving the
JIT cleanup to the on_shmem_exit() stage seems better. At that point,
there shouldn't be anybody doing anything that relies on being able to
perform logical changes to the database; we're just shutting down
low-level subsystems at that point, and thus presumably not doing
anything that could possibly need JIT.

But I also agree that what pg_start_backup() was doing before v13 was
wrong; that's why I committed
303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
back-patch it is because the consequences are so minor I didn't think
it was worth worrying about. We could, though. I'd be somewhat
inclined to both do that and also change LLVM to use on_proc_exit() in
master, but I don't feel super-strongly about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#5)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Tue, Jul 21, 2020 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 7, 2020 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:

What are you proposing? For now we could easily enough work around this
by just making it a on_proc_exit() callback, but that doesn't really
change the fundamental issue imo.

I think it would be more correct for it to be an on_proc_exit()
callback, because before_shmem_exit() callbacks can and do perform
actions which rely on an awful lot of the system being still in a
working state. RemoveTempRelationsCallback() is a good example: it
thinks it can start and end transactions and make a bunch of catalog
changes. I don't know that any of that could use JIT, but moving the
JIT cleanup to the on_shmem_exit() stage seems better. At that point,
there shouldn't be anybody doing anything that relies on being able to
perform logical changes to the database; we're just shutting down
low-level subsystems at that point, and thus presumably not doing
anything that could possibly need JIT.

I looked at what actually llvm_shutdown() does? It frees up JIT stacks,
also if exists perf related resource, using LLVMOrcDisposeInstance() and
LLVMOrcUnregisterPerf(), that were dynamically allocated in
llvm_session_initialize through a JIT library function
LLVMOrcCreateInstance() [1]https://llvm.org/doxygen/OrcCBindings_8cpp_source.html.

It looks like there is no problem in moving llvm_shutdown to either
on_shmem_exit() or on_proc_exit().

[1]: https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

But I also agree that what pg_start_backup() was doing before v13 was
wrong; that's why I committed
303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
back-patch it is because the consequences are so minor I didn't think
it was worth worrying about. We could, though. I'd be somewhat
inclined to both do that and also change LLVM to use on_proc_exit() in
master, but I don't feel super-strongly about it.

Patch: v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patch
Moved llvm_shutdown to on_proc_exit() call back list. Request to consider
this change for master, if possible <=13 versions. Basic JIT use cases and
regression tests are working fine with the patch.

Patches: PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patch
and PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patch
Request to consider the commit
303640199d0436c5e7acdf50b837a027b5726594(above two patches are for this
commit) to versions < 13, to fix the abort issue. Please note that the
above two patches have no difference in the code, just I made it applicable
on PG11.

Patch: v1-0001-Modify-cancel_before_shmem_exit-comments.patch
This patch, modifies cancel_before_shmem_exit() function comment to reflect
the safe usage of before_shmem_exit_list callback mechanism and also
removes the point "For simplicity, only the latest entry can be
removed*********" as this gives a meaning that there is still scope for
improvement in cancel_before_shmem_exit() search mechanism.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patchapplication/x-patch; name=v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patchDownload+1-2
PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patchapplication/x-patch; name=PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patchDownload+35-34
PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patchapplication/x-patch; name=PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patchDownload+35-34
v1-0001-Modify-cancel_before_shmem_exit-comments.patchapplication/x-patch; name=v1-0001-Modify-cancel_before_shmem_exit-comments.patchDownload+3-4
#7Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I looked at what actually llvm_shutdown() does? It frees up JIT stacks, also if exists perf related resource, using LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(), that were dynamically allocated in llvm_session_initialize through a JIT library function LLVMOrcCreateInstance() [1].

It looks like there is no problem in moving llvm_shutdown to either on_shmem_exit() or on_proc_exit().

If it doesn't involve shared memory, I guess it can be on_proc_exit()
rather than on_shmem_exit().

I guess the other question is why we're doing it at all. What
resources are being allocated that wouldn't be freed up by process
exit anyway?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#7)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Fri, Jul 24, 2020 at 8:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I looked at what actually llvm_shutdown() does? It frees up JIT stacks,

also if exists perf related resource, using LLVMOrcDisposeInstance() and
LLVMOrcUnregisterPerf(), that were dynamically allocated in
llvm_session_initialize through a JIT library function
LLVMOrcCreateInstance() [1]https://llvm.org/doxygen/OrcCBindings_8cpp_source.html.

It looks like there is no problem in moving llvm_shutdown to either

on_shmem_exit() or on_proc_exit().

If it doesn't involve shared memory, I guess it can be on_proc_exit()
rather than on_shmem_exit().

I guess the other question is why we're doing it at all. What
resources are being allocated that wouldn't be freed up by process
exit anyway?

LLVMOrcCreateInstance() and LLVMOrcDisposeInstance() are doing new and
delete respectively, I just found these functions from the link [1]https://llvm.org/doxygen/OrcCBindings_8cpp_source.html. But I
don't exactly know whether there are any other resources being allocated
that can't be freed up by proc_exit(). Tagging @Andres Freund for inputs
on whether we have any problem making llvm_shutdown() a on_proc_exit()
callback instead of before_shmem_exit() callback.

And as suggested in the previous mails, we wanted to make it on_proc_exit()
to avoid the abort issue reported in this mail chain, however if we take
the abort issue fix commit # 303640199d0436c5e7acdf50b837a027b5726594 as
mentioned in the previous response[2]/messages/by-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg@mail.gmail.com, then it may not be necessary, right
now, but just to be safer and to avoid any of these similar kind of issues
in future, we can consider this change as well.

[1]: https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

LLVMOrcJITStackRef LLVMOrcCreateInstance(LLVMTargetMachineRef TM) {
TargetMachine *TM2(unwrap(TM));
Triple T(TM2->getTargetTriple());
auto IndirectStubsMgrBuilder =
orc::createLocalIndirectStubsManagerBuilder(T);
OrcCBindingsStack *JITStack =
new OrcCBindingsStack(*TM2, std::move(IndirectStubsMgrBuilder));
return wrap(JITStack);
}

LLVMErrorRef LLVMOrcDisposeInstance(LLVMOrcJITStackRef JITStack) {
auto *J = unwrap(JITStack);
auto Err = J->shutdown();
delete J;
return wrap(std::move(Err));
}

[2]: /messages/by-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg@mail.gmail.com
/messages/by-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg@mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#5)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Mon, Jul 20, 2020 at 3:47 PM Robert Haas <robertmhaas@gmail.com> wrote:

But I also agree that what pg_start_backup() was doing before v13 was
wrong; that's why I committed
303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
back-patch it is because the consequences are so minor I didn't think
it was worth worrying about. We could, though. I'd be somewhat
inclined to both do that and also change LLVM to use on_proc_exit() in
master, but I don't feel super-strongly about it.

Unless somebody complains pretty soon, I'm going to go ahead and do
what is described above.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#9)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Thu, Jul 30, 2020 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote:

Unless somebody complains pretty soon, I'm going to go ahead and do
what is described above.

Done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#10)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Thu, Aug 6, 2020 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 30, 2020 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote:

Unless somebody complains pretty soon, I'm going to go ahead and do
what is described above.

Done.

Thanks!

I have one more request to make: since we are of the opinion to not
change the way cancel_before_shmem_exit() searches
before_shmem_exit_list array, wouldn't it be good to adjust comments
before the function cancel_before_shmem_exit()?

I sent the patch previously[1]/messages/by-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg@mail.gmail.com, but attaching here again, modifies
cancel_before_shmem_exit() function comment to reflect the safe usage
of before_shmem_exit_list callback mechanism and also removes the
point "For simplicity, only the latest entry can be removed*********"
as this gives a meaning that there is still scope for improvement in
cancel_before_shmem_exit() search mechanism.

Thoughts?

[1]: /messages/by-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg@mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Modify-cancel_before_shmem_exit-comments.patchapplication/x-patch; name=v1-0001-Modify-cancel_before_shmem_exit-comments.patchDownload+3-4
#12Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#11)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I sent the patch previously[1], but attaching here again, modifies
cancel_before_shmem_exit() function comment to reflect the safe usage
of before_shmem_exit_list callback mechanism and also removes the
point "For simplicity, only the latest entry can be removed*********"
as this gives a meaning that there is still scope for improvement in
cancel_before_shmem_exit() search mechanism.

Thoughts?

I think that the first part of the comment change you suggest is a
good idea and would avoid developer confusion, but I think that the
statement about unordered removal of comments being risky doesn't add
much. It's too vague to help anybody and I don't think I believe it,
either. So I suggest something more like:

- * callback.  For simplicity, only the latest entry can be
- * removed.  (We could work harder but there is no need for
- * current uses.)
+ * callback.  We only look at the latest entry for removal, as we
+ * expect the caller to use before_shmem_exit callback mechanism
+ * in the LIFO order.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Robert Haas <robertmhaas@gmail.com> writes:

... So I suggest something more like:

- * callback.  For simplicity, only the latest entry can be
- * removed.  (We could work harder but there is no need for
- * current uses.)
+ * callback.  We only look at the latest entry for removal, as we
+ * expect the caller to use before_shmem_exit callback mechanism
+ * in the LIFO order.

That's a meaningless statement for any one caller. So it needs to be more
like "we expect callers to add and remove temporary before_shmem_exit
callbacks in strict LIFO order".

I wonder whether we ought to change the function to complain if the
last list entry doesn't match. We'd have caught this bug sooner
if it did, and it's not very clear why silently doing nothing is
a good idea when there's no match.

regards, tom lane

#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Fri, Aug 7, 2020 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's a meaningless statement for any one caller. So it needs to be more
like "we expect callers to add and remove temporary before_shmem_exit
callbacks in strict LIFO order".

Sure, that seems fine.

I wonder whether we ought to change the function to complain if the
last list entry doesn't match. We'd have caught this bug sooner
if it did, and it's not very clear why silently doing nothing is
a good idea when there's no match.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#12)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Hi,

On 2020-08-07 12:29:03 -0400, Robert Haas wrote:

On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I sent the patch previously[1], but attaching here again, modifies
cancel_before_shmem_exit() function comment to reflect the safe usage
of before_shmem_exit_list callback mechanism and also removes the
point "For simplicity, only the latest entry can be removed*********"
as this gives a meaning that there is still scope for improvement in
cancel_before_shmem_exit() search mechanism.

Thoughts?

I think that the first part of the comment change you suggest is a
good idea and would avoid developer confusion, but I think that the
statement about unordered removal of comments being risky doesn't add
much. It's too vague to help anybody and I don't think I believe it,
either. So I suggest something more like:

- * callback.  For simplicity, only the latest entry can be
- * removed.  (We could work harder but there is no need for
- * current uses.)
+ * callback.  We only look at the latest entry for removal, as we
+ * expect the caller to use before_shmem_exit callback mechanism
+ * in the LIFO order.

In which situations is the removal actually useful *and* safe, with
these constraints? You'd have to have a very narrow set of functions
that are called while the exit hook is present, i.e. basically this
would only be usable for PG_ENSURE_ERROR_CLEANUP and nothing else. And
even there it seems like it's pretty easy to get into a situation where
it's not safe.

Greetings,

Andres Freund

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Robert Haas (#14)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Fri, Aug 7, 2020 at 11:09 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 7, 2020 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's a meaningless statement for any one caller. So it needs to be more
like "we expect callers to add and remove temporary before_shmem_exit
callbacks in strict LIFO order".

Sure, that seems fine.

v2 patch has the comments modified.

I wonder whether we ought to change the function to complain if the
last list entry doesn't match. We'd have caught this bug sooner
if it did, and it's not very clear why silently doing nothing is
a good idea when there's no match.

+1.

This is a good idea. v3 patch has both the modified comments(from v2)
as well as a DEBUG3 (DEBUG3 level, because the other
non-error/non-fatal logs in ipc.c are using the same level) log to
report when the latest entry for removal is not matched with the one
the caller cancel_before_shmem_exit() is looking for and a hint on how
to safely use temporary before_shmem_exit() callbacks. In v3 patch,
function pointer is being printed, I'm not sure how much it is helpful
to have function pointers in the logs though there are some other
places printing pointers into the logs, I wish I could print function
names. (Is there a way we could get function names from function
pointers?).

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Modify-cancel_before_shmem_exit-comments.patchapplication/x-patch; name=v2-0001-Modify-cancel_before_shmem_exit-comments.patchDownload+3-4
v3-0001-Modify-cancel_before_shmem_exit-comments.patchapplication/x-patch; name=v3-0001-Modify-cancel_before_shmem_exit-comments.patchDownload+14-4
#17Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#15)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Fri, Aug 7, 2020 at 5:20 PM Andres Freund <andres@anarazel.de> wrote:

In which situations is the removal actually useful *and* safe, with
these constraints? You'd have to have a very narrow set of functions
that are called while the exit hook is present, i.e. basically this
would only be usable for PG_ENSURE_ERROR_CLEANUP and nothing else. And
even there it seems like it's pretty easy to get into a situation where
it's not safe.

Well, I don't really care whether or not we change this function to
iterate over the callback list or whether we add a warning that you
need to use it in LIFO order, but I think we should do one or the
other, because this same confusion has come up multiple times. I
thought that Tom was opposed to making it iterate over the callback
list (for reasons I don't really understand, honestly) so adding a
comment and a cross-check seemed like the practical option. Now I also
think it's fine to iterate over the callback list: this function
doesn't get used so much that it's likely to be a performance problem,
and I don't think this is the first bug that would have become a
non-bug had we done that years and years ago whenever it was first
proposed. In fact, I'd go so far as to say that the latter is a
slightly better option. However, doing nothing is clearly worst.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Robert Haas <robertmhaas@gmail.com> writes:

Well, I don't really care whether or not we change this function to
iterate over the callback list or whether we add a warning that you
need to use it in LIFO order, but I think we should do one or the
other, because this same confusion has come up multiple times. I
thought that Tom was opposed to making it iterate over the callback
list (for reasons I don't really understand, honestly) so adding a
comment and a cross-check seemed like the practical option. Now I also
think it's fine to iterate over the callback list: this function
doesn't get used so much that it's likely to be a performance problem,
and I don't think this is the first bug that would have become a
non-bug had we done that years and years ago whenever it was first
proposed. In fact, I'd go so far as to say that the latter is a
slightly better option. However, doing nothing is clearly worst.

I agree that doing nothing seems like a bad idea. My concern about
allowing non-LIFO callback removal is that it seems to me that such
usage patterns have likely got *other* bugs, so we should discourage
that. These callbacks don't exist in a vacuum: they reflect that
the mainline code path has set up, or torn down, important state.
Non-LIFO usage requires very strong assumptions that the states in
question are not interdependent, and that's something I'd rather not
rely on if we don't have to.

regards, tom lane

#19Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Mon, Aug 10, 2020 at 10:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I agree that doing nothing seems like a bad idea. My concern about
allowing non-LIFO callback removal is that it seems to me that such
usage patterns have likely got *other* bugs, so we should discourage
that. These callbacks don't exist in a vacuum: they reflect that
the mainline code path has set up, or torn down, important state.
Non-LIFO usage requires very strong assumptions that the states in
question are not interdependent, and that's something I'd rather not
rely on if we don't have to.

I have mixed feelings about this. On the one hand, I think saying that
it requires "very strong assumptions" about interdependence makes it
sound scarier than it is -- not that your statement is wrong, but that
in most cases we kinda know whether that's true or not, and the idea
that you shouldn't remove a callback upon which some later callback
might be depending is not too difficult for anybody to understand. On
the other hand, I think that in most cases we ought to be discouraging
people who are trying to do per-subsystem cleanup from using
before_shmem_exit() at all. I think such callbacks ought to be done
using on_shmem_exit() if they involve shared memory or on_proc_exit()
if they do not. Those functions don't have cancel_blah_exit()
variants, and I don't think they should: the right way to code those
things is not to remove the callbacks from the stack when they're no
longer needed, but rather to code the callbacks so that they will do
nothing if no work is required, leaving them permanently registered.

And the main reason why I think that such callbacks should be
registered using on_shmem_exit() or on_proc_exit() rather than
before_shmem_exit() is because of (1) the interdependency issue you
raise and (2) the fact that cancel_before_shmem_exit doesn't do what
people tend to think it does. For an example of (1), look at
ShutdownPostgres() and RemoveTempRelationsCallback(). The latter
aborts out of any transaction and then starts a new one to drop your
temp schema. But that might fail, so ShutdownPostgres() also needs to
be prepared to AbortOutOfAnyTransaction(). If you inserted more
cleanup steps that were thematically similar to those, each one of
them would also need to begin with AbortOutOfAnyTransaction(). That
suggests that this whole thing is a bit under-engineered. Some of the
other before_shmem_exit() callbacks don't start with that incantation,
but that's because they are per-subsystem callbacks that likely ought
to be using on_shmem_exit() rather than actually being the same sort
of thing.

Perhaps we really have four categories here:
(1) Temporary handlers for PG_ENSURE_ERROR_CLEANUP().
(2) High-level cleanup that needs to run after aborting out of the
current transaction.
(3) Per-subsystem shutdown for shared memory stuff.
(4) Per-subsystem shutdown for backend-private stuff.

Right now we blend (1), (2), and some of (3) together, but we could
try to create a cleaner line. We could redefine before_shmem_exit() as
being exactly #2, and abort out of any transaction before calling each
step, and document that you shouldn't use it unless you need that
behavior. And we could have a separate stack for #1 that is explicitly
LIFO and not intended for any other use. But then again maybe that's
overkill. What I do think we should do, after thinking about it more,
is discourage the casual use of before_shmem_exit() for things where
on_shmem_exit() or on_proc_exit() would be just as good. I think
that's what would avoid the most problems here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#19)
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Robert Haas <robertmhaas@gmail.com> writes:

Perhaps we really have four categories here:
(1) Temporary handlers for PG_ENSURE_ERROR_CLEANUP().
(2) High-level cleanup that needs to run after aborting out of the
current transaction.
(3) Per-subsystem shutdown for shared memory stuff.
(4) Per-subsystem shutdown for backend-private stuff.

Hmm, I don't think we actually have any of (2) do we? Or at least
we aren't using ipc.c callbacks for them.

What I do think we should do, after thinking about it more,
is discourage the casual use of before_shmem_exit() for things where
on_shmem_exit() or on_proc_exit() would be just as good. I think
that's what would avoid the most problems here.

I think we're mostly in violent agreement here. The interesting
question seems to be Andres' one about whether before_shmem_exit
actually has any safe use except for PG_ENSURE_ERROR_CLEANUP.
It may not, in which case perhaps we oughta rename it?

regards, tom lane

#21Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
#25Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#25)
#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#27)
#29Daniel Gustafsson
daniel@yesql.se
In reply to: Robert Haas (#9)