BUG #16785: Postgresql shutdown during initial pg_prewarm prewarming causes Pg to get stuck
The following bug has been logged on the website:
Bug reference: 16785
Logged by: John Thompson
Email address: jthompson@maxmind.com
PostgreSQL version: 12.5
Operating system: Ubuntu Linux
Description:
We use pg_prewarm via shared_preload_libraries and we've noticed that if we
start Pg then stop it again before pg_prewarm finishes its initial
prewarming, Pg gets stuck in shutdown until we kill the 'autoprewarm master'
process with 'kill -9'. Removing pg_prewarm from shared_preload_libraries
allows Pg to stop without issue. Also, if we start Pg, wait for 'autoprewarm
successfully prewarmed XXXXXX of YYYYYY previously-loaded blocks' to appear
in the logs, then stop, it shuts down as expected.
This is happening on several different servers, but the one I've been
testing on has 855569 lines in
/var/lib/postgresql/12/main/autoprewarm.blocks, 32GB of RAM, and
shared_buffers is 8028MB.
We're using the 12.5 packages from apt.postgresql.org. Normally we use
systemd to start and stop services, but here's an example with just
pg_ctl:
$ sudo -u postgres /usr/lib/postgresql/12/bin/pg_ctl start -D
/var/lib/postgresql/12/main -l /var/log/postgresql/postgresql-12-main.log -o
'-c config_file="/etc/postgresql/12/main/postgresql.conf"'
waiting for server to start.... done
server started
$ sudo -u postgres /usr/lib/postgresql/12/bin/pg_ctl stop -s -w -D
/var/lib/postgresql/12/main -m fast
pg_ctl: server does not shut down
$ ps auxwww | grep ^postgres
postgres 448 0.7 0.7 8667168 240252 ? Ss 21:10 0:01
/usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c
config_file=/etc/postgresql/12/main/postgresql.conf
postgres 455 0.1 0.0 182036 4252 ? Ss 21:10 0:00 postgres:
logger
postgres 458 0.0 0.0 8666132 4088 ? Ss 21:10 0:00 postgres:
checkpointer
postgres 462 0.0 0.0 185232 5928 ? Ss 21:10 0:00 postgres:
stats collector
postgres 463 0.3 0.0 8682976 22740 ? Ss 21:10 0:00 postgres:
autoprewarm master
postgres 507 0.0 0.0 8667996 12616 ? Ss 21:10 0:00 postgres:
walsender repmgr 10.150.0.16(33718) streaming 34A65/4243A5C8
postgres 508 0.0 0.0 8667996 12612 ? Ss 21:10 0:00 postgres:
walsender repmgr 10.150.0.15(52982) streaming 34A65/4243A5C8
$ sudo kill 463 -9
$ ps auxwww | grep ^postgres
$
Please let me know if any other information would be useful, I'm happy to
help. Thanks for any help you can offer.
PG Bug reporting form <noreply@postgresql.org> writes:
We use pg_prewarm via shared_preload_libraries and we've noticed that if we
start Pg then stop it again before pg_prewarm finishes its initial
prewarming, Pg gets stuck in shutdown until we kill the 'autoprewarm master'
process with 'kill -9'.
Yeah. I assume you've got multiple large databases? When I try this
here, the current autoprewarm worker process shuts down promptly, but
then the master process tries to launch another worker for the next DB,
and then just sits there because the postmaster is now ignoring worker
start requests.
So this is partly autoprewarm's fault, because it's taking no thought
for the possibility that a shutdown will be requested before it's
managed to preload everything. But it's *mostly* the fault of the
incredibly brain-dead design for background workers, which has no
way to tell worker requestors that their request is never going to
be granted. I've ranted about that before, with little effect on
the people responsible, so I guess at some point I'm going to have
to go fix it myself.
In the meantime, attached is a proposed patch that improves
autoprewarm's behavior. It will only get stuck if you're unlucky
enough to issue "pg_ctl stop" just as it's trying to launch
another worker, which should be a pretty tight window.
I noticed while testing this that when autoprewarm did get out of
the situation promptly, it'd rewrite autoprewarm.blocks with only
as many blocks as it'd managed to prewarm before being canceled.
This doesn't seem like a good thing, because then the next start
might have next door to no prewarm behavior. So the patch also
hacks it to skip the final autoprewarm.blocks dump if we were
told to terminate before the initial prewarming pass is done.
This is against HEAD but it seems to apply cleanly to v12.
If you're in a position to rebuild pg_prewarm with the patch,
please see if it resolves the problem for you.
regards, tom lane
Attachments:
teach-autoprewarm-about-early-shutdown-1.patchtext/x-diff; charset=us-ascii; name=teach-autoprewarm-about-early-shutdown-1.patchDownload+25-5
I wrote:
This is against HEAD but it seems to apply cleanly to v12.
If you're in a position to rebuild pg_prewarm with the patch,
please see if it resolves the problem for you.
Eh, sorry about that, patch(1) was willing to apply it but the
result didn't compile :-(. If you want to try local testing,
the version that actually works in v12 is at
regards, tom lane
Hi Tom,
Thanks so much for taking a look at this. As you mentioned the patch didn't
apply cleanly to 12, but I ended up making the same changes as are in the
link you posted, and it fixes the problem in my testing. I appreciate your
help.
Take care
On Tue, Dec 22, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
I wrote:
This is against HEAD but it seems to apply cleanly to v12.
If you're in a position to rebuild pg_prewarm with the patch,
please see if it resolves the problem for you.Eh, sorry about that, patch(1) was willing to apply it but the
result didn't compile :-(. If you want to try local testing,
the version that actually works in v12 is atregards, tom lane
John Thompson <jthompson@maxmind.com> writes:
Thanks so much for taking a look at this. As you mentioned the patch didn't
apply cleanly to 12, but I ended up making the same changes as are in the
link you posted, and it fixes the problem in my testing. I appreciate your
help.
OK, thanks for confirming!
regards, tom lane