O(n) tasks cause lengthy startups and checkpoints
Hi hackers,
Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, given there is community interest for such changes.
1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).
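The "move the directories aside" idea in items 3 and 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any patch: both helper names and the ".old" suffix are hypothetical, the directory is assumed to hold no subdirectories, and a crash-safe version would also need to fsync the parent directory (in PostgreSQL, something like durable_rename()).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical startup-side helper: rename "dir" to "<dir>.old" and create a
 * fresh, empty "dir". The cost does not depend on how many files are inside.
 * Fails if "<dir>.old" already exists and is not empty.
 * Returns 0 on success, -1 on failure.
 */
static int
move_aside_and_recreate(const char *dir, char *aside, size_t aside_len)
{
	int			n;

	assert(dir != NULL && aside != NULL);
	n = snprintf(aside, aside_len, "%s.old", dir);
	if (n < 0 || (size_t) n >= aside_len)
		return -1;				/* path would not fit */
	if (rename(dir, aside) != 0)
		return -1;
	return mkdir(dir, 0700);
}

/*
 * Hypothetical worker-side helper: unlink every entry in "aside", then remove
 * the directory itself. This is the O(n) part, now off the startup path.
 * Returns the number of files removed, or -1 on failure.
 */
static long
remove_aside_dir(const char *aside)
{
	DIR		   *d = opendir(aside);
	struct dirent *de;
	long		removed = 0;

	if (d == NULL)
		return -1;
	while ((de = readdir(d)) != NULL)
	{
		char		path[4096];
		int			n;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", aside, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(path) || unlink(path) != 0)
		{
			closedir(d);
			return -1;
		}
		removed++;
	}
	closedir(d);
	return rmdir(aside) == 0 ? removed : -1;
}
```

Startup would call only the first helper; the second could run later in whichever process ends up owning the cleanup.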
I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?
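For item 2, the syncfs() idea amounts to replacing one fsync() per mapping file with a single call that flushes the whole filesystem. A minimal sketch, assuming Linux and glibc (syncfs() is not portable); the helper name is hypothetical:

```c
#define _GNU_SOURCE				/* syncfs() is a Linux/glibc extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush all dirty data on the filesystem that contains "path" with a single
 * syncfs() call, instead of calling fsync() once per file.
 * Returns 0 on success, -1 on failure (errno is preserved).
 */
static int
sync_containing_filesystem(const char *path)
{
	int			fd;
	int			rc;
	int			save_errno;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	rc = syncfs(fd);
	save_errno = errno;
	close(fd);
	errno = save_errno;
	return rc;
}
```

The trade-off mentioned above applies: the call also writes out unrelated dirty data on the same filesystem, which is where the IO spike concern comes from.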
Nathan
[0]: /messages/by-id/32B59582-AA6C-4609-B08F-2256A271F7A5@amazon.com
+1 to the idea. I don't see a reason why the checkpointer has to do all of
that. Keeping checkpoints to the minimal essential work helps servers recover
faster in the event of a crash.
RemoveOldXlogFiles() is also an O(n) operation that could at least be skipped
during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY) checkpoint. When many
WAL files have accumulated and the previous checkpoint did not get a chance to
clean them up, this call can prolong the unavailability of the server:
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
Hi,
On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing. I imagine such a
> process would come in handy down the road, too. WDYT?
-1. I think the overhead of an additional worker is disproportional here. And
there's simplicity benefits in having a predictable cleanup interlock as well.
I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.
Greetings,
Andres Freund
On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> > I realize adding a new maintenance worker might be a bit heavy-handed,
> > but I think it would be nice to have somewhere to offload tasks that
> > really shouldn't impact startup and checkpointing. I imagine such a
> > process would come in handy down the road, too. WDYT?
> -1. I think the overhead of an additional worker is disproportional here. And
> there's simplicity benefits in having a predictable cleanup interlock as well.
Another idea I had was to put some upper limit on how much time is
spent on such tasks. For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth, so perhaps we would limit (or even skip) such tasks
only for end-of-recovery and shutdown checkpoints. Perhaps the
startup tasks could be limited in a similar fashion.
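A rough sketch of the time-limit idea, for illustration only. The helper names are hypothetical, and unlike CheckPointSnapBuild() this version removes every regular file it finds rather than only those older than a cutoff:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed on the monotonic clock since "start". */
static double
seconds_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (double) (now.tv_sec - start->tv_sec) +
		(double) (now.tv_nsec - start->tv_nsec) / 1e9;
}

/*
 * Hypothetical helper: unlink regular files in "dir" until the directory has
 * been fully scanned or "budget_secs" has elapsed, whichever comes first.
 * Sets *finished to true only if the whole directory was scanned.
 * Returns the number of files removed, or -1 if "dir" cannot be opened.
 */
static long
remove_files_with_budget(const char *dir, double budget_secs, bool *finished)
{
	DIR		   *d;
	struct dirent *de;
	struct timespec start;
	long		removed = 0;

	assert(dir != NULL && finished != NULL);
	*finished = false;
	d = opendir(dir);
	if (d == NULL)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (;;)
	{
		char		path[4096];
		struct stat st;
		int			n;

		de = readdir(d);
		if (de == NULL)
		{
			/* End of directory (a real version would also check errno). */
			*finished = true;
			break;
		}
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		/* Out of time: leave the rest for the next checkpoint. */
		if (seconds_since(&start) >= budget_secs)
			break;
		n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(path))
			continue;
		if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
			continue;
		if (unlink(path) == 0)
			removed++;
	}
	closedir(d);
	return removed;
}
```

When *finished comes back false, the leftover files simply wait for the next call, which is exactly where the unbounded-growth concern arises if the budget is too small relative to the rate of file creation.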
> I think particularly for the snapshot stuff it'd be better to optimize away
> unnecessary snapshot files, rather than making the cleanup more asynchronous.
I can look into this. Any pointers would be much appreciated.
Nathan
On Wed, Dec 1, 2021, at 9:19 PM, Bossart, Nathan wrote:
> On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> > > I realize adding a new maintenance worker might be a bit heavy-handed,
> > > but I think it would be nice to have somewhere to offload tasks that
> > > really shouldn't impact startup and checkpointing. I imagine such a
> > > process would come in handy down the road, too. WDYT?
> > -1. I think the overhead of an additional worker is disproportional here. And
> > there's simplicity benefits in having a predictable cleanup interlock as well.
> Another idea I had was to put some upper limit on how much time is
> spent on such tasks. For example, a checkpoint would only spend X
> minutes on CheckPointSnapBuild() before giving up until the next one.
> I think the main downside of that approach is that it could lead to
> unbounded growth, so perhaps we would limit (or even skip) such tasks
> only for end-of-recovery and shutdown checkpoints. Perhaps the
> startup tasks could be limited in a similar fashion.
Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Did you have a use case or even better numbers (% of checkpoint /
startup time) that makes your proposal worthwhile?
I would try to optimize (1) and (2). However, delayed removal can be a
long-term issue if the new routine cannot keep up with the pace of file
creation (especially if the checkpoints are far apart).
For (3), there is already a GUC that would avoid the slowdown during startup.
Use it if you think the startup time is more important than the disk space
occupied by useless files.
For (4), you are forgetting that the on-disk state of replication slots is
stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happens if there is a
crash before copying the state file?
While we are talking about items (1), (2) and (4), we could probably have an
option to create some ephemeral logical decoding files in a ramdisk (similar to
the statistics directory). I wouldn't like to hijack this thread, but this
proposal could alleviate the possible issues that you pointed out. If people are
interested in this proposal, I can start a new thread about it.
--
Euler Taveira
EDB https://www.enterprisedb.com/
+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some
marker files (for instance, it can write snapshot_<cutofflsn> files,
with the file name itself representing the cutoff LSN, so that the new bg
cleaner can remove the snapshot files; similarly, it can write marker
files for other file removals). Having said that, a new bg cleaner
deleting the files asynchronously on behalf of the checkpoint can look like
overkill until we have some numbers showing what we could save with this
approach. For this purpose, I did a small experiment to figure out how
long file deletion usually takes [1]. On an SSD, deleting 1 million files
took about 8 seconds; I'm sure it will take much longer on an HDD.
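To make the marker-file idea concrete, here is a small sketch of how a cleaner could recover the cutoff LSN from a marker file name. The "snapshot_cutoff_<hi>-<lo>" format is borrowed from the patch attached later in this thread; the helper names are hypothetical and the code is standalone rather than PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a marker file name of the form "snapshot_cutoff_<hi>-<lo>", where
 * <hi> and <lo> are the upper and lower 32 bits of the cutoff LSN in hex.
 * Returns 0 and stores the LSN on success, -1 if the name does not match
 * (in which case *cutoff is left untouched).
 */
static int
parse_cutoff_marker(const char *name, uint64_t *cutoff)
{
	unsigned int hi;
	unsigned int lo;

	assert(name != NULL && cutoff != NULL);
	if (sscanf(name, "snapshot_cutoff_%X-%X", &hi, &lo) != 2)
		return -1;
	*cutoff = ((uint64_t) hi << 32) | lo;
	return 0;
}

/* A snapshot file is removable when its LSN is below the cutoff. */
static int
snapshot_is_removable(uint64_t snapshot_lsn, uint64_t cutoff)
{
	return snapshot_lsn < cutoff;
}
```

With this scheme the checkpointer only has to create one small file per checkpoint, and the directory scan moves to whoever reads the marker.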
The bg cleaner can also be used for RemovePgTempFiles(): the postmaster
could just rename pgsql_temp to something like pgsql_temp_delete and
proceed with the server startup, and the bg cleaner could then delete
the files.
Also, we could do something similar for removing/recycling old xlog
files and StartupReorderBuffer.
Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)
[1]: on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds
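The program used for the measurements in [1] was not posted. A comparable experiment could look like the sketch below, which is an assumption about the method rather than the actual test: it creates empty files in a scratch directory and times only the unlink() loop. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Create directory "dir" with "nfiles" empty files named f0, f1, ..., then
 * time how long it takes to unlink them all, and finally remove "dir".
 * Returns elapsed seconds for the deletion phase only, or -1.0 on failure.
 */
static double
time_file_deletion(const char *dir, long nfiles)
{
	char		path[4096];
	struct timespec start;
	struct timespec end;
	long		i;

	assert(dir != NULL && nfiles >= 0);
	if (mkdir(dir, 0700) != 0)
		return -1.0;

	/* Setup phase (not timed): create the files. */
	for (i = 0; i < nfiles; i++)
	{
		FILE	   *f;
		int			n = snprintf(path, sizeof(path), "%s/f%ld", dir, i);

		if (n < 0 || (size_t) n >= sizeof(path))
			return -1.0;
		f = fopen(path, "w");
		if (f == NULL)
			return -1.0;
		fclose(f);
	}

	/* Timed phase: unlink every file. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < nfiles; i++)
	{
		snprintf(path, sizeof(path), "%s/f%ld", dir, i);
		if (unlink(path) != 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (rmdir(dir) != 0)
		return -1.0;
	return (double) (end.tv_sec - start.tv_sec) +
		(double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

The figures in [1] work out to roughly 6 to 8 microseconds per file. A run of this sketch deletes files that were just created, so it measures a warm cache; results on a cold cache or a different filesystem may differ considerably.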
Regards,
Bharath Rupireddy.
On 12/1/21, 6:06 PM, "Euler Taveira" <euler@eulerto.com> wrote:
> Saying that a certain task is O(n) doesn't mean it needs a separate process to
> handle it. Did you have a use case or even better numbers (% of checkpoint /
> startup time) that makes your proposal worthwhile?
I don't have specific numbers on hand, but each of the four functions
I listed is something I routinely see impacting customers.
> For (3), there is already a GUC that would avoid the slowdown during startup.
> Use it if you think the startup time is more important than the disk space
> occupied by useless files.
Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during restart after a backend crash. RemovePgTempFiles() is
still called for all other startups.
> For (4), you are forgetting that the on-disk state of replication slots is
> stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
> replication slot directory and copy the state file. What happens if there is a
> crash before copying the state file?
Good point. I think it's possible to deal with this, though. Perhaps
the files that should be deleted on startup should go in a separate
directory, or maybe we could devise a way to ensure the state file is
copied even if there is a crash at an inconvenient time.
Nathan
On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> +1 for the overall idea of making the checkpoint faster. In fact, we
> here at our team have been thinking about this problem for a while. If
> there are a lot of files that checkpoint has to loop over and remove,
> IMO, that task can be delegated to someone else (maybe a background
> worker called background cleaner or bg cleaner, of course, we can have
> a GUC to enable or disable it). The checkpoint can just write some
Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.
> Another idea could be to parallelize the checkpoint i.e. IIUC, the
> tasks that checkpoint do in CheckPointGuts are independent and if we
> have some counters like (how many snapshot/mapping files that the
> server generated)
Could you elaborate on this? Is your idea that the checkpointer would
create worker processes like autovacuum does?
Nathan
On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > +1 for the overall idea of making the checkpoint faster. In fact, we
> > here at our team have been thinking about this problem for a while. If
> > there are a lot of files that checkpoint has to loop over and remove,
> > IMO, that task can be delegated to someone else (maybe a background
> > worker called background cleaner or bg cleaner, of course, we can have
> > a GUC to enable or disable it). The checkpoint can just write some
> Right. IMO it isn't optimal to have critical things like startup and
> checkpointing depend on somewhat-unrelated tasks. I understand the
> desire to avoid adding additional processes, and maybe it is a bigger
> hammer than what is necessary to reduce the impact, but it seemed like
> a natural solution for this problem. That being said, I'm all for
> exploring other ways to handle this.
Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.
I'm open to other ideas, too.
> > Another idea could be to parallelize the checkpoint i.e. IIUC, the
> > tasks that checkpoint do in CheckPointGuts are independent and if we
> > have some counters like (how many snapshot/mapping files that the
> > server generated)
> Could you elaborate on this? Is your idea that the checkpointer would
> create worker processes like autovacuum does?
Yes, I was thinking that the checkpointer creates one or more dynamic
background workers (we can assume one background worker for now) to
delete the files. If the number of files crosses a threshold (for
example, the snapshot file count exceeds it), a new worker is spawned,
which then enumerates the files and deletes the unneeded ones, while the
checkpointer proceeds with its other tasks and finishes the
checkpoint. Having said this, I prefer the background cleaner
approach over the dynamic background worker. The advantage of the
background cleaner is that it can do other tasks (like other kinds
of file deletion).
Another idea could be to use the existing background writer to do
the file deletion while the checkpoint is happening. But again, this
might cause problems because the bg writer's flushing of dirty buffers
would get delayed.
Regards,
Bharath Rupireddy.
On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> > On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > +1 for the overall idea of making the checkpoint faster. In fact, we
> > > here at our team have been thinking about this problem for a while. If
> > > there are a lot of files that checkpoint has to loop over and remove,
> > > IMO, that task can be delegated to someone else (maybe a background
> > > worker called background cleaner or bg cleaner, of course, we can have
> > > a GUC to enable or disable it). The checkpoint can just write some
> > Right. IMO it isn't optimal to have critical things like startup and
> > checkpointing depend on somewhat-unrelated tasks. I understand the
> > desire to avoid adding additional processes, and maybe it is a bigger
> > hammer than what is necessary to reduce the impact, but it seemed like
> > a natural solution for this problem. That being said, I'm all for
> > exploring other ways to handle this.
> Having a generic background cleaner process (controllable via a few
> GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
> temp files etc.) or some other task on behalf of the checkpointer,
> seems to be the easiest solution.
> I'm open to other ideas, too.
I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.
Nathan
On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> > > On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > > +1 for the overall idea of making the checkpoint faster. In fact, we
> > > > here at our team have been thinking about this problem for a while. If
> > > > there are a lot of files that checkpoint has to loop over and remove,
> > > > IMO, that task can be delegated to someone else (maybe a background
> > > > worker called background cleaner or bg cleaner, of course, we can have
> > > > a GUC to enable or disable it). The checkpoint can just write some
> > > Right. IMO it isn't optimal to have critical things like startup and
> > > checkpointing depend on somewhat-unrelated tasks. I understand the
> > > desire to avoid adding additional processes, and maybe it is a bigger
> > > hammer than what is necessary to reduce the impact, but it seemed like
> > > a natural solution for this problem. That being said, I'm all for
> > > exploring other ways to handle this.
> > Having a generic background cleaner process (controllable via a few
> > GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
> > temp files etc.) or some other task on behalf of the checkpointer,
> > seems to be the easiest solution.
> > I'm open to other ideas, too.
> I might hack something together for the separate worker approach, if
> for no other reason than to make sure I really understand how these
> functions work. If/when a better idea emerges, we can alter course.
Thanks. As I said upthread, we've been discussing internally for quite
some time the approach of offloading some of the checkpoint tasks (like
deleting snapshot files), and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files; if required, it can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-background-cleaner-to-offload-checkpoint-tasks.patch (application/octet-stream)
From 4735db532aa818a9e3958ccc79229044fdfc7069 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 6 Dec 2021 11:39:36 +0000
Subject: [PATCH v1] background cleaner to offload checkpoint tasks
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/bgcleaner.c | 415 ++++++++++++++++++++
src/backend/postmaster/postmaster.c | 34 +-
src/backend/replication/logical/logical.c | 40 ++
src/backend/replication/logical/snapbuild.c | 8 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc.c | 37 +-
src/include/miscadmin.h | 1 +
src/include/postmaster/bgcleaner.h | 32 ++
src/include/replication/logical.h | 6 +
src/include/utils/wait_event.h | 1 +
12 files changed, 579 insertions(+), 2 deletions(-)
create mode 100644 src/backend/postmaster/bgcleaner.c
create mode 100644 src/include/postmaster/bgcleaner.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 787c6a2c3b..f55903dd1a 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
autovacuum.o \
auxprocess.o \
+ bgcleaner.o \
bgworker.o \
bgwriter.o \
checkpointer.o \
diff --git a/src/backend/postmaster/bgcleaner.c b/src/backend/postmaster/bgcleaner.c
new file mode 100644
index 0000000000..14d98e48eb
--- /dev/null
+++ b/src/backend/postmaster/bgcleaner.c
@@ -0,0 +1,415 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgcleaner.c
+ *
+ * The background cleaner (bgcleaner) process removes unneeded replication
+ * slot files (.snap). This is to offload the checkpoint responsibility so
+ * that the checkpoint (and so the recovery) can be faster.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/bgcleaner.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgcleaner.h"
+#include "postmaster/fork_process.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/postmaster.h"
+#include "replication/logical.h"
+#include "storage/dsm.h"
+#include "storage/fd.h"
+#include "storage/pg_shmem.h"
+#include "utils/guc.h"
+#include "utils/ps_status.h"
+
+/*
+ * GUC parameters
+ */
+bool BgCleanerEnable = true;
+bool BgCleanerStopProcessingFiles = false;
+int BgCleanerDelay = 180;
+
+#ifdef EXEC_BACKEND
+static pid_t bgcleaner_forkexec(void);
+#endif
+
+NON_EXEC_STATIC void BackgroundCleanerMain(int argc, char *argv[]) pg_attribute_noreturn();
+
+static void CheckForBgCleanerInterrupts(void);
+static void ProcessSnapshotCutoffFiles(void);
+static int RemoveSnapShotFiles(XLogRecPtr cutoff);
+
+#ifdef EXEC_BACKEND
+/*
+ * bgcleaner_forkexec() -
+ *
+ * Format up the arglist for, then fork and exec, bgcleaner process
+ */
+static pid_t
+bgcleaner_forkexec(void)
+{
+ char *av[10];
+ int ac = 0;
+
+ av[ac++] = "postgres";
+ av[ac++] = "--forkbgcleaner";
+ av[ac++] = NULL; /* filled in by postmaster_forkexec */
+
+ av[ac] = NULL;
+ Assert(ac < lengthof(av));
+
+ return postmaster_forkexec(ac, av);
+}
+#endif /* EXEC_BACKEND */
+
+/*
+ * Called from postmaster at startup or after an existing bgcleaner died.
+ * Attempt to fire up a fresh bgcleaner.
+ *
+ * Returns PID of child process, or 0 if fail.
+ *
+ * Note: if fail, we will be called again from the postmaster main loop.
+ */
+int
+BgCleanerStart(void)
+{
+ pid_t pid;
+
+#ifdef EXEC_BACKEND
+ switch ((pid = bgcleaner_forkexec()))
+#else
+ switch ((pid = fork_process()))
+#endif
+ {
+ case -1:
+ ereport(LOG,
+ (errmsg("could not fork background cleaner: %m")));
+ return 0;
+
+#ifndef EXEC_BACKEND
+ case 0:
+ /* in postmaster child ... */
+ InitPostmasterChild();
+
+ /* Close the postmaster's sockets */
+ ClosePostmasterPorts(false);
+
+ /* Drop our connection to postmaster's shared memory, as well */
+ dsm_detach_all();
+ PGSharedMemoryDetach();
+
+ BackgroundCleanerMain(0, NULL);
+ break;
+#endif
+
+ default:
+ return (int) pid;
+ }
+
+ /* shouldn't get here */
+ return 0;
+}
+
+/*
+ * Main entry point for bgcleaner process
+ *
+ * argc/argv parameters are valid only in EXEC_BACKEND case.
+ */
+NON_EXEC_STATIC void
+BackgroundCleanerMain(int argc, char *argv[])
+{
+ int n_wait = 0;
+ bool msg_logged = false;
+
+ /*
+ * Ignore all signals usually bound to some action in the postmaster,
+ * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1
+ * handler to support latch operations, because we only use a local latch.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SIG_IGN);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, SIG_IGN);
+ pqsignal(SIGUSR2, SIG_IGN);
+ pqsignal(SIGCHLD, SIG_DFL);
+ pqsignal(SIGTTIN, SIG_DFL);
+ pqsignal(SIGTTOU, SIG_DFL);
+ pqsignal(SIGCONT, SIG_DFL);
+ pqsignal(SIGWINCH, SIG_DFL);
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ MyBackendType = B_BG_CLEANER;
+ init_ps_display(NULL);
+
+ /*
+ * Loop until we get SIGQUIT, SIGTERM or detect ungraceful death of
+ * parent postmaster.
+ */
+ for (;;)
+ {
+ int rc;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ CheckForBgCleanerInterrupts();
+
+ /* Do the main work */
+ if (!BgCleanerStopProcessingFiles)
+ {
+ if (msg_logged)
+ {
+ /* we were told to start processing files */
+ elog(LOG, "background cleaner started file processing as parameter \"%s\" is set to off",
+ "bgcleaner_stop_processing_files");
+ msg_logged = false;
+ }
+
+ ProcessSnapshotCutoffFiles();
+ }
+ else if (BgCleanerStopProcessingFiles && !msg_logged)
+ {
+ /* we were told to stop processing files */
+ elog(LOG, "background cleaner stopped file processing as parameter \"%s\" is set to on",
+ "bgcleaner_stop_processing_files");
+ msg_logged = true;
+ }
+
+ /*
+ * Sleep until we are signaled or BgCleanerDelay has elapsed.
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ BgCleanerDelay * 1000L /* convert to ms */ ,
+ WAIT_EVENT_BGCLEANER_MAIN);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ break;
+
+ n_wait++;
+
+ /*
+ * Emit a log message after every 5 rounds of sleep to indicate the
+ * process is active.
+ */
+ if (n_wait == 5 && !BgCleanerStopProcessingFiles)
+ {
+ elog(LOG, "background cleaner is running with %d seconds of sleep time between rounds",
+ BgCleanerDelay);
+
+ n_wait = 0;
+ }
+ }
+
+ exit(0);
+}
+
+/*
+ * Read snapshot cutoff file names written by checkpointer to get the cutoff
+ * LSN and remove the unneeded snapshot files.
+ */
+static void
+ProcessSnapshotCutoffFiles(void)
+{
+ DIR *dir;
+ struct dirent *cutoff_de;
+
+ dir = AllocateDir("pg_logical");
+ while ((cutoff_de = ReadDir(dir, "pg_logical")) != NULL)
+ {
+ char path[MAXPGPATH + 11];
+ uint32 hi;
+ uint32 lo;
+ XLogRecPtr cutoff = InvalidXLogRecPtr;
+ XLogRecPtr prev_cutoff = InvalidXLogRecPtr;
+ struct stat statbuf;
+ int res;
+
+ CheckForBgCleanerInterrupts();
+
+ /* see if we were told to stop processing files */
+ if (BgCleanerStopProcessingFiles)
+ {
+ elog(LOG, "background cleaner is stopping file processing at cutoff LSN: %X/%X as parameter \"%s\" is set to on",
+ LSN_FORMAT_ARGS(prev_cutoff),
+ "bgcleaner_stop_processing_files");
+ return;
+ }
+
+ if (strcmp(cutoff_de->d_name, ".") == 0 ||
+ strcmp(cutoff_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/%s", cutoff_de->d_name);
+
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ {
+ elog(DEBUG2, "only regular files expected: %s", cutoff_de->d_name);
+ continue;
+ }
+
+ /*
+ * We just log a message if a file doesn't fit the pattern; it's
+ * probably some editor's lock/state file or similar...
+ */
+ if (sscanf(cutoff_de->d_name, "snapshot_cutoff_%X-%X", &hi, &lo) != 2)
+ {
+ elog(DEBUG2, "could not parse file name: %s", cutoff_de->d_name);
+ continue;
+ }
+
+ prev_cutoff = cutoff;
+ cutoff = ((uint64) hi) << 32 | lo;
+ elog(DEBUG2, "replication slots cutoff LSN: %X/%X",
+ LSN_FORMAT_ARGS(cutoff));
+
+ res = RemoveSnapShotFiles(cutoff);
+
+ if (res == 0)
+ {
+ /* remove the file */
+ if (unlink(path) < 0)
+ {
+ elog(LOG, "could not remove snapshot cutoff file: %s",
+ cutoff_de->d_name);
+ continue;
+ }
+
+ elog(DEBUG1, "removed snapshot cutoff file: %s", cutoff_de->d_name);
+ }
+ else if (res < 0)
+ {
+ elog(DEBUG2, "retained snapshot cutoff file: %s", cutoff_de->d_name);
+ continue;
+ }
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Remove unneeded snapshot files, i.e. files with LSN < cutoff LSN.
+ *
+ * Return value -1 indicates that the caller must keep the snapshot cutoff file.
+ *
+ * Return value 0 indicates that the caller can delete the snapshot cutoff
+ * file.
+ */
+static int
+RemoveSnapShotFiles(XLogRecPtr cutoff)
+{
+ DIR *dir;
+ struct dirent *snap_de;
+ int res = 0;
+ uint32 snap_files_deleted = 0;
+
+ dir = AllocateDir("pg_logical/snapshots");
+ while ((snap_de = ReadDir(dir, "pg_logical/snapshots")) != NULL)
+ {
+ char path[MAXPGPATH + 21];
+ uint32 hi;
+ uint32 lo;
+ XLogRecPtr lsn;
+ struct stat statbuf;
+
+ CheckForBgCleanerInterrupts();
+
+ /* see if we were told to stop processing files */
+ if (BgCleanerStopProcessingFiles)
+ {
+ elog(LOG, "background cleaner is stopping file processing at cutoff LSN: %X/%X as parameter \"%s\" is set to on",
+ LSN_FORMAT_ARGS(cutoff),
+ "bgcleaner_stop_processing_files");
+ return -1;
+ }
+
+ if (strcmp(snap_de->d_name, ".") == 0 ||
+ strcmp(snap_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/snapshots/%s", snap_de->d_name);
+
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /*
+ * We just log a message if a file doesn't fit the pattern; it's
+ * probably some editor's lock/state file or similar...
+ */
+ if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
+ {
+ elog(DEBUG2, "could not parse file name: %s", snap_de->d_name);
+ continue;
+ }
+
+ lsn = ((uint64) hi) << 32 | lo;
+
+ /* check whether we still need it */
+ if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
+ {
+ /* remove the file */
+ if (unlink(path) < 0)
+ {
+ elog(LOG, "could not remove snapshot file: %s",
+ snap_de->d_name);
+ res = -1;
+ continue;
+ }
+
+ snap_files_deleted++;
+ elog(DEBUG1, "removed snapshot file: %s", snap_de->d_name);
+ }
+ else
+ {
+ elog(DEBUG2, "retained snapshot file: %s", snap_de->d_name);
+ continue;
+ }
+ }
+ FreeDir(dir);
+
+ if (snap_files_deleted > 0)
+ ereport(LOG,
+ (errmsg_plural("removed %u snapshot file with cutoff LSN %X/%X",
+ "removed %u snapshot files with cutoff LSN %X/%X",
+ snap_files_deleted,
+ snap_files_deleted,
+ LSN_FORMAT_ARGS(cutoff))));
+ return res;
+}
+
+static void
+CheckForBgCleanerInterrupts(void)
+{
+ if (ShutdownRequestPending)
+ exit(0);
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 328ecafa8c..49325570e5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -110,6 +110,7 @@
#include "port/pg_bswap.h"
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
+#include "postmaster/bgcleaner.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
@@ -255,7 +256,8 @@ static pid_t StartupPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
PgStatPID = 0,
- SysLoggerPID = 0;
+ SysLoggerPID = 0,
+ BgCleanerPID = 0;
/* Startup process's status */
typedef enum
@@ -1459,6 +1461,8 @@ PostmasterMain(int argc, char *argv[])
CheckpointerPID = StartCheckpointer();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
+ if (BgCleanerPID == 0 && BgCleanerEnable)
+ BgCleanerPID = BgCleanerStart();
/*
* We're ready to rock and roll...
@@ -1828,6 +1832,8 @@ ServerLoop(void)
CheckpointerPID = StartCheckpointer();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
+ if (BgCleanerPID == 0 && BgCleanerEnable)
+ BgCleanerPID = BgCleanerStart();
}
/*
@@ -2794,6 +2800,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(SysLoggerPID, SIGHUP);
if (PgStatPID != 0)
signal_child(PgStatPID, SIGHUP);
+ if (BgCleanerPID != 0)
+ signal_child(BgCleanerPID, SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -3111,6 +3119,8 @@ reaper(SIGNAL_ARGS)
CheckpointerPID = StartCheckpointer();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
+ if (BgCleanerPID == 0 && BgCleanerEnable)
+ BgCleanerPID = BgCleanerStart();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
@@ -3225,6 +3235,22 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the bgcleaner? If so, just try to start a new one; no need
+ * to force reset of the rest of the system. (If fail, we'll try again
+ * in future cycles of the main loop.)
+ */
+ if (pid == BgCleanerPID)
+ {
+ BgCleanerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ LogChildExit(LOG, _("background cleaner process"),
+ pid, exitstatus);
+ if (BgCleanerPID == 0 && BgCleanerEnable)
+ BgCleanerPID = BgCleanerStart();
+ continue;
+ }
+
/*
* Was it the wal receiver? If exit status is zero (normal) or one
* (FATAL exit), we assume everything is all right just like normal
@@ -5175,6 +5201,12 @@ SubPostmasterMain(int argc, char *argv[])
SysLoggerMain(argc, argv); /* does not return */
}
+ if (strcmp(argv[1], "--forkbgcleaner") == 0)
+ {
+ /* Do not want to attach to shared memory */
+
+ BackgroundCleanerMain(argc, argv); /* does not return */
+ }
abort(); /* shouldn't get here */
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 10cbdea124..7b947a7428 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -28,6 +28,8 @@
#include "postgres.h"
+#include <unistd.h>
+
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "fmgr.h"
@@ -1842,3 +1844,41 @@ UpdateDecodingStats(LogicalDecodingContext *ctx)
rb->totalTxns = 0;
rb->totalBytes = 0;
}
+
+void
+CreateReplicationCleanupFile(ReplCleanupFileKind kind, XLogRecPtr cutoff_lsn)
+{
+ int fd;
+ /*
+ * 27 is the size of the fixed path name "pg_logical/snapshot_cutoff_".
+ * XXX: Dynamically allocate memory for the path variable if this
+ * function is ever changed to deal with other kinds of files with
+ * different fixed path names.
+ */
+ char path[MAXPGPATH + 27];
+
+ Assert(kind == REPL_CLEANUP_FILE_SNAPSHOT);
+
+ MemSet(path, '\0', sizeof(path));
+
+ /* create a file named snapshot_cutoff_<hi>-<lo> */
+ snprintf(path, sizeof(path), "pg_logical/snapshot_cutoff_%X-%X",
+ LSN_FORMAT_ARGS(cutoff_lsn));
+
+ /*
+ * We don't need the O_EXCL flag, as it might cause a FATAL error if the
+ * file already exists. It is the responsibility of the cleanup program
+ * to delete the previous files.
+ */
+ fd = BasicOpenFile(path, O_RDWR | O_CREAT);
+ if (fd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", path)));
+
+ /* make sure we persist */
+ fsync_fname(path, false);
+ fsync_fname("pg_logical", true);
+
+ close(fd);
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dbdc172a2b..3933923d52 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -125,6 +125,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/bgcleaner.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -1941,6 +1942,13 @@ CheckPointSnapBuild(void)
if (redo < cutoff)
cutoff = redo;
+ /* do this only if there exists a background cleaner */
+ if (BgCleanerEnable && !BgCleanerStopProcessingFiles)
+ {
+ CreateReplicationCleanupFile(REPL_CLEANUP_FILE_SNAPSHOT, cutoff);
+ return;
+ }
+
snap_dir = AllocateDir("pg_logical/snapshots");
while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
{
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e8..f0fc681e5f 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -215,6 +215,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_AUTOVACUUM_MAIN:
event_name = "AutoVacuumMain";
break;
+ case WAIT_EVENT_BGCLEANER_MAIN:
+ event_name = "BackgroundCleanerMain";
+ break;
case WAIT_EVENT_BGWRITER_HIBERNATE:
event_name = "BgWriterHibernate";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..f734c82229 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -264,6 +264,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_BACKEND:
backendDesc = "client backend";
break;
+ case B_BG_CLEANER:
+ backendDesc = "background cleaner";
+ break;
case B_BG_WORKER:
backendDesc = "background worker";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ee6a838b3a..29eef8f155 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -68,6 +68,9 @@
#include "parser/scansup.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
+#ifdef AZURE_SERVICE_FABRIC
+#include "postmaster/bgcleaner.h"
+#endif
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
@@ -1791,7 +1794,26 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
-
+#ifdef AZURE_SERVICE_FABRIC
+ {
+ {"bgcleaner_enable", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Start a subprocess to remove unneeded replication slot snapshot files."),
+ NULL
+ },
+ &BgCleanerEnable,
+ true,
+ NULL, NULL, NULL
+ },
+ {
+ {"bgcleaner_stop_processing_files", PGC_SIGHUP, DEVELOPER_OPTIONS,
+ gettext_noop("Inform background cleaner to stop processing files."),
+ NULL
+ },
+ &BgCleanerStopProcessingFiles,
+ false,
+ NULL, NULL, NULL
+ },
+#endif
#ifdef TRACE_SORT
{
{"trace_sort", PGC_USERSET, DEVELOPER_OPTIONS,
@@ -3065,6 +3087,19 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+#ifdef AZURE_SERVICE_FABRIC
+ {
+ {"bgcleaner_delay", PGC_SIGHUP, DEVELOPER_OPTIONS,
+ gettext_noop("Background cleaner sleep time between rounds."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &BgCleanerDelay,
+ 180, 60, 86400,
+ NULL, NULL, NULL
+ },
+#endif
+
{
{"effective_io_concurrency",
PGC_USERSET,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..5449f07a80 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -326,6 +326,7 @@ typedef enum BackendType
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
+ B_BG_CLEANER,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
diff --git a/src/include/postmaster/bgcleaner.h b/src/include/postmaster/bgcleaner.h
new file mode 100644
index 0000000000..5bba615ec8
--- /dev/null
+++ b/src/include/postmaster/bgcleaner.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgcleaner.h
+ * Exports from postmaster/bgcleaner.c.
+ *
+ * The bgcleaner process removes unneeded replication slot files (.snap).
+ * This is to offload the checkpoint responsibility so that the checkpoint
+ * (and so the recovery) can be faster.
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgcleaner.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGCLEANER_H
+#define _BGCLEANER_H
+
+/*
+ * GUC parameters
+ */
+extern bool BgCleanerEnable;
+extern bool BgCleanerStopProcessingFiles;
+extern int BgCleanerDelay;
+
+extern int BgCleanerStart(void);
+
+#ifdef EXEC_BACKEND
+extern void BackgroundCleanerMain(int argc, char *argv[]) pg_attribute_noreturn();
+#endif
+
+#endif /* _BGCLEANER_H */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index e0f513b773..9971dc81f7 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -16,6 +16,11 @@
struct LogicalDecodingContext;
+typedef enum ReplCleanupFileKind
+{
+ REPL_CLEANUP_FILE_SNAPSHOT = 1
+} ReplCleanupFileKind;
+
typedef void (*LogicalOutputPluginWriterWrite) (struct LogicalDecodingContext *lr,
XLogRecPtr Ptr,
TransactionId xid,
@@ -140,5 +145,6 @@ extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx,
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
extern void ResetLogicalStreamingState(void);
extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
+extern void CreateReplicationCleanupFile(ReplCleanupFileKind kind, XLogRecPtr cutoff_lsn);
#endif
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c..be90cd0fe1 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -37,6 +37,7 @@ typedef enum
{
WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
WAIT_EVENT_AUTOVACUUM_MAIN,
+ WAIT_EVENT_BGCLEANER_MAIN,
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
--
2.25.1
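The patch above passes the cutoff LSN from the checkpointer to the background cleaner through the file name itself: snapshot_cutoff_<hi>-<lo>, with both halves in hex. The following is a minimal standalone sketch of that round trip and of the removal test, not code from the patch. It assumes a plain uint64_t (lsn_t) in place of XLogRecPtr, treats 0 as the invalid LSN, and the helper names cutoff_file_name, parse_cutoff_file_name and snapshot_is_removable are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/*
 * Format an LSN as "snapshot_cutoff_<hi>-<lo>", with both 32-bit halves
 * in upper-case hex, mirroring the "%X-%X" format used in the patch.
 * Returns the snprintf() result (length of the formatted name).
 */
int
cutoff_file_name(char *buf, size_t buflen, lsn_t lsn)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "snapshot_cutoff_%X-%X",
                    (unsigned int) (lsn >> 32), (unsigned int) lsn);
}

/*
 * Parse a name produced by cutoff_file_name(). Returns false for any
 * name that does not fit the pattern, so stray files can be skipped.
 */
bool
parse_cutoff_file_name(const char *name, lsn_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(name, "snapshot_cutoff_%X-%X", &hi, &lo) != 2)
        return false;

    *lsn = ((lsn_t) hi << 32) | lo;
    return true;
}

/*
 * A snapshot file is removable if its LSN is older than the cutoff, or
 * if the cutoff is invalid (0 here), as in RemoveSnapShotFiles().
 */
bool
snapshot_is_removable(lsn_t snap_lsn, lsn_t cutoff)
{
    return snap_lsn < cutoff || cutoff == 0;
}
```

A caller would format the name on the checkpointer side, and on the cleaner side parse each directory entry and skip any entry for which parse_cutoff_file_name() returns false.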
On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.
Thanks. As I said upthread, we've been discussing the approach of
offloading some checkpoint tasks (like deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files; it can be extended to do other
tasks if required). I don't mind if it gets rejected. Please have a
look.
Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.
Nathan
On 12/6/21, 11:23 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks. As I said upthread, we've been discussing the approach of
offloading some checkpoint tasks (like deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files; it can be extended to do other
tasks if required). I don't mind if it gets rejected. Please have a
look.
Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.
Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:
0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.
This is still very much a work in progress, and I've done minimal
testing so far.
Nathan
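The two-stage scheme described for 0003 and 0004 (rename first, delete later) can be sketched outside PostgreSQL with plain POSIX calls. This is only an illustration under stated assumptions: the names stage_tmp_dir and remove_staged_tmp_dirs are hypothetical, staged directories are assumed to hold only regular files (no nested subdirectories), and error reporting is reduced to return codes. The patches themselves use AllocateDir(), ReadDirExtended() and RemovePgTempDir() instead.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX APIs under strict -std= modes */

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define PATH_BUF        4096
#define TMP_DIR_NAME    "pgsql_tmp"
#define STAGED_PREFIX   "pgsql_tmp_to_remove_"

/*
 * Stage 1 (cheap, done at startup): rename parent/pgsql_tmp to the first
 * unused parent/pgsql_tmp_to_remove_<n>. Returns 0 on success, -1 if
 * there is nothing to stage or no name could be found.
 */
int
stage_tmp_dir(const char *parent)
{
    char        src[PATH_BUF];
    char        dst[PATH_BUF];
    struct stat st;

    assert(parent != NULL);
    snprintf(src, sizeof(src), "%s/%s", parent, TMP_DIR_NAME);
    if (stat(src, &st) != 0)
        return -1;

    for (int n = 0; n < INT_MAX; n++)
    {
        snprintf(dst, sizeof(dst), "%s/%s%d", parent, STAGED_PREFIX, n);
        if (stat(dst, &st) == 0)
            continue;           /* name already taken by an older stage */
        if (errno != ENOENT)
            return -1;
        return rename(src, dst);
    }
    return -1;
}

/*
 * Stage 2 (expensive, done later by a helper process): delete the files
 * in every staged directory under parent, then the directory itself.
 * Returns the number of directories removed, or -1 if parent can't be read.
 */
int
remove_staged_tmp_dirs(const char *parent)
{
    DIR        *dir;
    struct dirent *de;
    int         removed = 0;

    assert(parent != NULL);
    if ((dir = opendir(parent)) == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        char        staged[PATH_BUF];
        char        file[PATH_BUF];
        DIR        *sub;
        struct dirent *sde;

        if (strncmp(de->d_name, STAGED_PREFIX, strlen(STAGED_PREFIX)) != 0)
            continue;

        snprintf(staged, sizeof(staged), "%s/%s", parent, de->d_name);
        if ((sub = opendir(staged)) == NULL)
            continue;
        while ((sde = readdir(sub)) != NULL)
        {
            if (strcmp(sde->d_name, ".") == 0 || strcmp(sde->d_name, "..") == 0)
                continue;
            snprintf(file, sizeof(file), "%s/%s", staged, sde->d_name);
            unlink(file);       /* failures surface when rmdir() fails */
        }
        closedir(sub);
        if (rmdir(staged) == 0)
            removed++;
    }
    closedir(dir);
    return removed;
}
```

The point of the split is that stage_tmp_dir() costs one rename() regardless of how many files the directory holds, so the O(n) unlink loop can run after the server is already accepting connections.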
Attachments:
v1-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (application/octet-stream)
From bc755b58a982956e2a494fab91a49c0142f84030 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v1 4/5] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 13 ++++++++++++-
src/backend/postmaster/postmaster.c | 14 +++++++++-----
src/backend/storage/file/fd.c | 22 ++++++++++++++++------
3 files changed, 37 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 0ba59949bb..a5443f9a21 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -191,7 +191,18 @@ CustodianMain(void)
start_time = (pg_time_t) time(NULL);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1ae2dc179e..b098482496 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1391,9 +1391,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4139,12 +4141,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 633c6eee18..0807f9c590 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -100,6 +100,8 @@
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
@@ -1640,9 +1642,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1842,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3211,6 +3213,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/*
--
2.16.6
v1-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (application/octet-stream)
From 400d3ef141593e02c7703534e8c909f42fad3a2b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v1 3/5] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 176 +++++++++++++++++++++++++++++++-----
src/include/storage/fd.h | 2 +-
3 files changed, 162 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 51613aaa2a..1ae2dc179e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1392,7 +1392,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4140,7 +4141,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 545e91978c..633c6eee18 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,121 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ return;
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 762f6b46c1..85fa987aca 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.16.6
v1-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (application/octet-stream)
From 9b4683c6d3ad798f02ba0d084d26ddc0283d1ef1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v1 2/5] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 635313cdb7..51613aaa2a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1117,7 +1117,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..545e91978c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..762f6b46c1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index bf95a30761..481f1f23a2 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -143,7 +143,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -241,7 +242,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.16.6
Attachment: v1-0001-Introduce-custodian.patch
From e17e79bfe150fa137d345dc7c6848c4e596c2fa4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 15:07:28 -0800
Subject: [PATCH v1 1/5] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 ++
src/backend/postmaster/custodian.c | 210 ++++++++++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++++-
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 17 +++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
11 files changed, 297 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 787c6a2c3b..7ec7b23467 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7452f908b2..c55cc84490 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
InitXLOGAccess();
WalWriterMain();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..0ba59949bb
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,210 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15. Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shutdown quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the custodian to
+ * exit(0). Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 328ecafa8c..635313cdb7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,6 +250,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -556,6 +557,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1819,13 +1821,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2782,6 +2787,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3109,6 +3116,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3211,6 +3220,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3684,6 +3707,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3887,6 +3922,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3924,6 +3962,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -4017,6 +4056,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4222,6 +4262,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..a86a05adb4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e8..530af294d9 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..90c4160d42 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..83089d23ff 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -329,6 +329,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -433,6 +434,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -445,6 +447,7 @@ extern AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..e8ac2ad3dd
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index cfabfdbedf..1fc4599941 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -357,6 +357,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* The proc of the Startup process, since not in ProcArray */
@@ -377,11 +379,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c..08dc9d5caa 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_PGSTAT_MAIN,
--
2.16.6
Attachment: v1-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
From b1014d699b82e5eb5bbe05a1b3b294577df1f77c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v1 5/5] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/replication/snapbuild.h | 2 +-
4 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..2fcca38c23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -56,7 +56,6 @@
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -9602,7 +9601,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a5443f9a21..b088cdb0b8 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -38,6 +38,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/proc.h"
@@ -204,6 +205,16 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dbdc172a2b..0fa7d822e1 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1912,14 +1912,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 82aa86125b..ba7276058d 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.16.6
On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:
Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:
0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.
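The two-stage cleanup described in 0003 and 0004 can be sketched outside of PostgreSQL as follows. This is only a minimal illustration using plain POSIX calls, not the code from the patches: stage_temp_dir(), remove_staged_temp_dirs(), and the STAGED_PREFIX value are hypothetical names, and unlike RemovePgTempDir() the removal step here does not recurse into subdirectories.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical prefix for staged directories; not the name used by the patch. */
#define STAGED_PREFIX "pgsql_tmp_staged_"

/*
 * Stage 1 (cheap, done at startup): rename parent/pgsql_tmp out of the way.
 * A single rename() does not depend on how many files the directory holds.
 * Returns 0 on success, -1 on failure.
 */
int
stage_temp_dir(const char *parent, unsigned int suffix)
{
    char from[1024];
    char to[1024];

    snprintf(from, sizeof(from), "%s/pgsql_tmp", parent);
    snprintf(to, sizeof(to), "%s/%s%u", parent, STAGED_PREFIX, suffix);
    return rename(from, to);
}

/* Unlink every entry in one directory, then remove the directory itself. */
static int
remove_flat_dir(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    char file[2048];

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(file, sizeof(file), "%s/%s", path, de->d_name);
        (void) unlink(file);    /* a leftover entry makes rmdir() fail below */
    }
    closedir(dir);
    return rmdir(path);
}

/*
 * Stage 2 (O(n), can run later in another process): remove every staged
 * directory found under parent.  Returns the number of directories removed,
 * or -1 if parent cannot be opened.
 */
int
remove_staged_temp_dirs(const char *parent)
{
    DIR *dir = opendir(parent);
    struct dirent *de;
    char path[1024];
    int removed = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        if (strncmp(de->d_name, STAGED_PREFIX, strlen(STAGED_PREFIX)) != 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", parent, de->d_name);
        if (remove_flat_dir(path) == 0)
            removed++;
    }
    closedir(dir);
    return removed;
}
```

The point of the split is that only stage_temp_dir() needs to happen before new temporary files can be created; remove_staged_temp_dirs() can be deferred without blocking startup.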
I don't know whether this kind of idea is good or not.
One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.
Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.
But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:
On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:
Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:
0001 - Adds a new "custodian" auxiliary process that does nothing.
...
I don't know whether this kind of idea is good or not.
...
Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.
Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.
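The fallback described above (offload when a worker can be launched, otherwise do the work in the calling process) might look roughly like this. It is a minimal sketch under stated assumptions: offload_or_run_inline(), the launch_fn and task_fn callbacks, and the stub launchers are all hypothetical, and no real PostgreSQL background-worker API is called here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types: one tries to start a worker, one does the task. */
typedef bool (*launch_fn) (void);
typedef void (*task_fn) (void *arg);

enum cleanup_path
{
    CLEANUP_OFFLOADED,          /* a worker was launched to do the task */
    CLEANUP_INLINE              /* the caller had to do the task itself */
};

/*
 * Try to hand the task to a worker.  If workers are disabled or the launch
 * fails, run the task in the calling process so it is never skipped.
 */
enum cleanup_path
offload_or_run_inline(bool workers_enabled, launch_fn try_launch,
                      task_fn task, void *arg)
{
    if (workers_enabled && try_launch != NULL && try_launch())
        return CLEANUP_OFFLOADED;

    task(arg);
    return CLEANUP_INLINE;
}

/* Stub launchers and a counting task, for illustration only. */
bool
launch_always_succeeds(void)
{
    return true;
}

bool
launch_always_fails(void)
{
    return false;
}

void
count_task(void *arg)
{
    (*(int *) arg)++;
}
```

With this shape the caller still pays the full O(n) cost whenever no worker is available, so it bounds the number of processes rather than the worst-case checkpoint time.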
--
Justin
On 12/13/21, 5:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
I don't know whether this kind of idea is good or not.
Thanks for chiming in. I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.
One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.
Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup. For now,
all of these tasks are somewhat nonessential. There's no requirement
that any of these tasks complete in order to finish startup or
checkpointing. In fact, outside of preventing the server from running
out of disk space, I don't think there's any requirement that these
tasks run at all. IMO this would have to be a core tenet of a new
auxiliary process like this.
That being said, I totally understand your point. If there were a
dozen such tasks handled by a single auxiliary process, that could
cause a new set of problems. Your checkpointing and startup might be
fast, but you might run out of disk space because our cleanup process
can't handle it all. So a new worker could end up becoming an
availability risk as well.
Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.
I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges,
I will readily abandon my current approach. Upthread, Andres
mentioned optimizing unnecessary snapshot files, and I mentioned
possibly limiting how much time startup and checkpoints spend on these
tasks. I don't have too many details for the former, and for the
latter, I'm worried about not being able to keep up. But if the
prospect of adding a new auxiliary process for this stuff is a non-
starter, perhaps I should explore that approach some more.
But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.
Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.
Nathan
On 12/13/21, 9:20 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:
On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:
Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.
Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.
I think this is an interesting idea. I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns
about resource usage, maybe it's worth it.
Nathan
On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.
Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.
Well sure. But I've never actually seen that happen.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.
Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.
Well sure. But I've never actually seen that happen.
I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.
Nathan
[0]: /messages/by-id/E7573D54-A8C9-40A8-89D7-0596A36ED124@amazon.com
[1]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com
On Mon, Dec 13, 2021 at 11:05:46PM +0000, Bossart, Nathan wrote:
On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.
Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.
Well sure. But I've never actually seen that happen.
I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.
Have we changed temporary file handling in any recent major releases?
That is, is this a current problem or one already improved in PG 14?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote:
Have we changed temporary file handling in any recent major releases?
That is, is this a current problem or one already improved in PG 14?
I haven't noticed any recent improvements while working in this area.
Nathan
On 12/14/21, 12:09 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote:
Have we changed temporary file handling in any recent major releases?
That is, is this a current problem or one already improved in PG 14?
I haven't noticed any recent improvements while working in this area.
On second thought, the addition of the remove_temp_files_after_crash
GUC is arguably an improvement since it could prevent files from
accumulating after repeated crashes.
Nathan
On 12/13/21, 10:21 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
Thanks for chiming in. I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.
As promised, here is v2. This patch set includes handling for all
four tasks noted upthread. I'd still consider this a work-in-
progress, as I've done minimal testing. At the very least, it should
demonstrate what an auxiliary process approach might look like.
Nathan
Attachments:
v2-0001-Introduce-custodian.patch
From 5b8133d2707d9da843ac3bb0561b9535ed40675a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 15:07:28 -0800
Subject: [PATCH v2 1/8] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 ++
src/backend/postmaster/custodian.c | 210 ++++++++++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++++-
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 17 +++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
11 files changed, 297 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 787c6a2c3b..7ec7b23467 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 43497676ab..10626e7029 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..0ba59949bb
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,210 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15. Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the custodian to
+ * exit(0). Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 328ecafa8c..635313cdb7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,6 +250,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -556,6 +557,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1819,13 +1821,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2782,6 +2787,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3109,6 +3116,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3211,6 +3220,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3684,6 +3707,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3887,6 +3922,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3924,6 +3962,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -4017,6 +4056,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4222,6 +4262,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..a86a05adb4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e8..530af294d9 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..90c4160d42 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..83089d23ff 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -329,6 +329,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -433,6 +434,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -445,6 +447,7 @@ extern AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..e8ac2ad3dd
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index cfabfdbedf..1fc4599941 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -357,6 +357,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* The proc of the Startup process, since not in ProcArray */
@@ -377,11 +379,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c..08dc9d5caa 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_PGSTAT_MAIN,
--
2.16.6
v2-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch
From 590fc3d96320fa701662a5e4af7713a8daeb9a46 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Tue, 14 Dec 2021 18:40:12 +0000
Subject: [PATCH v2 8/8] Move removal of spilled logical slot data to
custodian.
If there are many such files, startup can take much longer than
necessary. To handle this, startup creates a new slot directory,
copies the state file, and swaps the new directory with the old
one. The custodian then asynchronously cleans up the old slot
directory.
---
src/backend/access/transam/xlog.c | 15 +-
src/backend/postmaster/custodian.c | 14 ++
src/backend/replication/logical/reorderbuffer.c | 292 +++++++++++++++++++++++-
src/backend/replication/slot.c | 4 +
src/include/replication/reorderbuffer.h | 1 +
5 files changed, 317 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5fb8ac9483..bd3c671988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7156,18 +7156,21 @@ StartupXLOG(void)
checkPoint.newestCommitTsXid);
XLogCtl->ckptFullXid = checkPoint.nextXid;
- /*
- * Initialize replication slots, before there's a chance to remove
- * required resources.
- */
- StartupReplicationSlots();
-
/*
* Startup logical state, needs to be setup now so we have proper data
* during crash recovery.
+ *
+ * NB: This also performs some important cleanup that must be done prior to
+ * other replication slot steps (e.g., StartupReplicationSlots()).
*/
StartupReorderBuffer();
+ /*
+ * Initialize replication slots, before there's a chance to remove
+ * required resources.
+ */
+ StartupReplicationSlots();
+
/*
* Startup CLOG. This must be done after ShmemVariableCache->nextXid has
* been initialized and before we accept connections or begin WAL replay.
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e3318878d0..9bfad80794 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -39,6 +39,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -206,6 +207,19 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove any replication slot directories that have been staged for
+ * deletion. Since slot directories can accumulate many files, removing
+ * all of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the slot directories, and we clean them up here.
+ *
+ * Replication slot directories are not staged or cleaned in single-user
+ * mode, so we don't need any extra handling outside of the custodian
+ * process for this.
+ */
+ RemoveStagedSlotDirectories();
+
/*
* Remove serialized snapshots that are no longer required by any
* logical replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7aa5647a2c..a43e98d848 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -91,15 +91,19 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "catalog/catalog.h"
+#include "common/string.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/slot.h"
#include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */
#include "storage/bufmgr.h"
+#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/combocid.h"
@@ -255,12 +259,15 @@ static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared);
static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+static void ReorderBufferCleanup(const char *slotname);
static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
TransactionId xid, XLogSegNo segno);
static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
ReorderBufferTXN *txn, CommandId cid);
+static void StageSlotDirForRemoval(const char *slotname, const char *slotpath);
+static void RemoveStagedSlotDirectory(const char *path);
/*
* ---------------------------------------
@@ -4430,6 +4437,202 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
FreeDir(spill_dir);
}
+/*
+ * Cleanup everything in the logical slot directory except for the "state" file.
+ * This is specially written for StartupReorderBuffer(), which has special logic
+ * to handle crashes at inconvenient times.
+ *
+ * NB: If anything except for the "state" file cannot be removed after startup,
+ * this will need to be updated.
+ */
+static void
+ReorderBufferCleanup(const char *slotname)
+{
+ char path[MAXPGPATH];
+ char newpath[MAXPGPATH];
+ char statepath[MAXPGPATH];
+ char newstatepath[MAXPGPATH];
+ struct stat statbuf;
+
+ sprintf(path, "pg_replslot/%s", slotname);
+ sprintf(newpath, "pg_replslot/%s.new", slotname);
+ sprintf(statepath, "pg_replslot/%s/state", slotname);
+ sprintf(newstatepath, "pg_replslot/%s.new/state", slotname);
+
+ /* we're only handling directories here, skip if it's not ours */
+ if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+ return;
+
+ /*
+ * Build our new slot directory, suffixed with ".new". The caller (likely
+ * StartupReorderBuffer()) should have already ensured that any pre-existing
+ * ".new" directories leftover after a crash have been cleaned up.
+ */
+ if (MakePGDirectory(newpath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", newpath)));
+
+ copy_file(statepath, newstatepath);
+
+ fsync_fname(newstatepath, false);
+ fsync_fname(newpath, true);
+ fsync_fname("pg_replslot", true);
+
+ /*
+ * Move the slot directory aside for cleanup by the custodian. After this
+ * step, there will be no slot directory. StartupReorderBuffer() has
+ * special logic to make sure we don't lose the slot if we crash at this
+ * point.
+ */
+ StageSlotDirForRemoval(slotname, path);
+
+ /*
+ * Move our ".new" directory to become our new slot directory.
+ */
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ newpath, path)));
+
+ fsync_fname(path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * This function renames the given directory with a special suffix that the
+ * custodian will know to look for. An integer is appended to the end of the
+ * new directory name in case previously staged slot directories have not yet
+ * been removed.
+ */
+static void
+StageSlotDirForRemoval(const char *slotname, const char *slotpath)
+{
+ char stage_path[MAXPGPATH];
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ DIR *dir;
+
+ sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ ereport(ERROR,
+ (errmsg("could not stage \"%s\" for deletion",
+ slotpath)));
+
+ /*
+ * Rename the slot directory.
+ */
+ if (rename(slotpath, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ slotpath, stage_path)));
+
+ fsync_fname(stage_path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * Remove slot directories that have been staged for deletion by
+ * ReorderBufferCleanup().
+ */
+void
+RemoveStagedSlotDirectories(void)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir("pg_replslot");
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, "pg_replslot")) != NULL)
+ {
+ struct stat st;
+ char path[MAXPGPATH];
+
+ if (strstr(de->d_name, ".to_remove") == NULL)
+ continue;
+
+ sprintf(path, "pg_replslot/%s", de->d_name);
+ if (lstat(path, &st) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+
+ if (!S_ISDIR(st.st_mode))
+ continue;
+
+ RemoveStagedSlotDirectory(path);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Removes one slot directory that has been staged for deletion by
+ * ReorderBufferCleanup(). If a shutdown request is pending, exit as soon as
+ * possible.
+ */
+static void
+RemoveStagedSlotDirectory(const char *path)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(path);
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, path)) != NULL)
+ {
+ struct stat st;
+ char filepath[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ sprintf(filepath, "%s/%s", path, de->d_name);
+
+ if (lstat(filepath, &st) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", filepath)));
+ else if (S_ISDIR(st.st_mode))
+ RemoveStagedSlotDirectory(filepath);
+ else if (unlink(filepath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", filepath)));
+ }
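+ FreeDir(dir);
+
+ /*
+ * If we stopped early because of a shutdown request, the directory may
+ * not be empty yet. Leave it in place so that a later call to
+ * RemoveStagedSlotDirectories() can pick it up again.
+ */
+ if (ShutdownRequestPending)
+ return;
+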
+ if (rmdir(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m", path)));
+}
+
/*
* Given a replication slot, transaction ID and segment number, fill in the
* corresponding spill file into 'path', which is a caller-owned buffer of size
@@ -4458,6 +4661,83 @@ StartupReorderBuffer(void)
DIR *logical_dir;
struct dirent *logical_de;
+ /*
+ * First, handle any ".new" directories that were left over after a crash.
+ * These are created and swapped with the actual replication slot
+ * directories so that cleanup of spilled data can be done asynchronously by
+ * the custodian.
+ */
+ logical_dir = AllocateDir("pg_replslot");
+ while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
+ {
+ char name[NAMEDATALEN];
+ char path[NAMEDATALEN + 12];
+ struct stat statbuf;
+
+ if (strcmp(logical_de->d_name, ".") == 0 ||
+ strcmp(logical_de->d_name, "..") == 0)
+ continue;
+
+ /*
+ * Make sure it's a valid ".new" directory.
+ */
+ if (!pg_str_endswith(logical_de->d_name, ".new") ||
+ strlen(logical_de->d_name) >= NAMEDATALEN + 4)
+ continue;
+
+ strncpy(name, logical_de->d_name, sizeof(name));
+ name[strlen(logical_de->d_name) - 4] = '\0';
+ if (!ReplicationSlotValidateName(name, DEBUG2))
+ continue;
+
+ sprintf(path, "pg_replslot/%s", name);
+ if (lstat(path, &statbuf) == 0)
+ {
+ char newpath[NAMEDATALEN + 16];
+
+ if (!S_ISDIR(statbuf.st_mode))
+ continue;
+
+ /*
+ * If the original directory still exists, just delete the ".new"
+ * directory. We'll try again when we call ReorderBufferCleanup()
+ * later on.
+ */
+ sprintf(newpath, "pg_replslot/%s", logical_de->d_name);
+ if (!rmtree(newpath, true))
+ ereport(ERROR,
+ (errmsg("could not remove directory \"%s\"", newpath)));
+ }
+ else if (errno == ENOENT)
+ {
+ char newpath[NAMEDATALEN + 16];
+
+ /*
+ * If the original directory is gone, we need to rename the ".new"
+ * directory to take its place. We know that the ".new" directory
+ * is ready to be the real deal if we previously made it far enough
+ * to delete the original directory.
+ */
+ sprintf(newpath, "pg_replslot/%s", logical_de->d_name);
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ newpath, path)));
+
+ fsync_fname(path, true);
+ }
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+
+ fsync_fname("pg_replslot", true);
+ }
+ FreeDir(logical_dir);
+
+ /*
+ * Now we can proceed with deleting all spilled data. (This actually just
+ * moves the directories aside so that the custodian can clean them up
+ * asynchronously.)
+ */
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{
@@ -4470,12 +4750,18 @@ StartupReorderBuffer(void)
continue;
/*
- * ok, has to be a surviving logical slot, iterate and delete
- * everything starting with xid-*
+ * ok, has to be a surviving logical slot, delete everything except for
+ * state
*/
- ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
+ ReorderBufferCleanup(logical_de->d_name);
}
FreeDir(logical_dir);
+
+ /*
+ * Wake up the custodian so it cleans up our old slot data.
+ */
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/* ---------------------------------------
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..e9dbc18e22 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1430,6 +1430,10 @@ StartupReplicationSlots(void)
continue;
}
+ /* if it's an old slot directory that's staged for removal, ignore it */
+ if (strstr(replication_de->d_name, ".to_remove") != NULL)
+ continue;
+
/* looks like a slot in a normal state, restore */
RestoreSlotFromDisk(replication_de->d_name);
}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b40ff75f7..807b7038d9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -681,5 +681,6 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
void StartupReorderBuffer(void);
+void RemoveStagedSlotDirectories(void);
#endif
--
2.16.6
v2-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch (application/octet-stream)
From 2d7978947c9891585692a84e5ae236664f60cee0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 13 Dec 2021 20:20:12 -0800
Subject: [PATCH v2 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
shutdown and end-of-recovery checkpoints.
This may save quite a bit of time when there are many mapping files
to flush to disk.
---
src/backend/access/heap/rewriteheap.c | 35 ++++++++++++++++++++++++++++++++++-
src/backend/access/transam/xlog.c | 2 +-
src/include/access/rewriteheap.h | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e11f5bfb80..d697d46fbb 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1193,7 +1193,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* ---
*/
void
-CheckPointLogicalRewriteHeap(void)
+CheckPointLogicalRewriteHeap(bool shutdown)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1219,6 +1219,39 @@ CheckPointLogicalRewriteHeap(void)
if (ProcGlobal->custodianLatch)
SetLatch(ProcGlobal->custodianLatch);
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
+ if (shutdown)
+ {
+ int fd;
+
+ fd = OpenTransientFile("pg_logical/mappings", O_RDONLY);
+ if (fd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"pg_logical/mappings\": %m")));
+
+ if (syncfs(fd) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not synchronize file system for file \"pg_logical/mappings\": %m")));
+
+ if (CloseTransientFile(fd) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"pg_logical/mappings\": %m")));
+
+ return;
+ }
+
+#endif /* HAVE_SYNCFS */
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2a5a3fe765..5fb8ac9483 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9572,7 +9572,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointLogicalRewriteHeap();
+ CheckPointLogicalRewriteHeap((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0);
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 2df5f4f5cd..dda0629db2 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -52,7 +52,7 @@ typedef struct LogicalRewriteMappingData
* ---
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
-void CheckPointLogicalRewriteHeap(void);
+void CheckPointLogicalRewriteHeap(bool shutdown);
void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
--
2.16.6
v2-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (application/octet-stream)
From ac5bec20bd74a61c69103ec35196c3920ea8d050 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v2 6/8] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
---
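A note for reviewers, not part of the patch: the removal rule that moves to the custodian can be sketched in isolation as below. This is only an illustration under stated assumptions. `mapping_is_removable()`, `MAPPING_NAME_FORMAT` and `INVALID_LSN` are hypothetical names that stand in for the inline logic, `LOGICAL_REWRITE_FORMAT` and `InvalidXLogRecPtr`. Unlike the patch, the sketch leaves unparsable names alone instead of raising an ERROR.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define MAPPING_NAME_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr (hypothetical name for this sketch). */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Hypothetical helper: report whether a mapping file may be removed for the
 * given cutoff.  Names that do not parse are simply kept here, whereas the
 * patch reports an ERROR for them.
 */
bool
mapping_is_removable(const char *fname, uint64_t cutoff)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;
    uint64_t lsn;

    if (sscanf(fname, MAPPING_NAME_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* The LSN is stored in the file name as two 32-bit halves. */
    lsn = ((uint64_t) hi) << 32 | lo;

    /* Keep only files at or past a valid cutoff; everything else can go. */
    return !(lsn >= cutoff && cutoff != INVALID_LSN);
}
```

With an invalid cutoff every parsable mapping file is considered removable, which matches the condition used in RemoveOldLogicalRewriteMappings().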
src/backend/access/heap/rewriteheap.c | 83 ++++++++++++++++++++++++++++++-----
src/backend/postmaster/checkpointer.c | 33 ++++++++++++++
src/backend/postmaster/custodian.c | 10 +++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/bgwriter.h | 3 ++
5 files changed, 120 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 986a776bbd..e11f5bfb80 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,10 +116,13 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -1182,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CheckPointSetLogicalRewriteCutoff(cutoff);
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1249,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1283,3 +1284,65 @@ CheckPointLogicalRewriteHeap(void)
}
FreeDir(mappings_dir);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CheckPointGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while (!ShutdownRequestPending &&
+ (mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..0c5563cb4b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,9 @@ typedef struct
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ bool logical_rewrite_mappings_cutoff_set; /* has the cutoff been set? */
+
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1337,3 +1340,33 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+/*
+ * Used by CheckPointLogicalRewriteHeap() to tell the custodian which logical
+ * rewrite mapping files it can remove.
+ */
+void
+CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff)
+{
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->logical_rewrite_mappings_cutoff = cutoff;
+ CheckpointerShmem->logical_rewrite_mappings_cutoff_set = true;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CheckPointGetLogicalRewriteCutoff(bool *value_set)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ cutoff = CheckpointerShmem->logical_rewrite_mappings_cutoff;
+ *value_set = CheckpointerShmem->logical_rewrite_mappings_cutoff_set;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return cutoff;
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fbac2c6add..e3318878d0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -34,6 +34,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -215,6 +216,15 @@ CustodianMain(void)
*/
RemoveOldSerializedSnapshots();
+ /*
+ * Remove logical rewrite mapping files that are no longer needed.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldLogicalRewriteMappings();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 121f552405..2df5f4f5cd 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
void CheckPointLogicalRewriteHeap(void);
+void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b236..bc9f57c93c 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern void CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff);
+extern XLogRecPtr CheckPointGetLogicalRewriteCutoff(bool *value_set);
+
#endif /* _BGWRITER_H */
--
2.16.6
v2-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (application/octet-stream)
From 4e41d80a0ab0f820602b1df998605a2920363323 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v2 5/8] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 72aeb42961..2a5a3fe765 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -56,7 +56,6 @@
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -9573,7 +9572,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a5443f9a21..fbac2c6add 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -38,6 +38,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/proc.h"
@@ -204,6 +205,16 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dbdc172a2b..cf873b519b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -125,6 +125,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -1912,14 +1913,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so that we clean up old
+ * slots at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1942,7 +1942,8 @@ CheckPointSnapBuild(void)
cutoff = redo;
snap_dir = AllocateDir("pg_logical/snapshots");
- while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
+ while (!ShutdownRequestPending &&
+ (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
{
uint32 hi;
uint32 lo;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 82aa86125b..ba7276058d 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.16.6
v2-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (application/octet-stream)
From 5b133a676a64fea4cc7463daa755ad60f7c8789e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v2 4/8] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 13 ++++++++++++-
src/backend/postmaster/postmaster.c | 14 +++++++++-----
src/backend/storage/file/fd.c | 32 +++++++++++++++++++++++---------
3 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 0ba59949bb..a5443f9a21 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -191,7 +191,18 @@ CustodianMain(void)
start_time = (pg_time_t) time(NULL);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1ae2dc179e..b098482496 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1391,9 +1391,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4139,12 +4141,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 633c6eee18..ac56c41562 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,9 +97,12 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/interrupt.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
@@ -1640,9 +1643,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1843,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/*
@@ -3315,7 +3327,8 @@ RemoveStagedPgTempDirs(const char *spc_dir)
struct dirent *de;
dir = AllocateDir(spc_dir);
- while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
{
if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
@@ -3354,7 +3367,8 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
--
2.16.6
v2-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (application/octet-stream)
From 1f19b7887874b484a3b2afe0f3555599e58d426d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v2 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
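A note for reviewers, not part of the patch: the two stages described above can be sketched with plain POSIX calls as below. This is a minimal standalone illustration, not the fd.c implementation. `stage_dir_for_removal()` and `remove_tree()` are hypothetical names, the attempt cap is arbitrary, and the sketch omits the fsync calls and ereport-style error reporting that backend code would need.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Stage 1 (hypothetical helper): rename "dir" to "<dir>_to_remove_<n>",
 * using the first n whose name is not taken.  On success the new name is
 * left in "staged" and 0 is returned; on failure -1 is returned.
 */
int
stage_dir_for_removal(const char *dir, char *staged, size_t len)
{
    struct stat st;

    for (int n = 0; n < 1000000; n++)   /* arbitrary cap for this sketch */
    {
        int written = snprintf(staged, len, "%s_to_remove_%d", dir, n);

        if (written < 0 || (size_t) written >= len)
        {
            errno = ENAMETOOLONG;
            return -1;
        }
        if (lstat(staged, &st) == 0)
            continue;           /* name already taken, try the next one */
        if (errno != ENOENT)
            return -1;
        return rename(dir, staged);
    }
    errno = EEXIST;
    return -1;
}

/*
 * Stage 2 (hypothetical helper): recursively delete "path" and everything
 * below it.  Returns 0 on success and -1 on the first failure.
 */
int
remove_tree(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    int rc = 0;

    if (dir == NULL)
        return -1;

    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char child[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        if (snprintf(child, sizeof(child), "%s/%s", path, de->d_name)
            >= (int) sizeof(child))
            rc = -1;            /* path too long for this sketch */
        else if (lstat(child, &st) != 0)
            rc = -1;
        else if (S_ISDIR(st.st_mode))
            rc = remove_tree(child);
        else
            rc = unlink(child);
    }
    closedir(dir);

    return rc == 0 ? rmdir(path) : -1;
}
```

The point of the split is that stage 1 costs one rename per directory no matter how many files it holds, so only stage 2 scales with the number of leftover files.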
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 176 +++++++++++++++++++++++++++++++-----
src/include/storage/fd.h | 2 +-
3 files changed, 162 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 51613aaa2a..1ae2dc179e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1392,7 +1392,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4140,7 +4141,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 545e91978c..633c6eee18 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,121 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ return;
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 762f6b46c1..85fa987aca 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.16.6
v2-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (application/octet-stream)
From 2546e9ed46c218903773aa50eaa1056b475a1a7b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v2 2/8] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 635313cdb7..51613aaa2a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1117,7 +1117,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..545e91978c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..762f6b46c1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index bf95a30761..481f1f23a2 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -143,7 +143,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -241,7 +242,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.16.6
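The rename-then-remove staging used in the two patches above can be sketched outside the backend. This is a minimal POSIX sketch under stated assumptions, not the patch's code: stage_dir_for_removal() and DEMO_MAXPATH are hypothetical names, and plain stat()/rename() stand in for fd.c facilities such as AllocateDir() and ereport(). Only the "_to_remove_" naming is taken from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

#define DEMO_MAXPATH 1024       /* hypothetical buffer size for this sketch */

/*
 * Rename "parent/name" to the first unused "parent/name_to_remove_N".
 *
 * Returns 0 if the directory was staged or if there was nothing to stage,
 * and -1 on any other failure.  After a successful rename, "staged" holds
 * the new path; otherwise it is set to the empty string.
 */
static int
stage_dir_for_removal(const char *parent, const char *name,
                      char *staged, size_t staged_len)
{
    char        src[DEMO_MAXPATH];
    struct stat st;

    assert(parent != NULL && name != NULL && staged != NULL);
    assert(staged_len > 0);
    staged[0] = '\0';

    snprintf(src, sizeof(src), "%s/%s", parent, name);
    if (stat(src, &st) != 0)
        return (errno == ENOENT) ? 0 : -1;

    /* Stop before INT_MAX so the counter can never overflow. */
    for (int n = 0; n < INT_MAX; n++)
    {
        snprintf(staged, staged_len, "%s/%s_to_remove_%d", parent, name, n);

        if (stat(staged, &st) == 0)
            continue;           /* name already taken, try the next one */

        if (errno == ENOENT && rename(src, staged) == 0)
            return 0;           /* staged successfully */

        break;                  /* unexpected stat() or rename() failure */
    }

    staged[0] = '\0';
    return -1;
}
```

Because the rename is a single metadata operation, the caller can continue immediately and leave the per-file unlink work to a later pass over every "_to_remove_" directory.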
Hi,
On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote:
> As promised, here is v2. This patch set includes handling for all
> four tasks noted upthread. I'd still consider this a work-in-
> progress, as I've done minimal testing. At the very least, it should
> demonstrate what an auxiliary process approach might look like.
This generates a compiler warning:
https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378
Greetings,
Andres Freund
On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
> On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote:
> > As promised, here is v2. This patch set includes handling for all
> > four tasks noted upthread. I'd still consider this a work-in-
> > progress, as I've done minimal testing. At the very least, it should
> > demonstrate what an auxiliary process approach might look like.
> This generates a compiler warning:
> https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378
Somehow, I am not getting these compiler warnings on the latest master
head (69872d0bbe6).
Here are a few minor comments on the v2 version that I thought would help:
+ * Copyright (c) 2021, PostgreSQL Global Development Group
Time to change the year :)
--
+
+ /* These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
Incorrect formatting, the first line should be empty in the multiline
code comment.
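For reference, the layout being asked for puts the opening delimiter on its own line. The snippet below is only an illustration of that comment style: demo_cleanup_after_error() is a hypothetical placeholder, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placeholder; only the comment layout matters here. */
static bool
demo_cleanup_after_error(int resources_held)
{
    assert(resources_held >= 0);

    /*
     * These operations are really just a minimal subset of
     * AbortTransaction().  We don't have very many resources to worry
     * about.
     */
    return resources_held == 0;
}
```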
--
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ XLogRecPtr logical_rewrite_mappings_cutoff_set;
Looks like logical_rewrite_mappings_cutoff gets set only once and
never gets reset. If that is true, then I think that variable can be
skipped completely, and setting the initial logical_rewrite_mappings_cutoff
to InvalidXLogRecPtr will do the needful.
--
Regards,
Amul
Thanks for your review.
On 1/2/22, 11:00 PM, "Amul Sul" <sulamul@gmail.com> wrote:
> On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote:
> > This generates a compiler warning:
> > https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378
> Somehow, I am not getting these compiler warnings on the latest master
> head (69872d0bbe6).
I attempted to fix this by including time.h in custodian.c.
> Here are a few minor comments on the v2 version that I thought would help:
> + * Copyright (c) 2021, PostgreSQL Global Development Group
> Time to change the year :)
Fixed in v3.
> +
> + /* These operations are really just a minimal subset of
> + * AbortTransaction(). We don't have very many resources to worry
> + * about.
> + */
> Incorrect formatting, the first line should be empty in the multiline
> code comment.
Fixed in v3.
> + XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
> + XLogRecPtr logical_rewrite_mappings_cutoff_set;
> Looks like logical_rewrite_mappings_cutoff gets set only once and
> never gets reset. If that is true, then I think that variable can be
> skipped completely, and setting the initial logical_rewrite_mappings_cutoff
> to InvalidXLogRecPtr will do the needful.
I think the problem with this is that when the cutoff is
InvalidXLogRecPtr, it is taken to mean that all logical rewrite files
can be removed. If we just used the cutoff variable, we could remove
files we need if the custodian ran before the cutoff was set. I
suppose we could initially set the cutoff to MaxXLogRecPtr to indicate
that the value is not yet set, but I see no real advantage to doing it
that way versus just using a bool. Speaking of which,
logical_rewrite_mappings_cutoff_set obviously should be a bool. I've
fixed that in v3.
Nathan
Attachments:
v3-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch (application/octet-stream)
From 139893fe7b5f674bd61cc7edadce6a4dd40311d3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Tue, 14 Dec 2021 18:40:12 +0000
Subject: [PATCH v3 8/8] Move removal of spilled logical slot data to
custodian.
If there are many such files, startup can take much longer than
necessary. To handle this, startup creates a new slot directory,
copies the state file, and swaps the new directory with the old
one. The custodian then asynchronously cleans up the old slot
directory.
---
src/backend/access/transam/xlog.c | 15 +-
src/backend/postmaster/custodian.c | 14 ++
src/backend/replication/logical/reorderbuffer.c | 292 +++++++++++++++++++++++-
src/backend/replication/slot.c | 4 +
src/include/replication/reorderbuffer.h | 1 +
5 files changed, 317 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ac309f83d9..8de0174e0e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7156,18 +7156,21 @@ StartupXLOG(void)
checkPoint.newestCommitTsXid);
XLogCtl->ckptFullXid = checkPoint.nextXid;
- /*
- * Initialize replication slots, before there's a chance to remove
- * required resources.
- */
- StartupReplicationSlots();
-
/*
* Startup logical state, needs to be setup now so we have proper data
* during crash recovery.
+ *
+ * NB: This also performs some important cleanup that must be done prior to
+ * other replication slot steps (e.g., StartupReplicationSlots()).
*/
StartupReorderBuffer();
+ /*
+ * Initialize replication slots, before there's a chance to remove
+ * required resources.
+ */
+ StartupReplicationSlots();
+
/*
* Startup CLOG. This must be done after ShmemVariableCache->nextXid has
* been initialized and before we accept connections or begin WAL replay.
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 9c5479b5cf..fdc614b1bd 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -41,6 +41,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -209,6 +210,19 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove any replication slot directories that have been staged for
+ * deletion. Since slot directories can accumulate many files, removing
+ * all of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the slot directories, and we clean them up here.
+ *
+ * Replication slot directories are not staged or cleaned in single-user
+ * mode, so we don't need any extra handling outside of the custodian
+ * process for this.
+ */
+ RemoveStagedSlotDirectories();
+
/*
* Remove serialized snapshots that are no longer required by any
* logical replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7aa5647a2c..a43e98d848 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -91,15 +91,19 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "catalog/catalog.h"
+#include "common/string.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/slot.h"
#include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */
#include "storage/bufmgr.h"
+#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/combocid.h"
@@ -255,12 +259,15 @@ static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared);
static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+static void ReorderBufferCleanup(const char *slotname);
static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
TransactionId xid, XLogSegNo segno);
static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
ReorderBufferTXN *txn, CommandId cid);
+static void StageSlotDirForRemoval(const char *slotname, const char *slotpath);
+static void RemoveStagedSlotDirectory(const char *path);
/*
* ---------------------------------------
@@ -4430,6 +4437,202 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
FreeDir(spill_dir);
}
+/*
+ * Cleanup everything in the logical slot directory except for the "state" file.
+ * This is specially written for StartupReorderBuffer(), which has special logic
+ * to handle crashes at inconvenient times.
+ *
+ * NB: If anything except for the "state" file cannot be removed after startup,
+ * this will need to be updated.
+ */
+static void
+ReorderBufferCleanup(const char *slotname)
+{
+ char path[MAXPGPATH];
+ char newpath[MAXPGPATH];
+ char statepath[MAXPGPATH];
+ char newstatepath[MAXPGPATH];
+ struct stat statbuf;
+
+ sprintf(path, "pg_replslot/%s", slotname);
+ sprintf(newpath, "pg_replslot/%s.new", slotname);
+ sprintf(statepath, "pg_replslot/%s/state", slotname);
+ sprintf(newstatepath, "pg_replslot/%s.new/state", slotname);
+
+ /* we're only handling directories here, skip if it's not ours */
+ if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+ return;
+
+ /*
+ * Build our new slot directory, suffixed with ".new". The caller (likely
+ * StartupReorderBuffer()) should have already ensured that any pre-existing
+ * ".new" directories leftover after a crash have been cleaned up.
+ */
+ if (MakePGDirectory(newpath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", newpath)));
+
+ copy_file(statepath, newstatepath);
+
+ fsync_fname(newstatepath, false);
+ fsync_fname(newpath, true);
+ fsync_fname("pg_replslot", true);
+
+ /*
+ * Move the slot directory aside for cleanup by the custodian. After this
+ * step, there will be no slot directory. StartupReorderBuffer() has
+ * special logic to make sure we don't lose the slot if we crash at this
+ * point.
+ */
+ StageSlotDirForRemoval(slotname, path);
+
+ /*
+ * Move our ".new" directory to become our new slot directory.
+ */
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", newpath)));
+
+ fsync_fname(path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * This function renames the given directory with a special suffix that the
+ * custodian will know to look for. An integer is appended to the end of the
+ * new directory name in case previously staged slot directories have not yet
+ * been removed.
+ */
+static void
+StageSlotDirForRemoval(const char *slotname, const char *slotpath)
+{
+ char stage_path[MAXPGPATH];
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ DIR *dir;
+
+ sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ ereport(ERROR,
+ (errmsg("could not stage \"%s\" for deletion",
+ slotpath)));
+
+ /*
+ * Rename the slot directory.
+ */
+ if (rename(slotpath, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", slotpath)));
+
+ fsync_fname(stage_path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * Remove slot directories that have been staged for deletion by
+ * ReorderBufferCleanup().
+ */
+void
+RemoveStagedSlotDirectories(void)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir("pg_replslot");
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, "pg_replslot")) != NULL)
+ {
+ struct stat st;
+ char path[MAXPGPATH];
+
+ if (strstr(de->d_name, ".to_remove") == NULL)
+ continue;
+
+ sprintf(path, "pg_replslot/%s", de->d_name);
+ if (lstat(path, &st) != 0)
+ ereport(ERROR,
+ (errmsg("could not stat file \"%s\": %m", path)));
+
+ if (!S_ISDIR(st.st_mode))
+ continue;
+
+ RemoveStagedSlotDirectory(path);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Removes one slot directory that has been staged for deletion by
+ * ReorderBufferCleanup(). If a shutdown request is pending, exit as soon as
+ * possible.
+ */
+static void
+RemoveStagedSlotDirectory(const char *path)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(path);
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, path)) != NULL)
+ {
+ struct stat st;
+ char filepath[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ sprintf(filepath, "%s/%s", path, de->d_name);
+
+ if (lstat(filepath, &st) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", filepath)));
+ else if (S_ISDIR(st.st_mode))
+ RemoveStagedSlotDirectory(filepath);
+ else if (unlink(filepath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", filepath)));
+ }
+ FreeDir(dir);
+
+ if (rmdir(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m", path)));
+}
+
/*
* Given a replication slot, transaction ID and segment number, fill in the
* corresponding spill file into 'path', which is a caller-owned buffer of size
@@ -4458,6 +4661,83 @@ StartupReorderBuffer(void)
DIR *logical_dir;
struct dirent *logical_de;
+ /*
+ * First, handle any ".new" directories that were leftover after a crash.
+ * These are created and swapped with the actual replication slot
+ * directories so that cleanup of spilled data can be done asynchronously by
+ * the custodian.
+ */
+ logical_dir = AllocateDir("pg_replslot");
+ while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
+ {
+ char name[NAMEDATALEN];
+ char path[NAMEDATALEN + 12];
+ struct stat statbuf;
+
+ if (strcmp(logical_de->d_name, ".") == 0 ||
+ strcmp(logical_de->d_name, "..") == 0)
+ continue;
+
+ /*
+ * Make sure it's a valid ".new" directory.
+ */
+ if (!pg_str_endswith(logical_de->d_name, ".new") ||
+ strlen(logical_de->d_name) >= NAMEDATALEN + 4)
+ continue;
+
+ strncpy(name, logical_de->d_name, sizeof(name));
+ name[strlen(logical_de->d_name) - 4] = '\0';
+ if (!ReplicationSlotValidateName(name, DEBUG2))
+ continue;
+
+ sprintf(path, "pg_replslot/%s", name);
+ if (lstat(path, &statbuf) == 0)
+ {
+ if (!S_ISDIR(statbuf.st_mode))
+ continue;
+
+ /*
+ * If the original directory still exists, just delete the ".new"
+ * directory. We'll try again when we call ReorderBufferCleanup()
+ * later on.
+ */
+ if (!rmtree(path, true))
+ ereport(ERROR,
+ (errmsg("could not remove directory \"%s\"", path)));
+ }
+ else if (errno == ENOENT)
+ {
+ char newpath[NAMEDATALEN + 16];
+
+ /*
+ * If the original directory is gone, we need to rename the ".new"
+ * directory to take its place. We know that the ".new" directory
+ * is ready to be the real deal if we previously made it far enough
+ * to delete the original directory.
+ */
+ sprintf(newpath, "pg_replslot/%s", logical_de->d_name);
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ newpath, path)));
+
+ fsync_fname(path, true);
+ }
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+
+ fsync_fname("pg_replslot", true);
+ }
+ FreeDir(logical_dir);
+
+ /*
+ * Now we can proceed with deleting all spilled data. (This actually just
+ * moves the directories aside so that the custodian can clean it up
+ * asynchronously.)
+ */
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{
@@ -4470,12 +4750,18 @@ StartupReorderBuffer(void)
continue;
/*
- * ok, has to be a surviving logical slot, iterate and delete
- * everything starting with xid-*
+ * ok, has to be a surviving logical slot, delete everything except for
+ * state
*/
- ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
+ ReorderBufferCleanup(logical_de->d_name);
}
FreeDir(logical_dir);
+
+ /*
+ * Wake up the custodian so it cleans up our old slot data.
+ */
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/* ---------------------------------------
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..e9dbc18e22 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1430,6 +1430,10 @@ StartupReplicationSlots(void)
continue;
}
+ /* if it's an old slot directory that's staged for removal, ignore it */
+ if (strstr(replication_de->d_name, ".to_remove") != NULL)
+ continue;
+
/* looks like a slot in a normal state, restore */
RestoreSlotFromDisk(replication_de->d_name);
}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b40ff75f7..807b7038d9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -681,5 +681,6 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
void StartupReorderBuffer(void);
+void RemoveStagedSlotDirectories(void);
#endif
--
2.16.6
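The crash-recovery rule that the comments in StartupReorderBuffer() describe for leftover ".new" directories can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the patch's code: DemoNewDirAction and demo_classify_new_dir() are hypothetical names, plain stat() replaces the backend's file helpers, and the sketch only decides what to do without renaming or deleting anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

typedef enum DemoNewDirAction
{
    DEMO_NO_NEW_DIR,            /* no "<slot>.new" directory is present */
    DEMO_DISCARD_NEW,           /* original still exists: drop ".new", redo later */
    DEMO_PROMOTE_NEW,           /* original is gone: rename ".new" into place */
    DEMO_STAT_FAILED            /* unexpected error; a real caller would report it */
} DemoNewDirAction;

static DemoNewDirAction
demo_classify_new_dir(const char *slot_path)
{
    char        new_path[1024];
    struct stat st;

    assert(slot_path != NULL);

    snprintf(new_path, sizeof(new_path), "%s.new", slot_path);
    if (stat(new_path, &st) != 0)
        return (errno == ENOENT) ? DEMO_NO_NEW_DIR : DEMO_STAT_FAILED;
    if (!S_ISDIR(st.st_mode))
        return DEMO_NO_NEW_DIR;

    /*
     * The original directory is only moved aside after ".new" is complete
     * and synced, so a missing original implies ".new" is ready to use.
     */
    if (stat(slot_path, &st) == 0)
        return DEMO_DISCARD_NEW;
    return (errno == ENOENT) ? DEMO_PROMOTE_NEW : DEMO_STAT_FAILED;
}
```

The ordering is what makes the swap safe in this scheme: at every crash point either the original directory or a complete ".new" copy holds the slot's state file.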
v3-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch (application/octet-stream)
From eba54838edc28317089ad04cad88f7a23710f01e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 13 Dec 2021 20:20:12 -0800
Subject: [PATCH v3 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
shutdown and end-of-recovery checkpoints.
This may save quite a bit of time when there are many mapping files
to flush to disk.
---
src/backend/access/heap/rewriteheap.c | 35 ++++++++++++++++++++++++++++++++++-
src/backend/access/transam/xlog.c | 2 +-
src/include/access/rewriteheap.h | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e11f5bfb80..d697d46fbb 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1193,7 +1193,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* ---
*/
void
-CheckPointLogicalRewriteHeap(void)
+CheckPointLogicalRewriteHeap(bool shutdown)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1219,6 +1219,39 @@ CheckPointLogicalRewriteHeap(void)
if (ProcGlobal->custodianLatch)
SetLatch(ProcGlobal->custodianLatch);
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
+ if (shutdown)
+ {
+ int fd;
+
+ fd = OpenTransientFile("pg_logical/mappings", O_RDONLY);
+ if (fd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"pg_logical/mappings\": %m")));
+
+ if (syncfs(fd) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not synchronize file system for file \"pg_logical/mappings\": %m")));
+
+ if (CloseTransientFile(fd) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"pg_logical/mappings\": %m")));
+
+ return;
+ }
+
+#endif /* HAVE_SYNCFS */
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 222392ca8d..ac309f83d9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9569,7 +9569,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointLogicalRewriteHeap();
+ CheckPointLogicalRewriteHeap(flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY));
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 2df5f4f5cd..dda0629db2 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -52,7 +52,7 @@ typedef struct LogicalRewriteMappingData
* ---
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
-void CheckPointLogicalRewriteHeap(void);
+void CheckPointLogicalRewriteHeap(bool shutdown);
void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
--
2.16.6
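
A note on the call-site change in CheckPointGuts(): the new bool argument is derived by masking the checkpoint flags. The following is a minimal sketch of that test, not code from the patch. The helper name is hypothetical, and the flag values are stand-ins I am assuming for illustration (the real definitions live in PostgreSQL's xlog.h).

```c
#include <stdbool.h>

/* Stand-in flag values, assumed for illustration only. */
#define CHECKPOINT_IS_SHUTDOWN      0x0001
#define CHECKPOINT_END_OF_RECOVERY  0x0002
#define CHECKPOINT_IMMEDIATE        0x0004

/*
 * Hypothetical helper: true when the checkpoint is a shutdown or
 * end-of-recovery checkpoint, i.e. the cases in which the patch
 * takes the syncfs() path.
 */
bool
checkpoint_is_shutdown_like(int flags)
{
    /* Compare against zero so the result is a clean boolean. */
    return (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0;
}
```

The explicit comparison is only a stylistic choice here; it keeps the result independent of how bool happens to be defined.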
Attachment: v3-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (application/octet-stream)
From 4904872953ddea12e7478373e497bb6265b9f6d0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v3 6/8] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
---
src/backend/access/heap/rewriteheap.c | 83 ++++++++++++++++++++++++++++++-----
src/backend/postmaster/checkpointer.c | 33 ++++++++++++++
src/backend/postmaster/custodian.c | 10 +++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/bgwriter.h | 3 ++
5 files changed, 120 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 986a776bbd..e11f5bfb80 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,10 +116,13 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -1182,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CheckPointSetLogicalRewriteCutoff(cutoff);
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1249,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1283,3 +1284,65 @@ CheckPointLogicalRewriteHeap(void)
}
FreeDir(mappings_dir);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CheckPointGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while (!ShutdownRequestPending &&
+ (mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..7d7d60f040 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,9 @@ typedef struct
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ bool logical_rewrite_mappings_cutoff_set;
+
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1337,3 +1340,33 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+/*
+ * Used by CheckPointLogicalRewriteHeap() to tell the custodian which logical
+ * rewrite mapping files it can remove.
+ */
+void
+CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff)
+{
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->logical_rewrite_mappings_cutoff = cutoff;
+ CheckpointerShmem->logical_rewrite_mappings_cutoff_set = true;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CheckPointGetLogicalRewriteCutoff(bool *value_set)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ cutoff = CheckpointerShmem->logical_rewrite_mappings_cutoff;
+ *value_set = CheckpointerShmem->logical_rewrite_mappings_cutoff_set;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return cutoff;
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 0f4dbdd669..9c5479b5cf 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -36,6 +36,7 @@
#include <time.h>
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -218,6 +219,15 @@ CustodianMain(void)
*/
RemoveOldSerializedSnapshots();
+ /*
+ * Remove logical rewrite mapping files that are no longer needed.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldLogicalRewriteMappings();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 121f552405..2df5f4f5cd 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
void CheckPointLogicalRewriteHeap(void);
+void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b236..bc9f57c93c 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern void CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff);
+extern XLogRecPtr CheckPointGetLogicalRewriteCutoff(bool *value_set);
+
#endif /* _BGWRITER_H */
--
2.16.6
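
To illustrate the removal rule that RemoveOldLogicalRewriteMappings() applies above, here is a standalone sketch. It reuses the LOGICAL_REWRITE_FORMAT string from the patch, but the function name, the return convention, and the use of plain uint64_t (with 0 standing in for InvalidXLogRecPtr) are my assumptions for the example, not part of the patch.

```c
#include <stdint.h>
#include <stdio.h>

#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/*
 * Hypothetical helper: decide whether a mapping file may be removed.
 * Returns 1 if removable, 0 if it must be kept, -1 if the name does
 * not parse.  A cutoff of 0 stands in for InvalidXLogRecPtr, in which
 * case every mapping file is removable.
 */
int
mapping_is_removable(const char *name, uint64_t cutoff)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;
    uint64_t     lsn;

    if (sscanf(name, LOGICAL_REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return -1;

    /* The LSN is stored in the file name as two 32-bit hex halves. */
    lsn = ((uint64_t) hi) << 32 | lo;

    /* Keep only files at or beyond a valid cutoff. */
    if (lsn >= cutoff && cutoff != 0)
        return 0;
    return 1;
}
```

This is the same condition the checkpointer uses, inverted, to pick the files it still needs to flush.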
Attachment: v3-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (application/octet-stream)
From 49090e3ad10112ec042f0c22fcbfce0a70fb8609 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v3 5/8] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87cd05c945..222392ca8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -56,7 +56,6 @@
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -9570,7 +9569,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 79bc4a7065..0f4dbdd669 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -40,6 +40,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/proc.h"
@@ -207,6 +208,16 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dbdc172a2b..cf873b519b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -125,6 +125,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -1912,14 +1913,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1942,7 +1942,8 @@ CheckPointSnapBuild(void)
cutoff = redo;
snap_dir = AllocateDir("pg_logical/snapshots");
- while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
+ while (!ShutdownRequestPending &&
+ (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
{
uint32 hi;
uint32 lo;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 82aa86125b..ba7276058d 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.16.6
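
The loop change above (checking ShutdownRequestPending before each ReadDir() call) is what lets the custodian abandon a long directory scan when a shutdown is requested. A minimal sketch of the same pattern with plain POSIX calls follows. The flag and function names are hypothetical stand-ins, not PostgreSQL APIs.

```c
#include <dirent.h>
#include <signal.h>
#include <string.h>

/* Stand-in for ShutdownRequestPending; normally set from a signal handler. */
volatile sig_atomic_t shutdown_request_pending = 0;

/*
 * Hypothetical helper: visit the entries of a directory, stopping early
 * if a shutdown has been requested.  Returns the number of entries
 * visited, or -1 if the directory could not be opened.
 */
int
scan_dir_interruptibly(const char *path)
{
    DIR           *dir = opendir(path);
    struct dirent *de;
    int            visited = 0;

    if (dir == NULL)
        return -1;

    /* Test the flag before each readdir() so the scan ends promptly. */
    while (!shutdown_request_pending && (de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 ||
            strcmp(de->d_name, "..") == 0)
            continue;

        visited++;              /* a real caller would unlink() the entry here */
    }

    closedir(dir);
    return visited;
}
```

Stopping mid-scan is safe for this kind of cleanup because whatever is left over can be picked up on the next pass.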
Attachment: v3-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (application/octet-stream)
From 30abe5d0ea016b26d7f5c81a61d649e617d39fd7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v3 4/8] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 13 ++++++++++++-
src/backend/postmaster/postmaster.c | 14 +++++++++-----
src/backend/storage/file/fd.c | 32 +++++++++++++++++++++++---------
3 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index dd86f0f5ce..79bc4a7065 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -194,7 +194,18 @@ CustodianMain(void)
start_time = (pg_time_t) time(NULL);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1ae2dc179e..b098482496 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1391,9 +1391,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4139,12 +4141,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 633c6eee18..ac56c41562 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,9 +97,12 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/interrupt.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
@@ -1640,9 +1643,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1843,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/*
@@ -3315,7 +3327,8 @@ RemoveStagedPgTempDirs(const char *spc_dir)
struct dirent *de;
dir = AllocateDir(spc_dir);
- while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
{
if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
@@ -3354,7 +3367,8 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
--
2.16.6
Attachment: v3-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (application/octet-stream)
From 52e20b478fde85533803b9cdf5fc9753be6c8e76 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v3 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 176 +++++++++++++++++++++++++++++++-----
src/include/storage/fd.h | 2 +-
3 files changed, 162 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 51613aaa2a..1ae2dc179e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1392,7 +1392,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4140,7 +4141,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 545e91978c..633c6eee18 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,121 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ return;
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 762f6b46c1..85fa987aca 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.16.6
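
For readers who want the staging idea from StagePgTempDirForRemoval() in isolation, here is a rough standalone sketch using plain POSIX calls instead of AllocateDir() and ereport(). The function name, the fixed-size buffer, and the MAX_STAGE_ATTEMPTS bound are assumptions made for this example. The patch itself probes with AllocateDir() and logs failures rather than returning a code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

#define TEMP_DIR_NAME       "pgsql_tmp"
#define STAGE_PREFIX        TEMP_DIR_NAME "_to_remove_"
#define MAX_STAGE_ATTEMPTS  1000    /* arbitrary bound chosen for this sketch */

/*
 * Hypothetical helper: rename parent/pgsql_tmp to
 * parent/pgsql_tmp_to_remove_N, using the lowest N that is not taken.
 * On success, returns N and leaves the new path in "staged".
 * Returns -1 if there is nothing to stage or an error occurs.
 */
int
stage_temp_dir(const char *parent, char *staged, size_t staged_len)
{
    char        src[1024];
    struct stat st;

    snprintf(src, sizeof(src), "%s/%s", parent, TEMP_DIR_NAME);
    if (stat(src, &st) != 0)
        return -1;              /* nothing to stage, or not accessible */

    for (int n = 0; n < MAX_STAGE_ATTEMPTS; n++)
    {
        snprintf(staged, staged_len, "%s/%s%d", parent, STAGE_PREFIX, n);

        if (stat(staged, &st) == 0)
            continue;           /* name already used by an earlier staging */
        if (errno != ENOENT)
            return -1;          /* unexpected error while probing */

        /* Found a free name: the rename is the only work done at startup. */
        return rename(src, staged) == 0 ? n : -1;
    }

    return -1;                  /* no free name within the bound */
}
```

The point of the design is that startup pays for one rename per tablespace, while the per-file unlink() calls happen later in the background.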
Attachment: v3-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (application/octet-stream)
From b3317931a271d0992c5dcc90a1605ee18f99f4a8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v3 2/8] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 635313cdb7..51613aaa2a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1117,7 +1117,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..545e91978c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..762f6b46c1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index bf95a30761..481f1f23a2 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -143,7 +143,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -241,7 +242,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.16.6
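To make the behavioral change in 0002 concrete, here is a minimal standalone sketch of the same shape: the directory's contents are removed first, and the final rmdir() happens at the end of the function itself rather than in the caller's recursion branch. This is not the patch's code. The name remove_dir_recursively() and its return value are hypothetical, errors go to stderr instead of ereport(), and the unlink_all / temp-file-prefix filtering that RemovePgTempDir() performs is deliberately left out.

```c
#define _POSIX_C_SOURCE 200809L	/* expose lstat() under strict -std= modes */

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for RemovePgTempDir(): remove everything below
 * "dirname", then "dirname" itself.  Returns the number of files and
 * directories removed, 0 if the directory is missing and missing_ok is
 * true, or -1 if it cannot be opened.
 */
static int
remove_dir_recursively(const char *dirname, bool missing_ok)
{
	DIR		   *dir;
	struct dirent *de;
	char		path[4096];
	struct stat st;
	int			removed = 0;

	assert(dirname != NULL);

	dir = opendir(dirname);
	if (dir == NULL)
	{
		if (errno == ENOENT && missing_ok)
			return 0;
		fprintf(stderr, "could not open directory \"%s\": %s\n",
				dirname, strerror(errno));
		return -1;
	}

	while ((de = readdir(dir)) != NULL)
	{
		int			len;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		len = snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
		if (len < 0 || (size_t) len >= sizeof(path))
			continue;			/* path too long for the buffer; skip it */

		if (lstat(path, &st) < 0)
		{
			fprintf(stderr, "could not stat \"%s\": %s\n",
					path, strerror(errno));
			continue;
		}

		if (S_ISDIR(st.st_mode))
		{
			/* the callee now removes the subdirectory itself as well */
			int			n = remove_dir_recursively(path, false);

			if (n > 0)
				removed += n;
		}
		else if (unlink(path) == 0)
			removed++;
		else
			fprintf(stderr, "could not remove file \"%s\": %s\n",
					path, strerror(errno));
	}
	closedir(dir);

	/* after the patch, this step lives here instead of in the caller */
	if (rmdir(dirname) == 0)
		removed++;
	else
		fprintf(stderr, "could not remove directory \"%s\": %s\n",
				dirname, strerror(errno));

	return removed;
}
```

With this shape, the top-level call on a pgsql_tmp directory removes the directory too, which is why the TAP test in the patch now checks that "pgsql_tmp" no longer appears under base rather than listing its contents.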
v3-0001-Introduce-custodian.patch (application/octet-stream)
From 5c74e736fce5c847e5d5f5e14f807e6b97d2a0f6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v3 1/8] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 ++
src/backend/postmaster/custodian.c | 213 ++++++++++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++++-
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 17 +++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
11 files changed, 300 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 787c6a2c3b..7ec7b23467 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 43497676ab..10626e7029 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..dd86f0f5ce
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,213 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15. Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the custodian to
+ * exit(0). Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <time.h>
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 328ecafa8c..635313cdb7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,6 +250,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -556,6 +557,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1819,13 +1821,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2782,6 +2787,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3109,6 +3116,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3211,6 +3220,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3684,6 +3707,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3887,6 +3922,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3924,6 +3962,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -4017,6 +4056,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4222,6 +4262,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index d1d3cd0dc8..1f3ce5aa67 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e8..530af294d9 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..90c4160d42 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..83089d23ff 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -329,6 +329,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -433,6 +434,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -445,6 +447,7 @@ extern AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..cf0a04ca6c
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 44b477f49d..768c2a6352 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -357,6 +357,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -374,11 +376,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c..08dc9d5caa 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_PGSTAT_MAIN,
--
2.16.6
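As an aside on 0001: the only arithmetic in the custodian's main loop is the sleep-time calculation, and it can be isolated as a pure function for illustration. The sketch below is a hedged restatement, not code from the patch. The helper name custodian_sleep_ms() is hypothetical, and the clamp for a wall clock that steps backwards is an added assumption (the patch uses time(NULL) and has no such guard).

```c
#include <assert.h>
#include <time.h>

#define CUSTODIAN_TIMEOUT_S 300		/* same 5-minute cycle as the patch */

/*
 * Hypothetical helper: how many milliseconds the main loop should sleep
 * after one pass over its tasks.  Returns 0 when the pass itself took at
 * least a full cycle, which corresponds to the "continue" in the patch.
 */
static long
custodian_sleep_ms(time_t start_time, time_t end_time)
{
	long		elapsed_secs = (long) (end_time - start_time);
	long		ms;

	/* Assumption, not in the patch: treat a backwards clock step as zero. */
	if (elapsed_secs < 0)
		elapsed_secs = 0;

	if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
		return 0;				/* no sleep; start the next pass right away */

	ms = (CUSTODIAN_TIMEOUT_S - elapsed_secs) * 1000L;
	assert(ms > 0 && ms <= CUSTODIAN_TIMEOUT_S * 1000L);
	return ms;
}
```

In the patch, the resulting value is what gets passed to WaitLatch() as the timeout, so a SetLatch() from another process can still cut the sleep short.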
The code seems to be in good condition. All the tests are running ok with
no errors.
I like the whole idea of shifting additional checkpointer jobs as much
as possible to another worker. In my view, it is more appropriate to call
this worker "bg cleaner" or "bg file cleaner" or something similar.
It could be useful for systems with high load, which may deal with deleting
many files at once, but I'm not sure about "small" installations. An extra bg
worker needs more resources just to occasionally delete a small number of
files. I really do not know how to do it better; maybe have two
different code paths switched by a GUC?
Should we also think about adding WAL preallocation to the custodian worker
from the patch "Pre-allocating WAL files" [1]?
[1]: /messages/by-id/20201225200953.jjkrytlrzojbndh5@alap3.anarazel.de
--
Best regards,
Maxim Orlov.
On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote:
The code seems to be in good condition. All the tests are running ok with no errors.
Thanks for your review.
I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or something similar.
It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources just to occasionally delete a small number of files. I really do not know how to do it better; maybe have two different code paths switched by a GUC?
I'd personally like to avoid creating two code paths for the same
thing. Are there really cases when this one extra auxiliary process
would be too many? And if so, how would a user know when to adjust
this GUC? I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.
That being said, if making the extra worker optional addresses the
concerns about resource usage, maybe we should consider it. Justin
suggested using something like max_parallel_maintenance_workers
upthread [0].
Should we also think about adding WAL preallocation to the custodian worker from the patch "Pre-allocating WAL files" [1]?
This was brought up in the pre-allocation thread [1]. I don't think
the custodian process would be the right place for it, and I'm also
not as concerned about it because it will generally be a small, fixed,
and configurable amount of work. In any case, I don't sense a ton of
support for a new auxiliary process in this thread, so I'm hesitant to
go down the same path for pre-allocation.
Nathan
[0]: /messages/by-id/20211213171935.GX17618@telsasoft.com
[1]: /messages/by-id/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254@amazon.com
On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote:
On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote:
The code seems to be in good condition. All the tests are running ok with no errors.
Thanks for your review.
I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or something similar.
I personally prefer "background cleaner" as the new process name in
line with "background writer" and "background worker".
It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources just to occasionally delete a small number of files. I really do not know how to do it better; maybe have two different code paths switched by a GUC?
I'd personally like to avoid creating two code paths for the same
thing. Are there really cases when this one extra auxiliary process
would be too many? And if so, how would a user know when to adjust
this GUC? I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.
IMO, having a GUC for enabling/disabling this new worker and its
related code would be a better idea. The reason is that if
postgres has no replication slots at all (which is quite possible in
real stand-alone production environments) or if the file enumeration
(directory traversal and file removal) is fast enough on the servers,
there's no point in having this new worker; the checkpointer itself can
take care of the work as it does today.
That being said, if making the extra worker optional addresses the
concerns about resource usage, maybe we should consider it. Justin
suggested using something like max_parallel_maintenance_workers
upthread [0].
I don't think this new process should be built as part of
max_parallel_maintenance_workers; instead, I prefer to have it as an
auxiliary process much like "background writer", "wal writer" and so
on.
I think now is the time for us to run some use cases and get the
perf reports to see how beneficial this new process is going to be in
terms of improving the checkpoint timings.
Should we also think about adding WAL preallocation to the custodian worker from the patch "Pre-allocating WAL files" [1]?
This was brought up in the pre-allocation thread [1]. I don't think
the custodian process would be the right place for it, and I'm also
not as concerned about it because it will generally be a small, fixed,
and configurable amount of work. In any case, I don't sense a ton of
support for a new auxiliary process in this thread, so I'm hesitant to
go down the same path for pre-allocation.
[0]: /messages/by-id/20211213171935.GX17618@telsasoft.com
[1]: /messages/by-id/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254@amazon.com
I think weaving every non-critical task into a common background
process is a good idea, but let's not mix that up with the new
background cleaner process for now, at least until we have some
numbers proving that the idea proposed here is beneficial.
Regards,
Bharath Rupireddy.
On 1/14/22, 11:26 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I'd personally like to avoid creating two code paths for the same
thing. Are there really cases when this one extra auxiliary process
would be too many? And if so, how would a user know when to adjust
this GUC? I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.
IMO, having a GUC for enabling/disabling this new worker and its
related code would be a better idea. The reason is that if
postgres has no replication slots at all (which is quite possible in
real stand-alone production environments) or if the file enumeration
(directory traversal and file removal) is fast enough on the servers,
there's no point in having this new worker; the checkpointer itself can
take care of the work as it does today.
IMO introducing a GUC wouldn't be doing users many favors. Their
cluster might work just fine for a long time before they begin
encountering problems during startups/checkpoints. Once the user
discovers the underlying reason, they have to then find a GUC for
enabling a special background worker that makes this problem go away.
Why not just fix the problem for everybody by default?
I've been thinking about what other approaches we could take besides
creating more processes. The root of the problem seems to be that
there are a number of tasks that are performed synchronously that can
take a long time. The process approach essentially makes these tasks
asynchronous so that they do not block startup and checkpointing. But
perhaps this can be done in an existing process, possibly even the
checkpointer. Like the current WAL pre-allocation patch, we could do
this work when the checkpointer isn't checkpointing, and we could also
do small amounts of work in CheckpointWriteDelay() (or a new function
called in a similar way). In theory, this would help avoid delaying
checkpoints too long while doing cleanup at every opportunity to lower
the chances it falls far behind.
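To illustrate the "small amounts of work" idea, here is a rough standalone sketch of a cleanup step with a fixed per-call budget, which a delay loop could call repeatedly between its other duties. Nothing here comes from the patch set: cleanup_some_files() is a hypothetical name, the file-count budget is an assumption (a time budget would work equally well), and a real version would apply a removal predicate such as the LSN cutoff instead of removing every regular file.

```c
#define _POSIX_C_SOURCE 200809L	/* expose lstat() under strict -std= modes */

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove at most max_files regular files from
 * "dirname" and return how many were removed, or -1 if the directory
 * cannot be opened.  No state is kept between calls; since removed files
 * disappear from the directory, the next call simply continues with
 * whatever is left.
 */
static int
cleanup_some_files(const char *dirname, int max_files)
{
	DIR		   *dir;
	struct dirent *de;
	char		path[4096];
	struct stat st;
	int			removed = 0;

	assert(dirname != NULL && max_files >= 0);

	dir = opendir(dirname);
	if (dir == NULL)
		return -1;

	/* stop as soon as the budget is used up, even if entries remain */
	while (removed < max_files && (de = readdir(dir)) != NULL)
	{
		int			len;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		len = snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
		if (len < 0 || (size_t) len >= sizeof(path))
			continue;			/* path too long for the buffer; skip it */

		/* only regular files count; leave anything else alone */
		if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
			continue;

		if (unlink(path) == 0)
			removed++;
		else
			perror(path);
	}
	closedir(dir);

	return removed;
}
```

A caller would invoke this once per delay iteration and treat a return value of 0 as "nothing left to do for now". The trade-off is the one described above: each call re-opens and re-scans the directory, so a small budget bounds the delay added to a checkpoint at the cost of more total scanning.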
Nathan
Here is a rebased patch set.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From 74489757ee71f650c277d6fcb3b4504e9ab5e57f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v4 6/8] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
---
src/backend/access/heap/rewriteheap.c | 83 +++++++++++++++++++++++----
src/backend/postmaster/checkpointer.c | 33 +++++++++++
src/backend/postmaster/custodian.c | 10 ++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/bgwriter.h | 3 +
5 files changed, 120 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..c5a1103687 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,10 +116,13 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -1182,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CheckPointSetLogicalRewriteCutoff(cutoff);
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1249,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1287,65 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CheckPointGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while (!ShutdownRequestPending &&
+ (mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23f691cd47..fe0934e5a6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,9 @@ typedef struct
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ bool logical_rewrite_mappings_cutoff_set;
+
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1341,3 +1344,33 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+/*
+ * Used by CheckPointLogicalRewriteHeap() to tell the custodian which logical
+ * rewrite mapping files it can remove.
+ */
+void
+CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff)
+{
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->logical_rewrite_mappings_cutoff = cutoff;
+ CheckpointerShmem->logical_rewrite_mappings_cutoff_set = true;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CheckPointGetLogicalRewriteCutoff(bool *value_set)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ cutoff = CheckpointerShmem->logical_rewrite_mappings_cutoff;
+ *value_set = CheckpointerShmem->logical_rewrite_mappings_cutoff_set;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return cutoff;
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 0f4dbdd669..9c5479b5cf 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -36,6 +36,7 @@
#include <time.h>
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -218,6 +219,15 @@ CustodianMain(void)
*/
RemoveOldSerializedSnapshots();
+ /*
+ * Remove logical rewrite mapping files that are no longer needed.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldLogicalRewriteMappings();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index aa5c48f219..f493094557 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
void CheckPointLogicalRewriteHeap(void);
+void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2882efd67b..051e6732cb 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern void CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff);
+extern XLogRecPtr CheckPointGetLogicalRewriteCutoff(bool *value_set);
+
#endif /* _BGWRITER_H */
--
2.25.1
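For readers following the removal logic in the patch above, here is a standalone sketch of how a mapping file name is parsed and compared against the cutoff. It assumes plain unsigned int and uint64_t in place of Oid, TransactionId and XLogRecPtr. MAPPING_NAME_FORMAT, INVALID_LSN and mapping_is_removable() are hypothetical names used only for illustration, and unlike the patch this sketch returns false rather than raising an ERROR when a name does not parse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same pattern as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define MAPPING_NAME_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Parse a mapping file name and report in *removable whether the file may be
 * removed for the given cutoff.  Returns false if the name does not parse.
 */
bool
mapping_is_removable(const char *name, uint64_t cutoff, bool *removable)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;
    uint64_t     lsn;

    assert(name != NULL && removable != NULL);

    if (sscanf(name, MAPPING_NAME_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* The LSN is stored in the name as two 32-bit hex halves. */
    lsn = ((uint64_t) hi) << 32 | lo;

    /* An invalid cutoff means no slot needs any mapping, so all may go. */
    *removable = (cutoff == INVALID_LSN || lsn < cutoff);
    return true;
}
```

The condition is the negation of the "lsn >= cutoff && cutoff != InvalidXLogRecPtr" test that the checkpointer now uses to decide which files to fsync, so every mapping file falls into exactly one of the two groups.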
Attachment: v4-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 506aa95dd77f16dc64d7fe9c52ca67dd3c10212e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v4 1/8] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 213 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++-
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 17 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
11 files changed, 300 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index dbbeac5a82..1b7aae60f5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 0587e45920..7eae34884d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..dd86f0f5ce
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,213 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15. Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the custodian to
+ * exit(0). Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <time.h>
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ce90877154..0911127471 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,6 +250,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -556,6 +557,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1817,13 +1819,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2780,6 +2785,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3107,6 +3114,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3209,6 +3218,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3682,6 +3705,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3885,6 +3920,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3922,6 +3960,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -4015,6 +4054,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4220,6 +4260,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37f032e7b9..f9df0259fd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..e10cc2d82b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0868e5a24f..8b52757ea6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -274,6 +274,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0abc3ad540..71f522878e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -328,6 +328,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -432,6 +433,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -444,6 +446,7 @@ extern AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..cf0a04ca6c
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a58888f9e9..ad61b4d802 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -357,6 +357,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -374,11 +376,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..1338d06823 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_PGSTAT_MAIN,
--
2.25.1
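The sleep calculation at the bottom of the custodian's main loop in the patch above can be isolated as a small function. This is just a sketch of that arithmetic: custodian_sleep_ms() is a hypothetical helper, and a return value of 0 stands for the "continue without sleeping" branch in the patch.

```c
#define CUSTODIAN_TIMEOUT_S (300)   /* 5 minutes, as in the patch */

/*
 * Given the start and end times of one pass (in seconds), return how long
 * the custodian should wait on its latch, in milliseconds.  Returns 0 when
 * the pass took at least the full timeout, in which case the loop starts
 * the next pass immediately.
 */
long
custodian_sleep_ms(long start_time, long end_time)
{
    long elapsed_secs = end_time - start_time;

    if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
        return 0;               /* no sleep for us */

    return (CUSTODIAN_TIMEOUT_S - elapsed_secs) * 1000L;
}
```

Because the time spent working is subtracted from the timeout, passes begin roughly every five minutes regardless of how long each one takes, unless a latch wakeup starts one sooner.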
Attachment: v4-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From 92e7bde7bab9cf4448f497c9dcd13bf431e8edee Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v4 2/8] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0911127471..c28b5167f7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1115,7 +1115,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..35cb6f7bb6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..525847daea 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 2000f51731..35f27e486e 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -143,7 +143,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -241,7 +242,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
Attachment: v4-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From c001aaf3ae13cd66e945834f1326e1d7a6fca592 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v4 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 176 ++++++++++++++++++++++++----
src/include/storage/fd.h | 2 +-
3 files changed, 162 insertions(+), 24 deletions(-)
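Reviewer note: the staging step described in the commit message amounts to a cheap rename now and an expensive removal later. The sketch below shows one way the rename could work. The numeric suffix and the bounded retry loop are assumptions made for illustration, not necessarily what StagePgTempDirForRemoval() does, and stage_dir_for_removal() is a hypothetical name.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Marker text taken from PG_TEMP_DIR_TO_REMOVE_PREFIX in the patch. */
#define TO_REMOVE_MARKER "_to_remove_"
#define MAX_STAGE_ATTEMPTS 1000     /* arbitrary bound for this sketch */

/*
 * Rename tmp_dir to an unused "<tmp_dir>_to_remove_<n>" name and write that
 * name into "staged".  Returns 0 on success, -1 on failure.
 */
int
stage_dir_for_removal(const char *tmp_dir, char *staged, size_t staged_len)
{
    for (unsigned int n = 0; n < MAX_STAGE_ATTEMPTS; n++)
    {
        struct stat st;
        int         len;

        len = snprintf(staged, staged_len, "%s" TO_REMOVE_MARKER "%u",
                       tmp_dir, n);
        if (len < 0 || (size_t) len >= staged_len)
            return -1;          /* name would not fit */

        if (lstat(staged, &st) == 0)
            continue;           /* left over from an earlier pass, skip it */
        if (errno != ENOENT)
            return -1;

        /* A single rename, however many files the directory holds. */
        return rename(tmp_dir, staged);
    }

    return -1;
}
```

Once the directory has been renamed, new temporary files can be created under a fresh pgsql_tmp without colliding with the old ones, so the slow removal of the staged directory no longer has to finish before startup proceeds.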
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c28b5167f7..a6bc9feabd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1390,7 +1390,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4138,7 +4139,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 35cb6f7bb6..d3019a4b67 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,121 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ return;
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 525847daea..240992ca51 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.25.1
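For readers following along, the staging step in the patch above can be sketched outside of PostgreSQL. This is only a minimal illustration, assuming plain POSIX calls in place of AllocateDir() and ereport(). The helper stage_dir_for_removal() and its 1000-name search limit are hypothetical and do not appear in the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of the patch: rename "<parent>/<name>" to the
 * first unused "<parent>/<name>_to_remove_<n>".  Returns 0 on success, with
 * the chosen path left in "staged", or -1 with errno set.
 */
static int
stage_dir_for_removal(const char *parent, const char *name,
                      char *staged, size_t staged_len)
{
    char        src[1024];
    struct stat st;
    int         len;

    assert(parent != NULL && name != NULL && staged != NULL);

    len = snprintf(src, sizeof(src), "%s/%s", parent, name);
    if (len < 0 || (size_t) len >= sizeof(src))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Bounded search for a free name (the limit of 1000 is an assumption). */
    for (int n = 0; n < 1000; n++)
    {
        len = snprintf(staged, staged_len, "%s/%s_to_remove_%d",
                       parent, name, n);
        if (len < 0 || (size_t) len >= staged_len)
        {
            errno = ENAMETOOLONG;
            return -1;
        }

        if (stat(staged, &st) == 0)
            continue;           /* name already taken, try the next one */
        if (errno != ENOENT)
            return -1;          /* unexpected error while probing */

        /* Free name found: a single rename() is all the caller pays for. */
        return rename(src, staged);
    }

    errno = EEXIST;
    return -1;
}
```

The point of the design is that the cost at startup is one rename() per tablespace, independent of how many temporary files the directory holds. Deleting the contents can then happen later, in another process.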
Attachment: v4-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 601dac12627658902b4413b9a0c318f235b8e48f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v4 4/8] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 13 +++++++++++-
src/backend/postmaster/postmaster.c | 14 ++++++++-----
src/backend/storage/file/fd.c | 32 +++++++++++++++++++++--------
3 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index dd86f0f5ce..79bc4a7065 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -194,7 +194,18 @@ CustodianMain(void)
start_time = (pg_time_t) time(NULL);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a6bc9feabd..a8303a6482 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1389,9 +1389,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4137,12 +4139,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index d3019a4b67..5d39a31d14 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,9 +97,12 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/interrupt.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
@@ -1640,9 +1643,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1843,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/*
@@ -3315,7 +3327,8 @@ RemoveStagedPgTempDirs(const char *spc_dir)
struct dirent *de;
dir = AllocateDir(spc_dir);
- while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
{
if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
@@ -3354,7 +3367,8 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
--
2.25.1
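The ShutdownRequestPending checks added to the directory loops above follow a common pattern: a long-running cleanup polls a flag on every iteration so that shutdown is not held up by a large directory. A rough standalone sketch follows, assuming POSIX directory calls instead of fd.c's wrappers. The names shutdown_requested and remove_staged_dir() are hypothetical stand-ins, not identifiers from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stand-in for ShutdownRequestPending; a signal handler would set this. */
static volatile sig_atomic_t shutdown_requested = 0;

/*
 * Hypothetical helper: unlink the regular files directly inside "path", then
 * remove the directory itself.  Returns the number of files removed, or -1 if
 * the directory could not be opened.  If shutdown is requested, the loop
 * stops early and the directory is left in place for a later pass.
 */
static int
remove_staged_dir(const char *path)
{
    DIR        *dir;
    struct dirent *de;
    int         removed = 0;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while (!shutdown_requested && (de = readdir(dir)) != NULL)
    {
        char        file[1024];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(file, sizeof(file), "%s/%s", path, de->d_name);

        /* Only plain files are handled in this sketch. */
        if (lstat(file, &st) != 0 || !S_ISREG(st.st_mode))
            continue;

        if (unlink(file) == 0)
            removed++;
    }
    closedir(dir);

    /* Try to remove the directory only if the scan was not cut short. */
    if (!shutdown_requested)
        (void) rmdir(path);

    return removed;
}
```

Because the staged directory keeps its special name until it is fully removed, an interrupted pass loses nothing: the next custodian cycle finds the same directory and continues where the previous one stopped.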
Attachment: v4-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From 9f3900ee2423c460209217650fbe32b14f125df9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v4 5/8] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..369e0711f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -56,7 +56,6 @@
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -9569,7 +9568,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 79bc4a7065..0f4dbdd669 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -40,6 +40,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/proc.h"
@@ -207,6 +208,16 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 83fca8a77d..466a6478f3 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -125,6 +125,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -1912,14 +1913,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1942,7 +1942,8 @@ CheckPointSnapBuild(void)
cutoff = redo;
snap_dir = AllocateDir("pg_logical/snapshots");
- while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
+ while (!ShutdownRequestPending &&
+ (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
{
uint32 hi;
uint32 lo;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..55a2beb434 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
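The per-file decision made inside RemoveOldSerializedSnapshots() is not visible in the hunks above, so here is a rough sketch of it. It assumes that snapshot files are named "<hi>-<lo>.snap" after the two hexadecimal halves of an LSN, and that a file is removable when its LSN is below the cutoff. The helper name snapshot_is_removable() is hypothetical, and treating a cutoff of 0 as "nothing is needed" is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: decide whether a file in pg_logical/snapshots may be
 * removed.  Assumes names of the form "<hi>-<lo>.snap" with hexadecimal LSN
 * halves.  A cutoff of 0 is treated as "no slot needs any snapshot".
 */
static bool
snapshot_is_removable(const char *fname, uint64_t cutoff)
{
    unsigned int hi;
    unsigned int lo;
    uint64_t    lsn;

    assert(fname != NULL);

    /* Anything that does not parse as a snapshot name is left alone. */
    if (sscanf(fname, "%X-%X.snap", &hi, &lo) != 2)
        return false;

    lsn = ((uint64_t) hi << 32) | lo;
    return cutoff == 0 || lsn < cutoff;
}
```

For example, with a cutoff of 0x2000000, "0-16B3748.snap" would be removable while "1-0.snap" would be kept. Since each file is judged independently, stopping the loop early on shutdown is harmless and the remainder is picked up on the next pass.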
Attachment: v4-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch (text/x-diff; charset=us-ascii)
From 4a2213f7292aac3ce86a3dc99b2c8ef7ac4e6f2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 13 Dec 2021 20:20:12 -0800
Subject: [PATCH v4 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
shutdown and end-of-recovery checkpoints.
This may save quite a bit of time when there are many mapping files
to flush to disk.
---
src/backend/access/heap/rewriteheap.c | 35 ++++++++++++++++++++++++++-
src/backend/access/transam/xlog.c | 2 +-
src/include/access/rewriteheap.h | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5a1103687..1a8621c0ef 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1193,7 +1193,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* ---
*/
void
-CheckPointLogicalRewriteHeap(void)
+CheckPointLogicalRewriteHeap(bool shutdown)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1219,6 +1219,39 @@ CheckPointLogicalRewriteHeap(void)
if (ProcGlobal->custodianLatch)
SetLatch(ProcGlobal->custodianLatch);
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
+ if (shutdown)
+ {
+ int fd;
+
+ fd = OpenTransientFile("pg_logical/mappings", O_RDONLY);
+ if (fd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"pg_logical/mappings\": %m")));
+
+ if (syncfs(fd) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not synchronize file system for file \"pg_logical/mappings\": %m")));
+
+ if (CloseTransientFile(fd) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"pg_logical/mappings\": %m")));
+
+ return;
+ }
+
+#endif /* HAVE_SYNCFS */
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 369e0711f1..07aaee1c07 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9568,7 +9568,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointLogicalRewriteHeap();
+ CheckPointLogicalRewriteHeap(flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY));
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index f493094557..79cae034e5 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -52,7 +52,7 @@ typedef struct LogicalRewriteMappingData
* ---
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
-void CheckPointLogicalRewriteHeap(void);
+void CheckPointLogicalRewriteHeap(bool shutdown);
void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
--
2.25.1
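The syncfs() branch in the patch above boils down to "open something on the target file system, call syncfs() on the descriptor, close it". A minimal standalone sketch, assuming Linux with glibc (syncfs() is not portable, which is why the patch guards it with HAVE_SYNCFS). The helper name sync_filesystem_of() is hypothetical.

```c
#define _GNU_SOURCE             /* syncfs() is a Linux/glibc extension */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the whole file system that contains "path".
 * Returns 0 on success, -1 on failure.
 */
static int
sync_filesystem_of(const char *path)
{
    int         fd;
    int         rc;

    assert(path != NULL);

    /* A directory can be opened read-only just to obtain a descriptor. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* One call covers every dirty file on that file system. */
    rc = syncfs(fd);

    if (close(fd) != 0 && rc == 0)
        rc = -1;

    return rc;
}
```

The trade-off is the one raised at the top of the thread: a single call replaces one fsync() per mapping file, but it also flushes unrelated dirty data on the same file system, which may cause an IO spike. Restricting it to shutdown and end-of-recovery checkpoints limits when that cost is paid.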
Attachment: v4-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch (text/x-diff; charset=us-ascii)
From fa507e1aa7923bb46b907847c2d6555c78a2219c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Fri, 11 Feb 2022 09:43:57 -0800
Subject: [PATCH v4 8/8] Move removal of spilled logical slot data to
custodian.
If there are many such files, startup can take much longer than
necessary. To handle this, startup creates a new slot directory,
copies the state file, and swaps the new directory with the old
one. The custodian then asynchronously cleans up the old slot
directory.
---
src/backend/access/transam/xlog.c | 15 +-
src/backend/postmaster/custodian.c | 14 +
.../replication/logical/reorderbuffer.c | 292 +++++++++++++++++-
src/backend/replication/slot.c | 4 +
src/include/replication/reorderbuffer.h | 1 +
5 files changed, 317 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 07aaee1c07..4d18798387 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7155,18 +7155,21 @@ StartupXLOG(void)
checkPoint.newestCommitTsXid);
XLogCtl->ckptFullXid = checkPoint.nextXid;
- /*
- * Initialize replication slots, before there's a chance to remove
- * required resources.
- */
- StartupReplicationSlots();
-
/*
* Startup logical state, needs to be setup now so we have proper data
* during crash recovery.
+ *
+ * NB: This also performs some important cleanup that must be done prior to
+ * other replication slot steps (e.g., StartupReplicationSlots()).
*/
StartupReorderBuffer();
+ /*
+ * Initialize replication slots, before there's a chance to remove
+ * required resources.
+ */
+ StartupReplicationSlots();
+
/*
* Startup CLOG. This must be done after ShmemVariableCache->nextXid has
* been initialized and before we accept connections or begin WAL replay.
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 9c5479b5cf..fdc614b1bd 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -41,6 +41,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -209,6 +210,19 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove any replication slot directories that have been staged for
+ * deletion. Since slot directories can accumulate many files, removing
+ * all of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the slot directories, and we clean them up here.
+ *
+ * Replication slot directories are not staged or cleaned in single-user
+ * mode, so we don't need any extra handling outside of the custodian
+ * process for this.
+ */
+ RemoveStagedSlotDirectories();
+
/*
* Remove serialized snapshots that are no longer required by any
* logical replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fa..ab51e41229 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -126,15 +126,19 @@
#include "access/xlog_internal.h"
#include "catalog/catalog.h"
#include "commands/sequence.h"
+#include "common/string.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/slot.h"
#include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */
#include "storage/bufmgr.h"
+#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/combocid.h"
@@ -297,12 +301,15 @@ static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared);
static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+static void ReorderBufferCleanup(const char *slotname);
static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
TransactionId xid, XLogSegNo segno);
static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
ReorderBufferTXN *txn, CommandId cid);
+static void StageSlotDirForRemoval(const char *slotname, const char *slotpath);
+static void RemoveStagedSlotDirectory(const char *path);
/*
* ---------------------------------------
@@ -4835,6 +4842,202 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
FreeDir(spill_dir);
}
+/*
+ * Clean up everything in the logical slot directory except for the "state" file.
+ * This is specially written for StartupReorderBuffer(), which has special logic
+ * to handle crashes at inconvenient times.
+ *
+ * NB: If anything except for the "state" file cannot be removed after startup,
+ * this will need to be updated.
+ */
+static void
+ReorderBufferCleanup(const char *slotname)
+{
+ char path[MAXPGPATH];
+ char newpath[MAXPGPATH];
+ char statepath[MAXPGPATH];
+ char newstatepath[MAXPGPATH];
+ struct stat statbuf;
+
+ sprintf(path, "pg_replslot/%s", slotname);
+ sprintf(newpath, "pg_replslot/%s.new", slotname);
+ sprintf(statepath, "pg_replslot/%s/state", slotname);
+ sprintf(newstatepath, "pg_replslot/%s.new/state", slotname);
+
+ /* we're only handling directories here, skip if it's not ours */
+ if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+ return;
+
+ /*
+ * Build our new slot directory, suffixed with ".new". The caller (likely
+ * StartupReorderBuffer()) should have already ensured that any pre-existing
+ * ".new" directories leftover after a crash have been cleaned up.
+ */
+ if (MakePGDirectory(newpath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", newpath)));
+
+ copy_file(statepath, newstatepath);
+
+ fsync_fname(newstatepath, false);
+ fsync_fname(newpath, true);
+ fsync_fname("pg_replslot", true);
+
+ /*
+ * Move the slot directory aside for cleanup by the custodian. After this
+ * step, there will be no slot directory. StartupReorderBuffer() has
+ * special logic to make sure we don't lose the slot if we crash at this
+ * point.
+ */
+ StageSlotDirForRemoval(slotname, path);
+
+ /*
+ * Move our ".new" directory to become our new slot directory.
+ */
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", newpath)));
+
+ fsync_fname(path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * This function renames the given directory with a special suffix that the
+ * custodian will know to look for. An integer is appended to the end of the
+ * new directory name in case previously staged slot directories have not yet
+ * been removed.
+ */
+static void
+StageSlotDirForRemoval(const char *slotname, const char *slotpath)
+{
+ char stage_path[MAXPGPATH];
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ DIR *dir;
+
+ sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ ereport(ERROR,
+ (errmsg("could not stage \"%s\" for deletion",
+ slotpath)));
+
+ /*
+ * Rename the slot directory.
+ */
+ if (rename(slotpath, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", slotpath)));
+
+ fsync_fname(stage_path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * Remove slot directories that have been staged for deletion by
+ * ReorderBufferCleanup().
+ */
+void
+RemoveStagedSlotDirectories(void)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir("pg_replslot");
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, "pg_replslot")) != NULL)
+ {
+ struct stat st;
+ char path[MAXPGPATH];
+
+ if (strstr(de->d_name, ".to_remove") == NULL)
+ continue;
+
+ sprintf(path, "pg_replslot/%s", de->d_name);
+ if (lstat(path, &st) != 0)
+ ereport(ERROR,
+ (errmsg("could not stat file \"%s\": %m", path)));
+
+ if (!S_ISDIR(st.st_mode))
+ continue;
+
+ RemoveStagedSlotDirectory(path);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Removes one slot directory that has been staged for deletion by
+ * ReorderBufferCleanup(). If a shutdown request is pending, exit as soon as
+ * possible.
+ */
+static void
+RemoveStagedSlotDirectory(const char *path)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(path);
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, path)) != NULL)
+ {
+ struct stat st;
+ char filepath[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ sprintf(filepath, "%s/%s", path, de->d_name);
+
+ if (lstat(filepath, &st) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", filepath)));
+ else if (S_ISDIR(st.st_mode))
+ RemoveStagedSlotDirectory(filepath);
+ else if (unlink(filepath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", filepath)));
+ }
+ FreeDir(dir);
+
+ if (rmdir(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m", path)));
+}
+
/*
* Given a replication slot, transaction ID and segment number, fill in the
* corresponding spill file into 'path', which is a caller-owned buffer of size
@@ -4863,6 +5066,83 @@ StartupReorderBuffer(void)
DIR *logical_dir;
struct dirent *logical_de;
+ /*
+ * First, handle any ".new" directories that were left over after a crash.
+ * These are created and swapped with the actual replication slot
+ * directories so that cleanup of spilled data can be done asynchronously by
+ * the custodian.
+ */
+ logical_dir = AllocateDir("pg_replslot");
+ while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
+ {
+ char name[NAMEDATALEN];
+ char path[NAMEDATALEN + 12];
+ struct stat statbuf;
+
+ if (strcmp(logical_de->d_name, ".") == 0 ||
+ strcmp(logical_de->d_name, "..") == 0)
+ continue;
+
+ /*
+ * Make sure it's a valid ".new" directory.
+ */
+ if (!pg_str_endswith(logical_de->d_name, ".new") ||
+ strlen(logical_de->d_name) >= NAMEDATALEN + 4)
+ continue;
+
+ strncpy(name, logical_de->d_name, sizeof(name));
+ name[strlen(logical_de->d_name) - 4] = '\0';
+ if (!ReplicationSlotValidateName(name, DEBUG2))
+ continue;
+
+ sprintf(path, "pg_replslot/%s", name);
+ if (lstat(path, &statbuf) == 0)
+ {
+ if (!S_ISDIR(statbuf.st_mode))
+ continue;
+
+ /*
+ * If the original directory still exists, just delete the ".new"
+ * directory. We'll try again when we call ReorderBufferCleanup()
+ * later on.
+ */
+ if (!rmtree(path, true))
+ ereport(ERROR,
+ (errmsg("could not remove directory \"%s\"", path)));
+ }
+ else if (errno == ENOENT)
+ {
+ char newpath[NAMEDATALEN + 16];
+
+ /*
+ * If the original directory is gone, we need to rename the ".new"
+ * directory to take its place. We know that the ".new" directory
+ * is ready to be the real deal if we previously made it far enough
+ * to delete the original directory.
+ */
+ sprintf(newpath, "pg_replslot/%s", logical_de->d_name);
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ newpath, path)));
+
+ fsync_fname(path, true);
+ }
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+
+ fsync_fname("pg_replslot", true);
+ }
+ FreeDir(logical_dir);
+
+ /*
+ * Now we can proceed with deleting all spilled data. (This actually just
+ * moves the directories aside so that the custodian can clean them up
+ * asynchronously.)
+ */
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{
@@ -4875,12 +5155,18 @@ StartupReorderBuffer(void)
continue;
/*
- * ok, has to be a surviving logical slot, iterate and delete
- * everything starting with xid-*
+ * ok, has to be a surviving logical slot, delete everything except for
+ * state
*/
- ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
+ ReorderBufferCleanup(logical_de->d_name);
}
FreeDir(logical_dir);
+
+ /*
+ * Wake up the custodian so it cleans up our old slot data.
+ */
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/* ---------------------------------------
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e5e0cf8768..c45f8cf94d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1430,6 +1430,10 @@ StartupReplicationSlots(void)
continue;
}
+ /* if it's an old slot directory that's staged for removal, ignore it */
+ if (strstr(replication_de->d_name, ".to_remove") != NULL)
+ continue;
+
/* looks like a slot in a normal state, restore */
RestoreSlotFromDisk(replication_de->d_name);
}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 859424bbd9..ff56ae0b22 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -719,6 +719,7 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
void StartupReorderBuffer(void);
+void RemoveStagedSlotDirectories(void);
bool ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
RelFileNode rnode, bool created);
--
2.25.1
Here is another rebased patch set.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From c11a6893d2d509df1389a1c03081b6cc563d5683 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v5 1/8] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 214 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++-
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 17 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
11 files changed, 301 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index dbbeac5a82..1b7aae60f5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 0587e45920..7eae34884d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..5f2b647544
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,214 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15. Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the custodian to
+ * exit(0). Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <time.h>
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 735fed490b..a867412268 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,6 +251,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -557,6 +558,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1818,13 +1820,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2781,6 +2786,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3108,6 +3115,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3210,6 +3219,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3683,6 +3706,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3886,6 +3921,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3923,6 +3961,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -4016,6 +4055,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4221,6 +4261,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 90283f8a9f..1e693b69e5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..e10cc2d82b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0868e5a24f..8b52757ea6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -274,6 +274,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0abc3ad540..71f522878e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -328,6 +328,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -432,6 +433,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -444,6 +446,7 @@ extern AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..cf0a04ca6c
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a58888f9e9..ad61b4d802 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -357,6 +357,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -374,11 +376,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..1338d06823 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_PGSTAT_MAIN,
--
2.25.1
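The sleep calculation at the bottom of the custodian's main loop in the patch above can be sketched in isolation as follows. This is a minimal illustration, not the patch's code: custodian_sleep_ms() is a hypothetical helper, a return value of 0 stands in for the "continue; /* no sleep for us */" branch, and the guard against a clock that moves backwards is an addition that the patch does not have.

```c
#include <time.h>

#define CUSTODIAN_TIMEOUT_S 300     /* 5 minutes, as in the patch */

/*
 * Return how long to sleep, in milliseconds, given when the current round
 * of work started and ended.  Returns 0 when the work already used up the
 * whole interval, meaning the caller should start the next round at once.
 */
long
custodian_sleep_ms(time_t start_time, time_t end_time)
{
    long        elapsed_secs = (long) (end_time - start_time);

    /* Assumption not in the patch: treat a backwards clock as no elapsed time. */
    if (elapsed_secs < 0)
        elapsed_secs = 0;

    if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
        return 0;

    return (CUSTODIAN_TIMEOUT_S - elapsed_secs) * 1000L;
}
```

Because backends can set the custodian's latch at any time, the value computed here is only an upper bound on how long the process actually sleeps.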
v5-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a867412268..54f77cb306 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1116,7 +1116,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..35cb6f7bb6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..525847daea 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 6ab3092874..fd93368434 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -138,7 +138,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -236,7 +237,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
v5-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 72baf897a3c733b30bb4bf63e63825b9bc6a6acf Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 176 ++++++++++++++++++++++++----
src/include/storage/fd.h | 2 +-
3 files changed, 162 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54f77cb306..8248d55e23 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1391,7 +1391,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4139,7 +4140,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 35cb6f7bb6..d3019a4b67 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,121 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ return;
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 525847daea..240992ca51 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.25.1
Attachment: v5-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From f0c75b9bdd490ca3290f7fba7cc9fff2423cde30 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v5 4/8] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 13 +++++++++++-
src/backend/postmaster/postmaster.c | 14 ++++++++-----
src/backend/storage/file/fd.c | 32 +++++++++++++++++++++--------
3 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 5f2b647544..5bad0af474 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -195,7 +195,18 @@ CustodianMain(void)
start_time = (pg_time_t) time(NULL);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8248d55e23..56b87d79a3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1390,9 +1390,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize stats collection subsystem (this does NOT start the
@@ -4138,12 +4140,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index d3019a4b67..5d39a31d14 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,9 +97,12 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/interrupt.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
@@ -1640,9 +1643,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1843,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/*
@@ -3315,7 +3327,8 @@ RemoveStagedPgTempDirs(const char *spc_dir)
struct dirent *de;
dir = AllocateDir(spc_dir);
- while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
{
if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
@@ -3354,7 +3367,8 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
--
2.25.1
Attachment: v5-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce78ac413e..c4a80ea82a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,7 +79,6 @@
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6807,7 +6806,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 5bad0af474..8591c5db9b 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -40,6 +40,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -208,6 +209,16 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 83fca8a77d..466a6478f3 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -125,6 +125,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -1912,14 +1913,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old
+ * slots at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1942,7 +1942,8 @@ CheckPointSnapBuild(void)
cutoff = redo;
snap_dir = AllocateDir("pg_logical/snapshots");
- while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
+ while (!ShutdownRequestPending &&
+ (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
{
uint32 hi;
uint32 lo;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..55a2beb434 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
Attachment: v5-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From 2a9c103b9ce034647ec878da10c7b194ccebea20 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v5 6/8] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
---
src/backend/access/heap/rewriteheap.c | 83 +++++++++++++++++++++++----
src/backend/postmaster/checkpointer.c | 33 +++++++++++
src/backend/postmaster/custodian.c | 10 ++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/bgwriter.h | 3 +
5 files changed, 120 insertions(+), 10 deletions(-)
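The removal rule that this patch moves to the custodian depends only on the LSN embedded in each mapping file name and on the cutoff published by the checkpointer. A small standalone sketch of that decision is below. LOGICAL_REWRITE_FORMAT is copied from rewriteheap.h; mapping_is_removable() and INVALID_LSN (standing in for InvalidXLogRecPtr) are names invented for this illustration, and unlike the patch, the sketch treats an unparsable name as "keep" instead of raising an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
#define INVALID_LSN ((uint64_t) 0)  /* stand-in for InvalidXLogRecPtr */

/*
 * Decide whether a logical rewrite mapping file may be removed.
 *
 * A file is removable when no slot needs logical decoding at all (the
 * cutoff is invalid) or when the LSN in its name is older than the cutoff.
 * Names that do not match the expected format are never removed here.
 */
bool
mapping_is_removable(const char *filename, uint64_t cutoff)
{
    unsigned int dboid;
    unsigned int relid;
    unsigned int hi;
    unsigned int lo;
    unsigned int rewrite_xid;
    unsigned int create_xid;
    uint64_t    lsn;

    assert(filename != NULL);
    if (sscanf(filename, LOGICAL_REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* The LSN is stored as two 32-bit halves separated by '_'. */
    lsn = ((uint64_t) hi) << 32 | lo;

    return cutoff == INVALID_LSN || lsn < cutoff;
}
```

This is the complement of the condition that CheckPointLogicalRewriteHeap() keeps for flushing (lsn >= cutoff && cutoff != InvalidXLogRecPtr), so each file is handled by exactly one of the two code paths.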
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..c5a1103687 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,10 +116,13 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -1182,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CheckPointSetLogicalRewriteCutoff(cutoff);
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1249,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1287,65 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CheckPointGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while (!ShutdownRequestPending &&
+ (mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..666f2a0368 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -128,6 +128,9 @@ typedef struct
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ bool logical_rewrite_mappings_cutoff_set;
+
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1342,3 +1345,33 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+/*
+ * Used by CheckPointLogicalRewriteHeap() to tell the custodian which logical
+ * rewrite mapping files it can remove.
+ */
+void
+CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff)
+{
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ CheckpointerShmem->logical_rewrite_mappings_cutoff = cutoff;
+ CheckpointerShmem->logical_rewrite_mappings_cutoff_set = true;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CheckPointGetLogicalRewriteCutoff(bool *value_set)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
+ cutoff = CheckpointerShmem->logical_rewrite_mappings_cutoff;
+ *value_set = CheckpointerShmem->logical_rewrite_mappings_cutoff_set;
+ SpinLockRelease(&CheckpointerShmem->ckpt_lck);
+
+ return cutoff;
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 8591c5db9b..7f914a617f 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -36,6 +36,7 @@
#include <time.h>
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -219,6 +220,15 @@ CustodianMain(void)
*/
RemoveOldSerializedSnapshots();
+ /*
+ * Remove logical rewrite mapping files that are no longer needed.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ RemoveOldLogicalRewriteMappings();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index aa5c48f219..f493094557 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
void CheckPointLogicalRewriteHeap(void);
+void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2882efd67b..051e6732cb 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,7 @@ extern void CheckpointerShmemInit(void);
extern bool FirstCallSinceLastCheckpoint(void);
+extern void CheckPointSetLogicalRewriteCutoff(XLogRecPtr cutoff);
+extern XLogRecPtr CheckPointGetLogicalRewriteCutoff(bool *value_set);
+
#endif /* _BGWRITER_H */
--
2.25.1
Attachment: v5-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch (text/x-diff; charset=us-ascii)
From cfca62dd55d7be7e0025e5625f18d3ab9180029c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 13 Dec 2021 20:20:12 -0800
Subject: [PATCH v5 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
shutdown and end-of-recovery checkpoints.
This may save quite a bit of time when there are many mapping files
to flush to disk.
---
src/backend/access/heap/rewriteheap.c | 35 ++++++++++++++++++++++++++-
src/backend/access/transam/xlog.c | 2 +-
src/include/access/rewriteheap.h | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
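For readers unfamiliar with syncfs(), the idea is to replace one fsync() per mapping file with a single call that flushes the whole file system containing pg_logical/mappings. A minimal Linux-only sketch is below; flush_filesystem_containing() is a hypothetical helper, not part of the patch, and it reports errors through its return value instead of ereport().

```c
#define _GNU_SOURCE             /* syncfs() is a GNU/Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush all dirty data of the file system that contains path.
 * Any file or directory on the target file system can be used to
 * obtain the descriptor.  Returns 0 on success, -1 on failure.
 */
int
flush_filesystem_containing(const char *path)
{
    int         fd;
    int         rc;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    rc = syncfs(fd);

    /* Report a close() failure too, but do not hide a syncfs() failure. */
    if (close(fd) != 0 && rc == 0)
        rc = -1;
    return rc;
}
```

The trade-off is that syncfs() also writes out unrelated dirty data on the same file system, which is the IO-spike concern raised at the start of this thread and presumably why the patch limits it to shutdown and end-of-recovery checkpoints.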
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5a1103687..1a8621c0ef 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1193,7 +1193,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* ---
*/
void
-CheckPointLogicalRewriteHeap(void)
+CheckPointLogicalRewriteHeap(bool shutdown)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
@@ -1219,6 +1219,39 @@ CheckPointLogicalRewriteHeap(void)
if (ProcGlobal->custodianLatch)
SetLatch(ProcGlobal->custodianLatch);
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
+ if (shutdown)
+ {
+ int fd;
+
+ fd = OpenTransientFile("pg_logical/mappings", O_RDONLY);
+ if (fd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"pg_logical/mappings\": %m")));
+
+ if (syncfs(fd) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not synchronize file system for file \"pg_logical/mappings\": %m")));
+
+ if (CloseTransientFile(fd) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"pg_logical/mappings\": %m")));
+
+ return;
+ }
+
+#endif /* HAVE_SYNCFS */
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c4a80ea82a..6a3613fd98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6806,7 +6806,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointLogicalRewriteHeap();
+ CheckPointLogicalRewriteHeap(flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY));
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index f493094557..79cae034e5 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -52,7 +52,7 @@ typedef struct LogicalRewriteMappingData
* ---
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
-void CheckPointLogicalRewriteHeap(void);
+void CheckPointLogicalRewriteHeap(bool shutdown);
void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
--
2.25.1
Attachment: v5-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch (text/x-diff; charset=us-ascii)
From b5923b1b76a1fab6c21d6aec086219160473f464 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Fri, 11 Feb 2022 09:43:57 -0800
Subject: [PATCH v5 8/8] Move removal of spilled logical slot data to
custodian.
If there are many such files, startup can take much longer than
necessary. To handle this, startup creates a new slot directory,
copies the state file, and swaps the new directory with the old
one. The custodian then asynchronously cleans up the old slot
directory.
---
src/backend/access/transam/xlog.c | 15 +-
src/backend/postmaster/custodian.c | 14 +
.../replication/logical/reorderbuffer.c | 292 +++++++++++++++++-
src/backend/replication/slot.c | 4 +
src/include/replication/reorderbuffer.h | 1 +
5 files changed, 317 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a3613fd98..36ba3ab147 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5045,18 +5045,21 @@ StartupXLOG(void)
*/
RelationCacheInitFileRemove();
- /*
- * Initialize replication slots, before there's a chance to remove
- * required resources.
- */
- StartupReplicationSlots();
-
/*
* Startup logical state, needs to be setup now so we have proper data
* during crash recovery.
+ *
+ * NB: This also performs some important cleanup that must be done prior to
+ * other replication slot steps (e.g., StartupReplicationSlots()).
*/
StartupReorderBuffer();
+ /*
+ * Initialize replication slots, before there's a chance to remove
+ * required resources.
+ */
+ StartupReplicationSlots();
+
/*
* Startup CLOG. This must be done after ShmemVariableCache->nextXid has
* been initialized and before we accept connections or begin WAL replay.
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 7f914a617f..8cf237e63f 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -41,6 +41,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -210,6 +211,19 @@ CustodianMain(void)
*/
RemovePgTempFiles(false, false);
+ /*
+ * Remove any replication slot directories that have been staged for
+ * deletion. Since slot directories can accumulate many files, removing
+ * all of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the slot directories, and we clean them up here.
+ *
+ * Replication slot directories are not staged or cleaned in single-user
+ * mode, so we don't need any extra handling outside of the custodian
+ * process for this.
+ */
+ RemoveStagedSlotDirectories();
+
/*
* Remove serialized snapshots that are no longer required by any
* logical replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fa..ab51e41229 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -126,15 +126,19 @@
#include "access/xlog_internal.h"
#include "catalog/catalog.h"
#include "commands/sequence.h"
+#include "common/string.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/slot.h"
#include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */
#include "storage/bufmgr.h"
+#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/proc.h"
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/combocid.h"
@@ -297,12 +301,15 @@ static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared);
static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+static void ReorderBufferCleanup(const char *slotname);
static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
TransactionId xid, XLogSegNo segno);
static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
ReorderBufferTXN *txn, CommandId cid);
+static void StageSlotDirForRemoval(const char *slotname, const char *slotpath);
+static void RemoveStagedSlotDirectory(const char *path);
/*
* ---------------------------------------
@@ -4835,6 +4842,202 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
FreeDir(spill_dir);
}
+/*
+ * Clean up everything in the logical slot directory except for the "state" file.
+ * This is specially written for StartupReorderBuffer(), which has special logic
+ * to handle crashes at inconvenient times.
+ *
+ * NB: If anything except for the "state" file cannot be removed after startup,
+ * this will need to be updated.
+ */
+static void
+ReorderBufferCleanup(const char *slotname)
+{
+ char path[MAXPGPATH];
+ char newpath[MAXPGPATH];
+ char statepath[MAXPGPATH];
+ char newstatepath[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, sizeof(path), "pg_replslot/%s", slotname);
+ snprintf(newpath, sizeof(newpath), "pg_replslot/%s.new", slotname);
+ snprintf(statepath, sizeof(statepath), "pg_replslot/%s/state", slotname);
+ snprintf(newstatepath, sizeof(newstatepath), "pg_replslot/%s.new/state", slotname);
+
+ /* we're only handling directories here, skip if it's not ours */
+ if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+ return;
+
+ /*
+ * Build our new slot directory, suffixed with ".new". The caller (likely
+ * StartupReorderBuffer()) should have already ensured that any pre-existing
+ * ".new" directories leftover after a crash have been cleaned up.
+ */
+ if (MakePGDirectory(newpath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", newpath)));
+
+ copy_file(statepath, newstatepath);
+
+ fsync_fname(newstatepath, false);
+ fsync_fname(newpath, true);
+ fsync_fname("pg_replslot", true);
+
+ /*
+ * Move the slot directory aside for cleanup by the custodian. After this
+ * step, there will be no slot directory. StartupReorderBuffer() has
+ * special logic to make sure we don't lose the slot if we crash at this
+ * point.
+ */
+ StageSlotDirForRemoval(slotname, path);
+
+ /*
+ * Move our ".new" directory to become our new slot directory.
+ */
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", newpath)));
+
+ fsync_fname(path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * This function renames the given directory with a special suffix that the
+ * custodian will know to look for. An integer is appended to the end of the
+ * new directory name in case previously staged slot directories have not yet
+ * been removed.
+ */
+static void
+StageSlotDirForRemoval(const char *slotname, const char *slotpath)
+{
+ char stage_path[MAXPGPATH];
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ DIR *dir;
+
+ sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ ereport(ERROR,
+ (errmsg("could not stage \"%s\" for deletion",
+ slotpath)));
+
+ /*
+ * Rename the slot directory.
+ */
+ if (rename(slotpath, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m", slotpath)));
+
+ fsync_fname(stage_path, true);
+ fsync_fname("pg_replslot", true);
+}
+
+/*
+ * Remove slot directories that have been staged for deletion by
+ * ReorderBufferCleanup().
+ */
+void
+RemoveStagedSlotDirectories(void)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir("pg_replslot");
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, "pg_replslot")) != NULL)
+ {
+ struct stat st;
+ char path[MAXPGPATH];
+
+ if (strstr(de->d_name, ".to_remove") == NULL)
+ continue;
+
+ sprintf(path, "pg_replslot/%s", de->d_name);
+ if (lstat(path, &st) != 0)
+ ereport(ERROR,
+ (errmsg("could not stat file \"%s\": %m", path)));
+
+ if (!S_ISDIR(st.st_mode))
+ continue;
+
+ RemoveStagedSlotDirectory(path);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Removes one slot directory that has been staged for deletion by
+ * ReorderBufferCleanup(). If a shutdown request is pending, exit as soon as
+ * possible.
+ */
+static void
+RemoveStagedSlotDirectory(const char *path)
+{
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(path);
+ while (!ShutdownRequestPending &&
+ (de = ReadDir(dir, path)) != NULL)
+ {
+ struct stat st;
+ char filepath[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ sprintf(filepath, "%s/%s", path, de->d_name);
+
+ if (lstat(filepath, &st) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", filepath)));
+ else if (S_ISDIR(st.st_mode))
+ RemoveStagedSlotDirectory(filepath);
+ else if (unlink(filepath) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", filepath)));
+ }
+ FreeDir(dir);
+
+ if (rmdir(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m", path)));
+}
+
/*
* Given a replication slot, transaction ID and segment number, fill in the
* corresponding spill file into 'path', which is a caller-owned buffer of size
@@ -4863,6 +5066,83 @@ StartupReorderBuffer(void)
DIR *logical_dir;
struct dirent *logical_de;
+ /*
+ * First, handle any ".new" directories that were leftover after a crash.
+ * These are created and swapped with the actual replication slot
+ * directories so that cleanup of spilled data can be done asynchronously by
+ * the custodian.
+ */
+ logical_dir = AllocateDir("pg_replslot");
+ while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
+ {
+ char name[NAMEDATALEN];
+ char path[NAMEDATALEN + 12];
+ struct stat statbuf;
+
+ if (strcmp(logical_de->d_name, ".") == 0 ||
+ strcmp(logical_de->d_name, "..") == 0)
+ continue;
+
+ /*
+ * Make sure it's a valid ".new" directory.
+ */
+ if (!pg_str_endswith(logical_de->d_name, ".new") ||
+ strlen(logical_de->d_name) >= NAMEDATALEN + 4)
+ continue;
+
+ strncpy(name, logical_de->d_name, sizeof(name));
+ name[strlen(logical_de->d_name) - 4] = '\0';
+ if (!ReplicationSlotValidateName(name, DEBUG2))
+ continue;
+
+ sprintf(path, "pg_replslot/%s", name);
+ if (lstat(path, &statbuf) == 0)
+ {
+ if (!S_ISDIR(statbuf.st_mode))
+ continue;
+
+ /*
+ * If the original directory still exists, just delete the ".new"
+ * directory. We'll try again when we call ReorderBufferCleanup()
+ * later on.
+ */
+ if (!rmtree(path, true))
+ ereport(ERROR,
+ (errmsg("could not remove directory \"%s\"", path)));
+ }
+ else if (errno == ENOENT)
+ {
+ char newpath[NAMEDATALEN + 16];
+
+ /*
+ * If the original directory is gone, we need to rename the ".new"
+ * directory to take its place. We know that the ".new" directory
+ * is ready to be the real deal if we previously made it far enough
+ * to delete the original directory.
+ */
+ sprintf(newpath, "pg_replslot/%s", logical_de->d_name);
+ if (rename(newpath, path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ newpath, path)));
+
+ fsync_fname(path, true);
+ }
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+
+ fsync_fname("pg_replslot", true);
+ }
+ FreeDir(logical_dir);
+
+ /*
+ * Now we can proceed with deleting all spilled data. (This actually just
+ * moves the directories aside so that the custodian can clean it up
+ * asynchronously.)
+ */
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{
@@ -4875,12 +5155,18 @@ StartupReorderBuffer(void)
continue;
/*
- * ok, has to be a surviving logical slot, iterate and delete
- * everything starting with xid-*
+ * ok, has to be a surviving logical slot, delete everything except for
+ * state
*/
- ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
+ ReorderBufferCleanup(logical_de->d_name);
}
FreeDir(logical_dir);
+
+ /*
+ * Wake up the custodian so it cleans up our old slot data.
+ */
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
}
/* ---------------------------------------
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5da5fa825a..fd48d82718 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1455,6 +1455,10 @@ StartupReplicationSlots(void)
continue;
}
+ /* if it's an old slot directory that's staged for removal, ignore it */
+ if (strstr(replication_de->d_name, ".to_remove") != NULL)
+ continue;
+
/* looks like a slot in a normal state, restore */
RestoreSlotFromDisk(replication_de->d_name);
}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 859424bbd9..ff56ae0b22 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -719,6 +719,7 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
void StartupReorderBuffer(void);
+void RemoveStagedSlotDirectories(void);
bool ReorderBufferSequenceIsTransactional(ReorderBuffer *rb,
RelFileNode rnode, bool created);
--
2.25.1
Hi,
On 2022-02-16 16:50:57 -0800, Nathan Bossart wrote:
+ * The custodian process is new as of Postgres 15.
I think this kind of comment tends to age badly and not be very useful.
Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things. Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding). Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shutdown quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
Hm. This kind of policy makes it easy to introduce bugs where the occasional runs
mask forgotten notifications etc.
+ * Normal termination is by SIGTERM, which instructs the bgwriter to
+ * exit(0).
s/bgwriter/.../
Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
This doesn't really seem like useful stuff to me.
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * You might wonder why this isn't coded as an infinite loop around a
+ * PG_TRY construct. The reason is that this is the bottom of the
+ * exception stack, and so with PG_TRY there would be no exception handler
+ * in force at all during the CATCH part. By leaving the outermost setjmp
+ * always active, we have at least some chance of recovering from an error
+ * during error recovery. (If we get into an infinite loop thereby, it
+ * will soon be stopped by overflow of elog.c's internal state stack.)
+ *
+ * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+ * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
+ * signals other than SIGQUIT will be blocked until we complete error
+ * recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
+ * call redundant, but it is not since InterruptPending might be set
+ * already.
+ */
I think it's bad to copy this comment into even more places.
+ /* Since not using PG_TRY, must reset error stack by hand */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
I also think it's a bad idea to introduce even more copies of the error
handling body. I think we need to unify this. And yes, it's unfair to stick
you with it, but it's been a while since a new aux process has been added.
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
Given what you're proposing this for, are you actually confident that we don't
need more than this?
From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup.

Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
Is this actually safe? Is there a guarantee no process can access a temp table
stored in one of these? Because without WAL guaranteeing consistency, we can't
just access e.g. temp tables written before a crash.
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
I don't like functions with multiple consecutive booleans, they tend to get
swapped around. Why not just split unlink_all=true/false into different
functions?
Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal.
What if the target name already exists?
Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.

Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This should be in the prior commit message, otherwise people will ask the same
question as I did.
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
Uninterruptible loops up to INT_MAX do not seem like a good idea.
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
Why not just use stat()? That's cheaper, and there's no
time-to-check-time-to-use issue here, we're the only one writing.
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
I think this kind of lenience is just hiding bugs.
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
Just signalling without letting the custodian know what it's expected to do
strikes me as a bad idea.
From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.

This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.
- while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
+ while (!ShutdownRequestPending &&
+ (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
I really really strenuously object to these checks.
Subject: [PATCH v5 6/8] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Same wraparound concerns.
+#include "postmaster/bgwriter.h"
I think it's a bad idea to put these functions into bgwriter.h
From cfca62dd55d7be7e0025e5625f18d3ab9180029c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Mon, 13 Dec 2021 20:20:12 -0800
Subject: [PATCH v5 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
shutdown and end-of-recovery checkpoints.

This may save quite a bit of time when there are many mapping files
to flush to disk.
Seems like a mostly independent proposal.
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
I am doubtful this is a good idea. This will cause all dirty files to be
written back, even ones we don't need to be written back. At once. Very
possibly *slowing down* the shutdown.
What is even the theory of the case here? That there's so many dirty mapping
files that fsyncing them will take too long? That iterating would take too
long?
From b5923b1b76a1fab6c21d6aec086219160473f464 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Fri, 11 Feb 2022 09:43:57 -0800
Subject: [PATCH v5 8/8] Move removal of spilled logical slot data to
custodian.

If there are many such files, startup can take much longer than
necessary. To handle this, startup creates a new slot directory,
copies the state file, and swaps the new directory with the old
one. The custodian then asynchronously cleans up the old slot
directory.
You guessed it: I don't see what prevents wraparound issues.
5 files changed, 317 insertions(+), 9 deletions(-)
This seems like such an increase in complexity and fragility that I really doubt
this is a good idea.
+/*
+ * This function renames the given directory with a special suffix that the
+ * custodian will know to look for. An integer is appended to the end of the
+ * new directory name in case previously staged slot directories have not yet
+ * been removed.
+ */
+static void
+StageSlotDirForRemoval(const char *slotname, const char *slotpath)
+{
+ char stage_path[MAXPGPATH];
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ DIR *dir;
+
+ sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ stage_path)));
+ }
+ FreeDir(dir);
+
+ stage_path[0] = '\0';
+ }
Copy of "find free name" logic.
Greetings,
Andres Freund
Hi Andres,
I appreciate the feedback.
On Wed, Feb 16, 2022 at 05:50:52PM -0800, Andres Freund wrote:
+ /* Since not using PG_TRY, must reset error stack by hand */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
I also think it's a bad idea to introduce even more copies of the error
handling body. I think we need to unify this. And yes, it's unfair to stick
you with it, but it's been a while since a new aux process has been added.
+1, I think this is useful refactoring. I might spin this off to its own
thread.
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
Given what you're proposing this for, are you actually confident that we don't
need more than this?
I will give this a closer look.
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
I don't like functions with multiple consecutive booleans, they tend to get
swapped around. Why not just split unlink_all=true/false into different
functions?
Will do.
Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal.
What if the target name already exists?
The integer at the end of the target name is incremented until we find a
unique name.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This should be in the prior commit message, otherwise people will ask the same
question as I did.
Will do.
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
Uninterruptible loops up to INT_MAX do not seem like a good idea.
I modeled this after ChooseRelationName() in indexcmds.c. Looking again, I
see that it loops forever until a unique name is found. I suspect this is
unlikely to be a problem in practice. What strategy would you recommend
for choosing a unique name? Should we just append a couple of random
characters?
+ dir = AllocateDir(stage_path);
+ if (dir == NULL)
+ {
Why not just use stat()? That's cheaper, and there's no
time-to-check-time-to-use issue here, we're the only one writing.
I'm not sure why I didn't use stat(). I will update this.
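For illustration only, here is a minimal standalone sketch of what a stat()-based, bounded search for a free staging name could look like. It is plain POSIX C rather than backend code; the function name choose_stage_path() and the max_attempts parameter are assumptions made for this example and do not appear in the patch set. Unlike the ereport(LOG) variant questioned above, any stat() failure other than ENOENT is treated as a hard failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: look for an unused "<parent>/<prefix><n>" name.
 *
 * On success, fills buf with the free name and returns n (>= 0).  Returns -1
 * and empties buf if the name would not fit, if stat() fails for a reason
 * other than ENOENT, or if all max_attempts candidates already exist.
 */
static int
choose_stage_path(char *buf, size_t buflen, const char *parent,
                  const char *prefix, int max_attempts)
{
    assert(buf != NULL && buflen > 0);

    for (int n = 0; n < max_attempts; n++)
    {
        struct stat st;
        int         len = snprintf(buf, buflen, "%s/%s%d", parent, prefix, n);

        if (len < 0 || (size_t) len >= buflen)
            break;              /* name would be truncated */

        if (stat(buf, &st) != 0)
        {
            if (errno == ENOENT)
                return n;       /* nothing there: this name is free */
            break;              /* unexpected error: fail, don't hide it */
        }
    }

    buf[0] = '\0';
    return -1;
}
```

The bound keeps the loop finite without relying on integer overflow. In the backend the caller would presumably raise an error when -1 comes back.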
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.
The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks. Another option might be
to just exit immediately when the custodian receives a shutdown request.
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage && ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
Just signalling without letting the custodian know what it's expected to do
strikes me as a bad idea.
Good point. I will work on that.
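One possible shape for this, sketched as standalone C11 rather than backend code: the requester records which task it wants in a shared bitmask before waking the worker, and the worker atomically takes and clears the whole set. Every name here (the task flags, pending_tasks, request_task(), take_requested_tasks()) is hypothetical, and the static variable merely stands in for a field that would live in shared memory.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical task identifiers; these names are not from the patch set. */
enum custodian_task
{
    TASK_REMOVE_TEMP_FILES = 1u << 0,
    TASK_REMOVE_SNAPSHOTS = 1u << 1,
    TASK_REMOVE_MAPPINGS = 1u << 2
};

/* Stand-in for a field that would live in shared memory. */
static atomic_uint pending_tasks;

/* Requester side: record what is wanted before waking the worker. */
static void
request_task(unsigned task)
{
    atomic_fetch_or(&pending_tasks, task);
    /* a real backend would set the worker's latch at this point */
}

/* Worker side: atomically take the whole set of requests and clear it. */
static unsigned
take_requested_tasks(void)
{
    return atomic_exchange(&pending_tasks, 0u);
}
```

With something along these lines, a wakeup with an empty mask is distinguishable from a real request, and repeated requests for the same task collapse into one.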
From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.

This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 2 --
src/backend/postmaster/custodian.c | 11 +++++++++++
src/backend/replication/logical/snapbuild.c | 13 +++++++------
src/include/replication/snapbuild.h | 2 +-
4 files changed, 19 insertions(+), 9 deletions(-)
Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.
Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound risk. These files are tied to LSNs, and
AFAIK they won't impact slots' xmins.
+#ifdef HAVE_SYNCFS
+
+ /*
+ * If we are doing a shutdown or end-of-recovery checkpoint, let's use
+ * syncfs() to flush the mappings to disk instead of flushing each one
+ * individually. This may save us quite a bit of time when there are many
+ * such files to flush.
+ */
I am doubtful this is a good idea. This will cause all dirty files to be
written back, even ones we don't need to be written back. At once. Very
possibly *slowing down* the shutdown.
What is even the theory of the case here? That there's so many dirty mapping
files that fsyncing them will take too long? That iterating would take too
long?
Well, yes. My idea was to model this after 61752af, which allows using
syncfs() instead of individually fsync-ing every file in the data
directory. However, I would likely need to introduce a GUC because 1) as
you pointed out, it might be slower and 2) syncfs() doesn't report errors
on older versions of Linux.
TBH I do feel like this one is a bit of a stretch, so I am okay with
leaving it out for now.
5 files changed, 317 insertions(+), 9 deletions(-)
This seems such an increase in complexity and fragility that I really doubt
this is a good idea.
I think that's a fair point. I'm okay with leaving this one out for now,
too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.
The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks. Another option might be
to just exit immediately when the custodian receives a shutdown request.
I think we should just not do either of these and let the functions
finish. For the cases where shutdown really needs to be immediate
there's, uhm, immediate mode shutdowns.
Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.
Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound risk. These files are tied to LSNs, and
AFAIK they won't impact slots' xmins.
They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...
Andres
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while (!ShutdownRequestPending &&
+ (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.
The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks. Another option might be
to just exit immediately when the custodian receives a shutdown request.
I think we should just not do either of these and let the functions
finish. For the cases where shutdown really needs to be immediate
there's, uhm, immediate mode shutdowns.
Alright.
Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.
Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound risk. These files are tied to LSNs, and
AFAIK they won't impact slots' xmins.
They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...
Okay, that part makes sense to me. However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse. I've done quite a bit of reading, and I
haven't found anything that seems intended to prevent this problem. Do you
have any pointers?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...
Okay, that part makes sense to me. However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse.
Right now cleanup happens every checkpoint. So cleanup can't be deferred all
that far. We currently include a bunch of 32-bit xids inside checkpoints, so
if they're rarer than once every 2^31-1 xids, we're in trouble independent of
logical decoding.
But with this patch cleanup of logical decoding mapping files (and other
pieces) can be *indefinitely* deferred, without being noticeable.
One possible way to improve this would be to switch the on-disk filenames to
be based on 64-bit xids. But that might also present some problems (file name
length, cost of converting 32-bit xids to 64-bit xids).
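To make the conversion concrete, here is a rough standalone sketch of widening a 32-bit xid using a nearby 64-bit reference value, plus a file name keyed by the widened value. The helper names and the "xid-....snap" name format are invented for this example and are not the backend's actual API or on-disk naming. The sketch assumes the 32-bit xid lies within 2^31 of the reference and that the unsigned-to-signed cast behaves as two's complement (implementation-defined in C, though true on common platforms).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: widen a 32-bit xid to 64 bits by assuming it lies
 * within 2^31 of a known 64-bit reference xid.
 */
static uint64_t
widen_xid(uint64_t reference, uint32_t xid)
{
    /* modular distance from the reference, reinterpreted as signed */
    int32_t     delta = (int32_t) (xid - (uint32_t) reference);

    return reference + (uint64_t) (int64_t) delta;
}

/*
 * Hypothetical file name keyed by the full 64-bit xid (16 hex digits), so
 * that the same 32-bit xid from two different epochs maps to two names.
 */
static int
xid_file_name(char *buf, size_t buflen, uint64_t full_xid)
{
    return snprintf(buf, buflen, "xid-%016" PRIX64 ".snap", full_xid);
}
```

The name grows from 8 to 16 hex digits per xid, which is the file name length cost mentioned above. The widening itself is cheap arithmetic, but it needs a trustworthy reference xid wherever names are built.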
I've done quite a bit of reading, and I haven't found anything that seems
intended to prevent this problem. Do you have any pointers?
I don't know if we have an iron-clad enforcement of checkpoints happening
every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
etc. But it'd be good to have something better than that.
Greetings,
Andres Freund
On Thu, Feb 17, 2022 at 11:27:09AM -0800, Andres Freund wrote:
On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...
Okay, that part makes sense to me. However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse.
Right now cleanup happens every checkpoint. So cleanup can't be deferred all
that far. We currently include a bunch of 32-bit xids inside checkpoints, so
if they're rarer than once every 2^31-1 xids, we're in trouble independent of
logical decoding.
But with this patch cleanup of logical decoding mapping files (and other
pieces) can be *indefinitely* deferred, without being noticeable.
I see. The custodian should ordinarily remove the files as quickly as
possible. In fact, I bet it will typically line up with checkpoints for
most users, as the checkpointer will set the latch. However, if there are
many temporary files to clean up, removing the logical decoding files could
be delayed for some time, as you said.
One possible way to improve this would be to switch the on-disk filenames to
be based on 64bit xids. But that might also present some problems (file name
length, cost of converting 32bit xids to 64bit xids).
Okay.
I've done quite a bit of reading, and I haven't found anything that seems
intended to prevent this problem. Do you have any pointers?
I don't know if we have an iron-clad enforcement of checkpoints happening
every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
etc. But it'd be good to have something better than that.
Okay. So IIUC the problem might already exist today, but offloading these
tasks to a separate process could make it more likely.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-02-17 13:00:22 -0800, Nathan Bossart wrote:
Okay. So IIUC the problem might already exist today, but offloading these
tasks to a separate process could make it more likely.
Vastly more, yes. Before, checkpoints not happening would be a form of
backpressure (though not a great one). You can't cancel them without triggering a
crash-restart. Whereas custodian can be cancelled etc.
As I said before, I think this is tackling things from the wrong end. Instead
of moving the sometimes expensive task out of the way while leaving it expensive,
the focus should be on making the expensive task cheaper.
As far as I understand, the primary concern is logical decoding serialized
snapshots, because a lot of them can accumulate if there e.g. is an old unused
/ far behind slot. It should be easy to reduce the number of those snapshots
by e.g. eliding some redundant ones. Perhaps we could also make backends in
logical decoding occasionally do a bit of cleanup themselves.
I've not seen reports of the number of mapping files to be a real issue?
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
Greetings,
Andres Freund
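The "rename the directory out of the way, clean up later" idea described above can be sketched as follows. This is only an illustration using plain POSIX calls, not the patch set's code: the function names stage_dir_for_cleanup and remove_staged_dir are hypothetical, and a real implementation would use PostgreSQL's own file-access wrappers and error reporting.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Step 1 (cheap, can run before connections are allowed): move the old
 * directory aside and recreate an empty one in its place.
 * Returns 0 on success, -1 on failure.
 */
int
stage_dir_for_cleanup(const char *dir, const char *staged)
{
    if (rename(dir, staged) != 0)
        return -1;
    return mkdir(dir, 0700);
}

/*
 * Step 2 (potentially O(n), can run later): unlink every entry in the
 * staged directory, then remove the directory itself. Entries are
 * assumed to be plain files; a subdirectory makes unlink() fail.
 * Returns the number of files removed, or -1 on failure.
 */
int
remove_staged_dir(const char *staged)
{
    DIR        *d = opendir(staged);
    struct dirent *de;
    char        path[4096];
    int         removed = 0;

    if (d == NULL)
        return -1;

    while ((de = readdir(d)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(path, sizeof(path), "%s/%s", staged, de->d_name);
        if (unlink(path) != 0)
        {
            closedir(d);
            return -1;
        }
        removed++;
    }
    closedir(d);

    return (rmdir(staged) == 0) ? removed : -1;
}
```

The point of the split is that only the rename and mkdir sit on the startup path; the per-file work can be done by whichever process is chosen once the server is up.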
On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
As far as I understand, the primary concern is logical decoding serialized
snapshots, because a lot of them can accumulate if there e.g. is an old unused
/ far behind slot. It should be easy to reduce the number of those snapshots
by e.g. eliding some redundant ones. Perhaps we could also make backends in
logical decoding occasionally do a bit of cleanup themselves.
I've not seen reports of the number of mapping files being a real issue?
I routinely see all four of these tasks impacting customers, but I'd say
the most common one is the temporary file cleanup. Besides eliminating
some redundant files and having backends perform some cleanup, what do you
think about skipping the logical decoding cleanup during
end-of-recovery/shutdown checkpoints? This was something that Bharath
brought up a while back [0]. As I noted in that thread, startup and
shutdown could still take a while if checkpoints are regularly delayed due
to logical decoding cleanup, but that might still help avoid a bit of
downtime.
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
Perhaps this is a good place to start. As I mentioned above, IME the
temporary file cleanup is the most common problem, so I think even getting
that one fixed would be a huge improvement.
[0]: /messages/by-id/CALj2ACXkkSL8EBpR7m=Mt=yRGBhevcCs3x4fsp3Bc-D13yyHOg@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-02-17 14:58:38 -0800, Nathan Bossart wrote:
On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
As far as I understand, the primary concern is logical decoding serialized
snapshots, because a lot of them can accumulate if there e.g. is an old unused
/ far behind slot. It should be easy to reduce the number of those snapshots
by e.g. eliding some redundant ones. Perhaps we could also make backends in
logical decoding occasionally do a bit of cleanup themselves.
I've not seen reports of the number of mapping files being a real issue?
I routinely see all four of these tasks impacting customers, but I'd say
the most common one is the temporary file cleanup.
I took temp file cleanup and StartupReorderBuffer() "out of consideration" for
custodian, because they're not needed during normal running.
Besides eliminating some redundant files and having backends perform some
cleanup, what do you think about skipping the logical decoding cleanup
during end-of-recovery/shutdown checkpoints?
I strongly disagree with it. Then you might never get the cleanup done, but
keep on operating until you hit corruption issues.
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
Perhaps this is a good place to start. As I mentioned above, IME the
temporary file cleanup is the most common problem, so I think even getting
that one fixed would be a huge improvement.
Cool.
Greetings,
Andres Freund
On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
Perhaps this is a good place to start. As I mentioned above, IME the
temporary file cleanup is the most common problem, so I think even getting
that one fixed would be a huge improvement.
Cool.
Hm. How should this work for standbys? I can think of the following
options:
1. Do temporary file cleanup in the postmaster (as it does today).
2. Pause after allowing connections to clean up temporary files.
3. Do small amounts of temporary file cleanup whenever there is an
opportunity during recovery.
4. Wait until recovery completes before cleaning up temporary files.
I'm not too thrilled about any of these options.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
BTW I know you don't like the dedicated process approach, but one
improvement to that approach could be to shut down the custodian process
when it has nothing to do.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
It seems unlikely that anything discussed in this thread will be committed
for v15, so I've adjusted the commitfest entry to v16 and moved it to the
next commitfest.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, 18 Feb 2022 at 20:51, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.
BTW I know you don't like the dedicated process approach, but one
improvement to that approach could be to shut down the custodian process
when it has nothing to do.
Having a central cleanup process makes a lot of sense. There is a long
list of potential tasks for such a process. My understanding is that
autovacuum already has an interface for handling additional workload
types, which is how BRIN indexes are handled. Do we really need a new
process? If so, let's do this now.
Nathan's point that certain tasks are blocking fast startup is a good
one and higher availability is a critical end goal. The thought that
we should complete these tasks during checkpoint is a good one, but
checkpoints should NOT be delayed by long running tasks that are
secondary to availability.
Andres' point that it would be better to avoid long running tasks is
good, if that is possible. That can be done better over time. This
point does not block the higher level goal of better availability
asap, so I support Nathan's overall proposals.
--
Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
Having a central cleanup process makes a lot of sense. There is a long
list of potential tasks for such a process. My understanding is that
autovacuum already has an interface for handling additional workload
types, which is how BRIN indexes are handled. Do we really need a new
process?
It seems to me that if there's a long list of possible tasks for such
a process, that's actually a trickier situation than if there were
only one or two, because it may happen that when task X is really
urgent, the process is already busy with task Y.
I don't think that piggybacking more stuff onto autovacuum is a very
good idea for this exact reason. We already know that autovacuum
workers can get so busy that they can't keep up with the need to
vacuum and analyze tables. If we give them more things to do, that
figures to make it worse, at least on busy systems.
I do agree that a general mechanism for getting cleanup tasks done in
the background could be a useful thing to have, but I feel like it's
hard to see exactly how to make it work well. We can't just allow it
to spin up a million new processes, but at the same time, if it can't
guarantee that time-critical tasks get performed relatively quickly,
it's pretty worthless.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 23 Jun 2022 at 14:46, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
Having a central cleanup process makes a lot of sense. There is a long
list of potential tasks for such a process. My understanding is that
autovacuum already has an interface for handling additional workload
types, which is how BRIN indexes are handled. Do we really need a new
process?
It seems to me that if there's a long list of possible tasks for such
a process, that's actually a trickier situation than if there were
only one or two, because it may happen that when task X is really
urgent, the process is already busy with task Y.
I don't think that piggybacking more stuff onto autovacuum is a very
good idea for this exact reason. We already know that autovacuum
workers can get so busy that they can't keep up with the need to
vacuum and analyze tables. If we give them more things to do, that
figures to make it worse, at least on busy systems.
I do agree that a general mechanism for getting cleanup tasks done in
the background could be a useful thing to have, but I feel like it's
hard to see exactly how to make it work well. We can't just allow it
to spin up a million new processes, but at the same time, if it can't
guarantee that time-critical tasks get performed relatively quickly,
it's pretty worthless.
Most of the tasks mentioned aren't time critical.
I have no objection to a new auxiliary process to execute those tasks,
which can be spawned when needed.
--
Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jun 23, 2022 at 09:46:28AM -0400, Robert Haas wrote:
I do agree that a general mechanism for getting cleanup tasks done in
the background could be a useful thing to have, but I feel like it's
hard to see exactly how to make it work well. We can't just allow it
to spin up a million new processes, but at the same time, if it can't
guarantee that time-critical tasks get performed relatively quickly,
it's pretty worthless.
My intent with this new auxiliary process is to offload tasks that aren't
particularly time-critical. They are only time-critical in the sense that
1) you might eventually run out of space and 2) you might encounter
wraparound with the logical replication files. But AFAICT these same risks
exist today in the checkpointer approach, although maybe not to the same
extent. In any case, 2 seems solvable to me outside of this patch set.
I'm grateful for the discussion in this thread so far, but I'm not seeing a
clear path forward. I'm glad to see threads like the one to stop doing
end-of-recovery checkpoints [0], but I don't know if it will be possible to
solve all of these availability concerns in a piecemeal fashion. I remain
open to exploring other suggested approaches beyond creating a new
auxiliary process.
[0]: /messages/by-id/CA+TgmobrM2jvkiccCS9NgFcdjNSgAvk1qcAPx5S6F+oJT3D2mQ@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm grateful for the discussion in this thread so far, but I'm not seeing a
clear path forward.
+1 to add the new auxiliary process.
--
Simon Riggs http://www.EnterpriseDB.com/
On Fri, Jun 24, 2022 at 11:45:22AM +0100, Simon Riggs wrote:
On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm grateful for the discussion in this thread so far, but I'm not seeing a
clear path forward.
+1 to add the new auxiliary process.
I went ahead and put together a new patch set for this in which I've
attempted to address most of the feedback from upthread. Notably, I've
abandoned 0007 and 0008, added a way for processes to request specific
tasks for the custodian, and removed all the checks for
ShutdownRequestPending.
I haven't addressed the existing transaction ID wraparound risk with the
logical replication files. My instinct is that this deserves its own
thread, and it might need to be considered a prerequisite to this change
based on the prior discussion here.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
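The "request specific tasks" mechanism described in this message boils down to a shared bitmask that requesters set and the worker drains. Below is a minimal standalone sketch of that pattern. It swaps in C11 atomics where the patch uses a spinlock-protected field in shared memory, and the names request_task, take_requested_tasks, and the TASK_* flags are hypothetical stand-ins for the patch's RequestCustodian() and CUSTODIAN_* flags.

```c
#include <stdatomic.h>

/* Hypothetical task flags, mirroring the shape of the CUSTODIAN_* flags */
#define TASK_REMOVE_TEMP_FILES            0x0001u
#define TASK_REMOVE_SERIALIZED_SNAPSHOTS  0x0002u
#define TASK_REMOVE_REWRITE_MAPPINGS      0x0004u

/* Static storage, so this starts out as zero (no tasks pending) */
static atomic_uint pending_tasks;

/* Called by any requester: mark one or more tasks as pending. */
void
request_task(unsigned int flags)
{
    atomic_fetch_or(&pending_tasks, flags);
}

/*
 * Called by the worker at the top of its loop: fetch the pending set and
 * clear it in one step, so requests that arrive while the worker is busy
 * are kept for the next iteration rather than lost.
 */
unsigned int
take_requested_tasks(void)
{
    return atomic_exchange(&pending_tasks, 0u);
}
```

Requesting the same task twice before the worker wakes up collapses into a single run, which suits cleanup work that is idempotent.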
Attachments:
v6-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 12 ++++++++++++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 3 ++-
src/include/replication/snapbuild.h | 2 +-
5 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8764084e21..621bda0844 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -75,13 +75,13 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/basebackup.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6840,10 +6840,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a0ec94ea5c..861de882c6 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -31,6 +31,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -210,6 +211,17 @@ CustodianMain(void)
if (flags & CUSTODIAN_REMOVE_TEMP_FILES)
RemovePgTempFiles(false, false);
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ if (flags & CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS)
+ RemoveOldSerializedSnapshots();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1119a12db9..42eb064bd8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1911,14 +1911,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index f6dcd9ddef..769c07f2c9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,6 +18,7 @@ extern void CustodianShmemInit(void);
extern void RequestCustodian(int flags);
/* flags for RequestCustodian() */
-#define CUSTODIAN_REMOVE_TEMP_FILES 0x0001
+#define CUSTODIAN_REMOVE_TEMP_FILES 0x0001
+#define CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS 0x0002
#endif /* _CUSTODIAN_H */
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..55a2beb434 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
v6-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From 0add8bb19a4ee83c6a6ec1f313329d737bf304a5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v6 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
---
src/backend/access/heap/rewriteheap.c | 79 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 44 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 5 ++
4 files changed, 119 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..edeab65e60 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1182,7 +1183,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1212,10 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CustodianSetLogicalRewriteCutoff(cutoff);
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1246,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1284,64 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 861de882c6..0ce4edcf61 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -27,6 +27,7 @@
#include <time.h>
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -46,6 +47,9 @@ typedef struct
{
slock_t cust_lck;
int cust_flags;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
+ bool logical_rewrite_mappings_cutoff_set;
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -222,6 +226,16 @@ CustodianMain(void)
if (flags & CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS)
RemoveOldSerializedSnapshots();
+ /*
+ * Remove logical rewrite mapping files that are no longer needed.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
+ if (flags & CUSTODIAN_REMOVE_REWRITE_MAPPINGS)
+ RemoveOldLogicalRewriteMappings();
+
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
elapsed_secs = end_time - start_time;
@@ -274,3 +288,33 @@ RequestCustodian(int flags)
if (ProcGlobal->custodianLatch)
SetLatch(ProcGlobal->custodianLatch);
}
+
+/*
+ * Used by CheckPointLogicalRewriteHeap() to tell the custodian which logical
+ * rewrite mapping files it can remove.
+ */
+void
+CustodianSetLogicalRewriteCutoff(XLogRecPtr cutoff)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = cutoff;
+ CustodianShmem->logical_rewrite_mappings_cutoff_set = true;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(bool *value_set)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ *value_set = CustodianShmem->logical_rewrite_mappings_cutoff_set;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 3e27790b3f..61d7aa8ed8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 769c07f2c9..1af96ebd02 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,13 +12,18 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(int flags);
+extern void CustodianSetLogicalRewriteCutoff(XLogRecPtr cutoff);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(bool *value_set);
/* flags for RequestCustodian() */
#define CUSTODIAN_REMOVE_TEMP_FILES 0x0001
#define CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS 0x0002
+#define CUSTODIAN_REMOVE_REWRITE_MAPPINGS 0x0004
#endif /* _CUSTODIAN_H */
--
2.25.1
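The removal rule that the patch above moves into the custodian can be shown in isolation: parse the LSN out of a mapping file name, then compare it with the cutoff. The sketch below is standalone and simplified. The format string is the one shown in the patch; the function names are hypothetical, plain unsigned integers stand in for Oid, TransactionId, and XLogRecPtr, and 0 stands in for InvalidXLogRecPtr.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in the patch */
#define REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/*
 * Extract the LSN encoded in a mapping file name.
 * Returns false if the name does not match the expected format.
 */
bool
mapping_file_lsn(const char *name, uint64_t *lsn)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;

    if (sscanf(name, REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* The LSN is stored as two 32-bit halves separated by '_' */
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/*
 * A mapping file can be removed when there is no valid cutoff (0 here)
 * or when its LSN is older than the cutoff.
 */
bool
mapping_file_is_removable(uint64_t lsn, uint64_t cutoff)
{
    return cutoff == 0 || lsn < cutoff;
}
```

For a made-up name such as "map-4000-4eb-1_60DE1E08-5c5-5c6", the parsed LSN is 0x160DE1E08, so the file would be removable with a cutoff of 0x200000000 but kept with a cutoff of 0x100000000.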
v6-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 68e3005c14ba116e372a1724dad079914108ab2d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v6 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 252 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 20 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
12 files changed, 345 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 39ac4490db..620a0b1bae 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..db00282658
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,252 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). Also, ensure that any offloaded
+ * tasks are either not required during single-user mode or are performed
+ * separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested. The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <time.h>
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300) /* 5 minutes */
+
+typedef struct
+{
+ slock_t cust_lck;
+ int cust_flags;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * On startup and after an exception, we won't know exactly what tasks need
+ * to be performed, so request all of them.
+ */
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->cust_flags = 0xFFFFFFFF;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ pg_time_t start_time;
+ pg_time_t end_time;
+ int elapsed_secs;
+ int cur_timeout;
+ int flags;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ start_time = (pg_time_t) time(NULL);
+
+ /* Obtain requested tasks */
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ flags = CustodianShmem->cust_flags;
+ CustodianShmem->cust_flags = 0;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* TODO: offloaded tasks go here */
+
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ */
+void
+RequestCustodian(int flags)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->cust_flags |= flags;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dde4bc25b1..5162ee9dec 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,6 +251,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -548,6 +549,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1823,13 +1825,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2769,6 +2774,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3089,6 +3096,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3182,6 +3191,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3639,6 +3662,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3816,6 +3851,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3853,6 +3891,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3942,6 +3981,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4135,6 +4175,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1a6f527051..b19d743cab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -129,6 +130,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -277,6 +279,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f..469768c4e4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b25bd0e583..66bf42e5b1 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0af130fbc5..ffe9404c68 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -330,6 +330,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -433,6 +434,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -445,6 +447,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..c95a7c7de6
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,20 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(int flags);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..467421e371 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -394,6 +394,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -411,11 +413,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec75..7524e197e5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
v6-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From 758003bef540e1174e381f6fd8cdb73dde13cab6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v6 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5162ee9dec..e67370012f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1127,7 +1127,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..aa6ac8f219 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3160,7 +3160,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3176,7 +3176,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3209,7 +3209,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3247,13 +3247,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3271,6 +3265,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..67a6ef4dbf 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
v6-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 174 ++++++++++++++++++++++++----
src/include/storage/fd.h | 2 +-
3 files changed, 160 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e67370012f..82aa0c6307 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4053,7 +4054,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
- RemovePgTempFiles();
+ {
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index aa6ac8f219..79ca3a5be9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -112,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_DIR_TO_REMOVE_PREFIX (PG_TEMP_FILES_DIR "_to_remove_")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3133,24 +3137,20 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
* Remove temporary and temporary relation files left over from a prior
* postmaster session
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * If stage is true, this function will simply rename all pgsql_tmp directories
+ * to stage them for removal at a later time. If stage is false, this function
+ * will delete all files in the staged directories as well as the directories
+ * themselves.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * If remove_relation_files is true, this function will remove the temporary
+ * relation files. Otherwise, this step is skipped.
*
* NOTE: this function and its subroutines generally report syscall failures
* with ereport(LOG) and keep going. Removing temp files is not so critical
* that we should fail to start the database when we can't do it.
*/
void
-RemovePgTempFiles(void)
+RemovePgTempFiles(bool stage, bool remove_relation_files)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3159,9 +3159,16 @@ RemovePgTempFiles(void)
/*
* First process temp files in pg_default ($PGDATA/base)
*/
- snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
- RemovePgTempRelationFiles("base");
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ RemoveStagedPgTempDirs("base");
+
+ if (remove_relation_files)
+ RemovePgTempRelationFiles("base");
/*
* Cycle through temp directories for all non-default tablespaces.
@@ -3174,13 +3181,26 @@ RemovePgTempFiles(void)
strcmp(spc_de->d_name, "..") == 0)
continue;
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ if (stage)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY,
+ PG_TEMP_FILES_DIR);
+ StagePgTempDirForRemoval(temp_path);
+ }
+ else
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
- snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
- spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- RemovePgTempRelationFiles(temp_path);
+ if (remove_relation_files)
+ {
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemovePgTempRelationFiles(temp_path);
+ }
}
FreeDir(spc_dir);
@@ -3194,7 +3214,119 @@ RemovePgTempFiles(void)
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ struct stat statbuf;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n < INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ if (stat(stage_path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", stage_path)));
+ return;
+ }
+
+ stage_path[0] = '\0';
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ dir = AllocateDir(spc_dir);
+ while ((de = ReadDirExtended(dir, spc_dir, LOG)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_DIR_TO_REMOVE_PREFIX,
+ strlen(PG_TEMP_DIR_TO_REMOVE_PREFIX)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", spc_dir, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
* Any other problem results in a LOG message. (missing_ok should be true at
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 67a6ef4dbf..3b0d6f62d6 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,7 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
-extern void RemovePgTempFiles(void);
+extern void RemovePgTempFiles(bool stage, bool remove_relation_files);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
--
2.25.1
v6-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 43042799b96b588a446c509637b5acf570e2a325 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v6 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 14 +++++++++++++-
src/backend/postmaster/postmaster.c | 14 +++++++++-----
src/backend/storage/file/fd.c | 21 +++++++++++++++------
src/include/postmaster/custodian.h | 3 +++
4 files changed, 40 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index db00282658..a0ec94ea5c 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -196,7 +196,19 @@ CustodianMain(void)
CustodianShmem->cust_flags = 0;
SpinLockRelease(&CustodianShmem->cust_lck);
- /* TODO: offloaded tasks go here */
+ /*
+ * Remove any pgsql_tmp directories that have been staged for deletion.
+ * Since pgsql_tmp directories can accumulate many files, removing all
+ * of the files during startup (which we used to do) can take a very
+ * long time. To avoid delaying startup, we simply have startup rename
+ * the temporary directories, and we clean them up here.
+ *
+ * pgsql_tmp directories are not staged or cleaned in single-user mode,
+ * so we don't need any extra handling outside of the custodian process
+ * for this.
+ */
+ if (flags & CUSTODIAN_REMOVE_TEMP_FILES)
+ RemovePgTempFiles(false, false);
/* Calculate how long to sleep */
end_time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 82aa0c6307..b67f8828df 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1401,9 +1401,11 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4052,12 +4054,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
RemovePgTempFiles(true, true);
- RemovePgTempFiles(false, false);
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 79ca3a5be9..46dc1925a2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,6 +97,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1640,9 +1641,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1840,9 +1841,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
@@ -3211,6 +3212,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
* would create a race condition. It's done separately, earlier in
* postmaster startup.
*/
+
+ /*
+ * If we just staged some pgsql_tmp directories for removal, wake up the
+ * custodian process so that it deletes all the files in the staged
+ * directories as well as the directories themselves.
+ */
+ if (stage)
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES);
}
/*
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index c95a7c7de6..f6dcd9ddef 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -17,4 +17,7 @@ extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(int flags);
+/* flags for RequestCustodian() */
+#define CUSTODIAN_REMOVE_TEMP_FILES 0x0001
+
#endif /* _CUSTODIAN_H */
--
2.25.1
Hi,
On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
+ /* Obtain requested tasks */
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ flags = CustodianShmem->cust_flags;
+ CustodianShmem->cust_flags = 0;
+ SpinLockRelease(&CustodianShmem->cust_lck);
Just resetting the flags to 0 is problematic. Consider what happens if there
are two tasks and the one processed first errors out. You'll lose information
about needing to run the second task.
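To make the concern concrete, here is a minimal standalone sketch (not taken from the patch) of clearing only the bit for the task that is about to run, so that a failure leaves the remaining requests in place. The names run_pending_tasks, task_that_fails and task_that_works are hypothetical, and the sketch leaves out the spinlock and the ereport()-based error handling that the real code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef bool (*task_fn) (void);	/* returns false if the task failed */

/* Hypothetical example tasks, for illustration only. */
static bool task_that_fails(void) { return false; }
static bool task_that_works(void) { return true; }

/*
 * Run every task whose bit is set in *pending.  Bit i corresponds to fns[i].
 * Only the bit of the task about to run is cleared, so when a task fails,
 * the bits of the tasks behind it are still set for the next call.
 * Returns true if all requested tasks ran without failure.
 */
static bool
run_pending_tasks(uint32_t *pending, const task_fn *fns, int nfns)
{
	assert(nfns >= 0 && nfns <= 32);

	for (int i = 0; i < nfns; i++)
	{
		uint32_t	bit = UINT32_C(1) << i;

		if ((*pending & bit) == 0)
			continue;

		*pending &= ~bit;		/* clear just this request */
		if (!fns[i] ())
			return false;		/* later bits remain set */
	}
	return true;
}
```

With two requested tasks where the first one fails, the first call returns false and the second task's bit is still set, so a later call can pick it up.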
+ /* TODO: offloaded tasks go here */
Seems we're going to need some sorting of which tasks are most "urgent" / need
to be processed next if we plan to make this into some generic facility.
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ */
+void
+RequestCustodian(int flags)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->cust_flags |= flags;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
I don't think we should have this thing wake up on a regular basis. We're
doing way too much of that already, and I don't think we should add
more. Either we need a list of times when tasks need to be processed and wake
up at that time, or just wake up if somebody requests a task.
From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be renamed to stage them for
removal. Then, all files in pgsql_tmp are removed before removing
the staged directories themselves. This change is being made in
preparation for a follow-up change to offload most temporary file
cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
---
src/backend/postmaster/postmaster.c | 8 +-
src/backend/storage/file/fd.c | 174 ++++++++++++++++++++++++----
src/include/storage/fd.h | 2 +-
3 files changed, 160 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e67370012f..82aa0c6307 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
This is imo hard to read and easy to get wrong. Make it multiple functions or
pass named flags in.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
It doesn't seem great to need to iterate through a directory that contains
other files, potentially a significant number. How about having a
staged_for_removal/ directory, and then only scanning that?
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ DIR *dir;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ struct stat statbuf;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ dir = AllocateDir(tmp_dir);
+ if (dir == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m", tmp_dir)));
+ return;
+ }
+ FreeDir(dir);
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ if (stat(stage_path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", stage_path)));
+ return;
+ }
+
+ stage_path[0] = '\0';
I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
we prevent conflicts by adding a timestamp or such?
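As a rough illustration of the timestamp idea, the following standalone sketch builds a staging name from a caller-supplied microsecond timestamp instead of probing integers with stat(). It is not code from the patch: build_stage_path and STAGE_PREFIX are hypothetical names, and a real implementation would take the timestamp from the server's own clock functions and would still have to decide what to do if the rename target already exists.

```c
#include <stdio.h>
#include <string.h>

#define STAGE_PREFIX "pgsql_tmp_staged_"	/* hypothetical prefix */

/*
 * Build "<parent>/<STAGE_PREFIX><timestamp>" in buf.  An empty parent is
 * treated as the current directory, mirroring the handling in the patch.
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int
build_stage_path(char *buf, size_t buflen, const char *parent,
				 long long stamp_usecs)
{
	int			n;

	if (strlen(parent) == 0)
		parent = ".";

	n = snprintf(buf, buflen, "%s/%s%lld", parent, STAGE_PREFIX, stamp_usecs);

	/* snprintf reports the length it wanted; reject truncated names */
	return (n > 0 && (size_t) n < buflen) ? 0 : -1;
}
```

A single name is computed up front, so there is no loop to interrupt. Two stagings within the same microsecond in the same parent directory would still collide, which is why the failure path matters.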
+ }
+
+ /*
+ * In the unlikely event that we couldn't find a name for the stage
+ * directory, bail out.
+ */
+ if (stage_path[0] == '\0')
+ {
+ ereport(LOG,
+ (errmsg("could not stage \"%s\" for deletion",
+ tmp_dir)));
+ return;
+ }
That's imo very much not ok. Just continuing in unexpected situations is a
recipe for introducing bugs / being hard to debug.
From 43042799b96b588a446c509637b5acf570e2a325 Mon Sep 17 00:00:00 2001
From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it.
As mentioned before, having it done as part of checkpoints provides pretty
decent wraparound protection - yes, it's not theoretically perfect, but in
reality it's very unlikely you can have an xid wraparound within one
checkpoint. I've mentioned this before, so at the very least I'd like to see
this acknowledged in the commit message.
However, if there are many snapshots to remove, it can significantly extend
checkpoint time.
I'd really like to see a reproducer or profile for this...
+ /*
+ * Remove serialized snapshots that are no longer required by any
+ * logical replication slot.
+ *
+ * It is not important for these to be removed in single-user mode, so
+ * we don't need any extra handling outside of the custodian process for
+ * this.
+ */
I don't think this claim is correct.
From 0add8bb19a4ee83c6a6ec1f313329d737bf304a5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v6 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
As above I'd like to know why this could take that long. What are you doing
that there's so many mapping files (which only exist for catalog tables!) that
this is a significant fraction of a checkpoint?
---
src/backend/access/heap/rewriteheap.c | 79 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 44 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 5 ++
4 files changed, 119 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..edeab65e60 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1182,7 +1183,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1212,10 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ CustodianSetLogicalRewriteCutoff(cutoff);
Setting this variable in a custodian datastructure and then fetching it from
there seems architecturally wrong to me.
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
What about single user mode?
ISTM that RequestCustodian() needs to either assert out if called in single
user mode, or execute tasks immediately in that context.
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
What interference could there be?
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
Afaics nothing clears value_set - is that a good idea?
Greetings,
Andres Freund
Hi Andres,
Thanks for the prompt review.
On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
+ /* Obtain requested tasks */
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ flags = CustodianShmem->cust_flags;
+ CustodianShmem->cust_flags = 0;
+ SpinLockRelease(&CustodianShmem->cust_lck);
Just resetting the flags to 0 is problematic. Consider what happens if there
are two tasks and the one processed first errors out. You'll lose information
about needing to run the second task.
I think we also want to retry any failed tasks. The way v6 handles this is
by requesting all tasks after an exception. Another way to handle this
could be to reset each individual flag before the task is executed, and
then we could surround each one with a PG_CATCH block that resets the flag.
I'll do it this way in the next revision.
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ */
+void
+RequestCustodian(int flags)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->cust_flags |= flags;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.
At the moment, the idea is that nothing should need to wait for a task
because the custodian only handles things that are relatively non-critical.
If that changes, this could probably be expanded to look more like
RequestCheckpoint().
What would you suggest using instead of a mix of flags?
+ /* Calculate how long to sleep */
+ end_time = (pg_time_t) time(NULL);
+ elapsed_secs = end_time - start_time;
+ if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
+ continue; /* no sleep for us */
+ cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ cur_timeout * 1000L /* convert to ms */ ,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
I don't think we should have this thing wake up on a regular basis. We're
doing way too much of that already, and I don't think we should add
more. Either we need a list of times when tasks need to be processed and wake
up at that time, or just wake up if somebody requests a task.
I agree. I will remove the timeout in the next revision.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e67370012f..82aa0c6307 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
- RemovePgTempFiles();
+ RemovePgTempFiles(true, true);
+ RemovePgTempFiles(false, false);
This is imo hard to read and easy to get wrong. Make it multiple functions or
pass named flags in.
Will do.
+ * StagePgTempDirForRemoval
+ *
+ * This function renames the given directory with a special prefix that
+ * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
+ * the end of the new directory name in case previously staged pgsql_tmp
+ * directories have not yet been removed.
+ */
It doesn't seem great to need to iterate through a directory that contains
other files, potentially a significant number. How about having a
staged_for_removal/ directory, and then only scanning that?
Yeah, that seems like a good idea. Will do.
+ /*
+ * Find a name for the stage directory. We just increment an integer at the
+ * end of the name until we find one that doesn't exist.
+ */
+ for (int n = 0; n <= INT_MAX; n++)
+ {
+ snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
+ PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
+
+ if (stat(stage_path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ break;
+
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", stage_path)));
+ return;
+ }
+
+ stage_path[0] = '\0';
I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
we prevent conflicts by adding a timestamp or such?
I suppose it's highly unlikely that we'd see a conflict if we used the
timestamp instead. I'll do it this way in the next revision if that seems
good enough.
From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it.
As mentioned before, having it done as part of checkpoints provides pretty
decent wraparound protection - yes, it's not theoretically perfect, but in
reality it's very unlikely you can have an xid wraparound within one
checkpoint. I've mentioned this before, so at the very least I'd like to see
this acknowledged in the commit message.
Will do.
+ /* let the custodian know what it can remove */
+ CustodianSetLogicalRewriteCutoff(cutoff);
Setting this variable in a custodian datastructure and then fetching it from
there seems architecturally wrong to me.
Where do you think it should go? I previously had it in the checkpointer's
shared memory, but you didn't like that the functions were declared in
bgwriter.h (along with the other checkpoint stuff). If the checkpointer
shared memory is the right place, should we create checkpointer.h and use
that instead?
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
What about single user mode?
ISTM that RequestCustodian() needs to either assert out if called in single
user mode, or execute tasks immediately in that context.
I like the idea of executing the tasks immediately since that's what
happens today in single-user mode. I will try doing it that way.
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
What interference could there be?
My concern is that the custodian could obtain a later cutoff than what the
checkpointer does, which might cause files to be concurrently unlinked and
fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a
problem. This could probably be better explained in this comment.
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
Afaics nothing clears value_set - is that a good idea?
I'm using value_set to differentiate the case where InvalidXLogRecPtr means
the checkpointer hasn't determined a value yet versus the case where it
has. In the former, we don't want to take any action. In the latter, we
want to unlink all the files. Since we're moving to a request model for
the custodian, I might be able to remove this value_set stuff completely.
If that's not possible, it probably deserves a better comment.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-07-03 10:07:54 -0700, Nathan Bossart wrote:
Thanks for the prompt review.
On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
+ /* Obtain requested tasks */
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ flags = CustodianShmem->cust_flags;
+ CustodianShmem->cust_flags = 0;
+ SpinLockRelease(&CustodianShmem->cust_lck);
Just resetting the flags to 0 is problematic. Consider what happens if there
are two tasks and the one processed first errors out. You'll lose information
about needing to run the second task.
I think we also want to retry any failed tasks.
I don't think so, at least not if it's just going to retry that task straight
away - then we'll get stuck on that one task forever. If we had the ability to
"queue" it at the end, to be processed after other already dequeued tasks, it'd
be a different story.
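A minimal standalone sketch of that "retry at the back of the queue" behaviour might look as follows. It is an illustration only, not the patch's implementation: TaskQueue, queue_push and queue_pop are hypothetical names, tasks are plain integers, and there is no locking. Duplicate requests are dropped, so the ring can never hold more than NUM_TASKS entries.

```c
#include <assert.h>

#define NUM_TASKS 3			/* hypothetical number of distinct tasks */
#define NO_TASK (-1)

typedef struct
{
	int			elems[NUM_TASKS];	/* ring buffer of task ids */
	int			head;				/* index of the next task to run */
	int			count;				/* number of queued tasks */
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
	q->head = 0;
	q->count = 0;
}

/* Append a task at the tail unless it is already queued. */
static void
queue_push(TaskQueue *q, int task)
{
	assert(task >= 0 && task < NUM_TASKS);

	for (int i = 0; i < q->count; i++)
	{
		if (q->elems[(q->head + i) % NUM_TASKS] == task)
			return;				/* already requested */
	}

	q->elems[(q->head + q->count) % NUM_TASKS] = task;
	q->count++;
}

/* Remove and return the task at the head, or NO_TASK if the queue is empty. */
static int
queue_pop(TaskQueue *q)
{
	int			task;

	if (q->count == 0)
		return NO_TASK;

	task = q->elems[q->head];
	q->head = (q->head + 1) % NUM_TASKS;
	q->count--;
	return task;
}
```

A worker that pops a task, sees it fail, and pushes it again places it behind everything that was already waiting, so one persistently failing task cannot starve the others.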
The way v6 handles this is by requesting all tasks after an exception.
Ick. That strikes me as a bad idea.
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ */
+void
+RequestCustodian(int flags)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->cust_flags |= flags;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.
At the moment, the idea is that nothing should need to wait for a task
because the custodian only handles things that are relatively non-critical.
Which is just plainly not true as the patchset stands...
I think we're going to have to block if some cleanup as part of a checkpoint
hasn't been completed by the next checkpoint - otherwise it'll just end up
being way too confusing and there's absolutely no backpressure anymore.
If that changes, this could probably be expanded to look more like
RequestCheckpoint().
What would you suggest using instead of a mix of flags?
I suspect an array of tasks with requested and completed counters or such?
With a condition variable to wait on?
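One way to read the counter suggestion is sketched below. This is a single-process illustration under stated assumptions, not a proposal for the actual shared-memory layout: the names TaskCounters, task_request, task_begin, task_finish and task_done are hypothetical, and a real version would protect the counters with a lock and have waiters sleep on a condition variable rather than poll task_done().

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 2			/* hypothetical number of task types */

typedef struct
{
	uint64_t	requested;	/* bumped by every request */
	uint64_t	completed;	/* highest request number known to be served */
} TaskCounters;

static TaskCounters task_counters[NUM_TASKS];

/* Requester: register a request and get a ticket that can be waited on. */
static uint64_t
task_request(int task)
{
	return ++task_counters[task].requested;
}

/* Worker: snapshot the request counter before starting the work. */
static uint64_t
task_begin(int task)
{
	return task_counters[task].requested;
}

/* Worker: publish that everything up to the snapshot has been served. */
static void
task_finish(int task, uint64_t snapshot)
{
	task_counters[task].completed = snapshot;
}

/* Is there any request that has not been served yet? */
static bool
task_pending(int task)
{
	return task_counters[task].completed < task_counters[task].requested;
}

/* Has the request identified by this ticket been served? */
static bool
task_done(int task, uint64_t ticket)
{
	return task_counters[task].completed >= ticket;
}
```

Because the worker publishes the snapshot it took before starting, a request that arrives while the task is running is not reported as done; it stays pending until a later run, which is the property a checkpoint would need in order to block on unfinished cleanup.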
+ /* let the custodian know what it can remove */
+ CustodianSetLogicalRewriteCutoff(cutoff);
Setting this variable in a custodian datastructure and then fetching it from
there seems architecturally wrong to me.
Where do you think it should go? I previously had it in the checkpointer's
shared memory, but you didn't like that the functions were declared in
bgwriter.h (along with the other checkpoint stuff). If the checkpointer
shared memory is the right place, should we create checkpointer.h and use
that instead?
Well, so far I have not understood what the whole point of the shared state
is, so I have a bit of a hard time answering this ;)
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
+ * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
+ * disk.
+ */
What interference could there be?
My concern is that the custodian could obtain a later cutoff than what the
checkpointer does, which might cause files to be concurrently unlinked and
fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a
problem. This could probably be better explained in this comment.
How about having a Datum argument to RequestCustodian() that is forwarded to
the task?
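For illustration, forwarding a per-request argument could look roughly like the standalone sketch below. It is an assumption-laden sketch rather than the patch's API: TaskArg is a 64-bit integer standing in for Datum, request_task and run_requested_tasks are hypothetical names, the "task" merely records the cutoff it was handed, and locking is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t TaskArg;	/* stand-in for Datum in this sketch */

typedef enum
{
	TASK_REMOVE_MAPPINGS,	/* hypothetical task */
	NUM_TASKS
} Task;

typedef void (*task_fn) (TaskArg arg);

typedef struct
{
	bool		requested;
	TaskArg		arg;		/* argument of the most recent request */
} TaskSlot;

static TaskSlot task_slots[NUM_TASKS];
static uint64_t last_cutoff_seen;	/* what the example task received */

/* Example task: a real one would unlink files older than the cutoff. */
static void
remove_mappings(TaskArg arg)
{
	last_cutoff_seen = arg;
}

static const task_fn task_table[NUM_TASKS] = {
	[TASK_REMOVE_MAPPINGS] = remove_mappings,
};

/* Requester side: remember the argument together with the request. */
static int
request_task(Task task, TaskArg arg)
{
	if ((int) task < 0 || (int) task >= NUM_TASKS)
		return -1;

	task_slots[task].arg = arg;
	task_slots[task].requested = true;
	return 0;
}

/* Worker side: run each requested task with its stored argument. */
static int
run_requested_tasks(void)
{
	int			ran = 0;

	for (int t = 0; t < NUM_TASKS; t++)
	{
		if (!task_slots[t].requested)
			continue;

		task_slots[t].requested = false;
		task_table[t] (task_slots[t].arg);
		ran++;
	}
	return ran;
}
```

Keeping the argument next to the request means the task never has to recompute the cutoff itself, so it always works from the value the requester chose. In this sketch a newer request simply overwrites the argument of an older, not yet served one.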
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+ bool value_set = false;
+
+ cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
+ if (!value_set)
+ return;
Afaics nothing clears value_set - is that a good idea?
I'm using value_set to differentiate the case where InvalidXLogRecPtr means
the checkpointer hasn't determined a value yet versus the case where it
has. In the former, we don't want to take any action. In the latter, we
want to unlink all the files. Since we're moving to a request model for
the custodian, I might be able to remove this value_set stuff completely.
If that's not possible, it probably deserves a better comment.
It would.
Greetings,
Andres Freund
Here's a new revision where I've attempted to address all the feedback I've
received thus far. Notably, the custodian now uses a queue for registering
tasks and determining which tasks to execute. Other changes include
splitting the temporary file functions apart to avoid consecutive boolean
flags, using a timestamp instead of an integer for the staging name for
temporary directories, moving temporary directories to a dedicated
directory so that the custodian doesn't need to scan relation files,
ERROR-ing when something goes wrong when cleaning up temporary files,
executing requested tasks immediately in single-user mode, and more.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v7-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From f2d5205b7fab3c1dccbe25829c9b46ba26b3cd9f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v7 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
12 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 39ac4490db..620a0b1bae 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not look up functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d7257e4056..1f707d64ac 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,6 +251,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -548,6 +549,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1822,13 +1824,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2768,6 +2773,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3088,6 +3095,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3181,6 +3190,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3638,6 +3661,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3815,6 +3850,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3852,6 +3890,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3941,6 +3980,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4134,6 +4174,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1a6f527051..b19d743cab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -129,6 +130,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -277,6 +279,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f..469768c4e4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..1210c2e7a3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0af130fbc5..ffe9404c68 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -330,6 +330,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -433,6 +434,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -445,6 +447,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..467421e371 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -394,6 +394,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -411,11 +413,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec75..7524e197e5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
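For readers skimming 0001, the queue behind CustodianEnqueueTask() and CustodianGetNextTask() can be sketched outside the server roughly as follows. This is a minimal single-process sketch, not the patch's code: the Task enum, the TaskQueue struct, and the task_queue_* functions are hypothetical names, and the spinlock and shared memory are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's CustodianTask enum. */
typedef enum
{
    TASK_A,
    TASK_B,
    TASK_C,
    NUM_TASKS,          /* number of real tasks; also the queue capacity */
    INVALID_TASK        /* marks an empty slot */
} Task;

typedef struct
{
    Task        elems[NUM_TASKS];
    int         head;   /* index of the next slot to dequeue */
} TaskQueue;

static void
task_queue_init(TaskQueue *q)
{
    q->head = 0;
    for (int i = 0; i < NUM_TASKS; i++)
        q->elems[i] = INVALID_TASK;
}

/*
 * Scan from the head; stop at the first slot that is empty or already holds
 * this task.  A task therefore occupies at most one slot at a time.
 */
static bool
task_queue_enqueue(TaskQueue *q, Task task)
{
    assert((int) task >= 0 && (int) task < NUM_TASKS);

    for (int i = 0; i < NUM_TASKS; i++)
    {
        int         idx = (q->head + i) % NUM_TASKS;

        if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return true;
        }
    }
    return false;       /* not reachable while each task uses one slot */
}

/* Pop the head slot; returns INVALID_TASK when the queue is empty. */
static Task
task_queue_next(TaskQueue *q)
{
    Task        next = q->elems[q->head];

    q->elems[q->head] = INVALID_TASK;
    q->head = (q->head + 1) % NUM_TASKS;
    return next;
}
```

Since a task can occupy at most one slot, a queue with NUM_TASKS slots cannot fill up, which appears to be why the patch treats a full queue as a can't-happen error.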
v7-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From 69f4ef946ca7d0925d30f18dc7ed2da27f65111d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v7 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directories themselves. This
commit removes the directories as well, in preparation for future
commits that will move temporary file cleanup to a separate auxiliary
process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1f707d64ac..2f34bf7e55 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1126,7 +1126,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f904f60c08..7e1660eabf 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3097,7 +3097,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3113,7 +3113,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3146,7 +3146,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3184,13 +3184,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3208,6 +3202,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..079176b153 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
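The behavioral change in 0002 is that the rmdir() moves from the recursive caller into the function itself, so the top-level directory goes away too. A standalone sketch of that shape, using plain POSIX calls, might look like the following. It is an illustration only: remove_dir_recursive() is a hypothetical name, it returns an error code instead of calling ereport(), and it omits the temp-file prefix check that unlink_all controls in RemovePgTempDir().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Remove everything under "path", then "path" itself.
 * Returns 0 on success (a missing directory counts as success), -1 on error.
 */
static int
remove_dir_recursive(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    int         rc = 0;

    if (dir == NULL)
        return (errno == ENOENT) ? 0 : -1;

    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char        child[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        if (snprintf(child, sizeof(child), "%s/%s",
                     path, de->d_name) >= (int) sizeof(child))
            rc = -1;            /* path too long for the buffer */
        else if (lstat(child, &st) < 0)
            rc = -1;
        else if (S_ISDIR(st.st_mode))
            rc = remove_dir_recursive(child);   /* also removes child itself */
        else
            rc = unlink(child);
    }
    closedir(dir);

    /* The directory is removed here, at every level including the top. */
    if (rc == 0)
        rc = rmdir(path);
    return rc;
}
```

One visible consequence, and the reason the TAP test above changes, is that a query listing base/pgsql_tmp would fail once the directory no longer exists, so the test looks for the directory's entry in base instead.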
v7-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From edd79eb3e45c6eac190e00962d7d4f8ec01a3eaa Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v7 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also modifies several ereport(LOG, ...) calls within
the temporary file cleanup code to ERROR instead. While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 214 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 181 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2f34bf7e55..50f348c42c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1401,6 +1401,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4052,7 +4053,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7e1660eabf..02c48a668b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -76,6 +76,7 @@
#include <sys/file.h>
#include <sys/param.h>
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -90,6 +91,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -112,6 +114,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +342,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3065,29 +3071,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3097,7 +3098,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3105,7 +3107,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3113,7 +3115,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3121,21 +3123,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
+
+ /*
+ * Cycle through temp directories for all non-default tablespaces.
+ */
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
+}
+
+/*
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
*/
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3157,7 +3298,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3174,12 +3315,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
struct stat statbuf;
if (lstat(rm_path, &statbuf) < 0)
- {
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", rm_path)));
- continue;
- }
if (S_ISDIR(statbuf.st_mode))
{
@@ -3189,14 +3327,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3204,7 +3342,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3220,7 +3358,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3248,7 +3386,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3257,7 +3395,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 079176b153..2efe3d236d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,6 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
v7-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From cd58094d9623a57bfe15a5790120fba694523b0f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v7 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
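The "rename now, delete later" idea in the commit message above can be sketched outside of PostgreSQL roughly as follows. This is only an illustration under stated assumptions: stage_dir_for_removal(), the "pgsql_tmp.<stamp>" naming, and the fixed buffer size are hypothetical and are not taken from the patch, which uses its own helpers and path macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patch): move "tmpdir" into "stagedir"
 * under a unique name so that another process can delete its contents later.
 *
 * Returns 0 on success or if tmpdir does not exist, -1 on any other error.
 */
int
stage_dir_for_removal(const char *tmpdir, const char *stagedir,
                      unsigned long stamp)
{
    char        staged[1024];
    int         n;

    assert(tmpdir != NULL && stagedir != NULL);

    /* Create the staging directory; it is fine if it already exists. */
    if (mkdir(stagedir, 0700) < 0 && errno != EEXIST)
        return -1;

    /* Build a unique destination name (the stamp is supplied by the caller). */
    n = snprintf(staged, sizeof(staged), "%s/pgsql_tmp.%lu", stagedir, stamp);
    if (n < 0 || (size_t) n >= sizeof(staged))
        return -1;

    /* A single rename() replaces the per-file unlink() loop at startup. */
    if (rename(tmpdir, staged) < 0)
        return (errno == ENOENT) ? 0 : -1;

    return 0;
}
```

The point of the design is that the caller pays for one rename() regardless of how many files the directory holds; the O(n) unlink work is left to whoever later walks the staging directory.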
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 50f348c42c..ed120eb836 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -112,6 +112,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1400,9 +1401,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4051,12 +4055,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4068,6 +4074,14 @@ PostmasterStateMachine(void)
reset_shared();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 02c48a668b..2f93f71d44 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -99,6 +99,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1579,9 +1580,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1779,9 +1780,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
v7-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From b1b98f80ce936c651b82b26253bd524b35475fe5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v7 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
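For readers unfamiliar with the pattern, registering a task in a sentinel-terminated function table and looking it up by task ID can be sketched as below. This is a standalone illustration, not the custodian code: every Demo*/demo_* name is hypothetical, and the lookup here returns NULL for an unknown task where the real code is expected to raise an error instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task IDs, mirroring the shape of the CustodianTask enum. */
typedef enum
{
    DEMO_REMOVE_TEMP_FILES,
    DEMO_REMOVE_SNAPSHOTS,
    DEMO_NUM_TASKS,             /* new tasks go above */
    DEMO_INVALID_TASK
} DemoTask;

typedef void (*DemoTaskFunction) (void);

typedef struct
{
    DemoTask    task;
    DemoTaskFunction task_func; /* performs the task */
} DemoTaskEntry;

/* Counters so that callers can observe which task ran. */
int         demo_temp_runs = 0;
int         demo_snapshot_runs = 0;

static void demo_remove_temp_files(void) { demo_temp_runs++; }
static void demo_remove_snapshots(void) { demo_snapshot_runs++; }

/* The table ends with a sentinel entry, so no separate length is needed. */
static const DemoTaskEntry demo_task_functions[] = {
    {DEMO_REMOVE_TEMP_FILES, demo_remove_temp_files},
    {DEMO_REMOVE_SNAPSHOTS, demo_remove_snapshots},
    {DEMO_INVALID_TASK, NULL}   /* must be last */
};

/* Walk the table until the sentinel; return NULL if the task is unknown. */
const DemoTaskEntry *
demo_lookup(DemoTask task)
{
    const DemoTaskEntry *entry;

    for (entry = demo_task_functions; entry->task != DEMO_INVALID_TASK; entry++)
    {
        if (entry->task == task)
            return entry;
    }
    return NULL;
}

/* Run the function registered for "task"; returns -1 if none is registered. */
int
demo_run(DemoTask task)
{
    const DemoTaskEntry *entry = demo_lookup(task);

    if (entry == NULL)
        return -1;
    assert(entry->task_func != NULL);
    entry->task_func();
    return 0;
}
```

With a table like this, adding a task is a two-line change (one enum value, one table row), which matches the shape of the diffs to custodian.c and custodian.h in this patch.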
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b2f240228..f08a18d273 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -75,13 +75,13 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/basebackup.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6837,10 +6837,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 73c0f15214..b945744e9c 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1911,14 +1911,13 @@ snapshot_not_interesting:
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index d179251aad..55a2beb434 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
v7-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From f784eef06614d1edf9bb2ec924545845e58c149c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v7 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
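As a rough illustration of the removal rule this patch moves into the custodian, the sketch below parses a mapping file name with LOGICAL_REWRITE_FORMAT and applies the same cutoff test. It is a standalone approximation: mapping_is_removable() is a hypothetical name, plain unsigned integers stand in for Oid, TransactionId and XLogRecPtr, and a cutoff of 0 plays the role of InvalidXLogRecPtr.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same format string as in src/include/access/rewriteheap.h. */
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/*
 * Hypothetical helper (not part of the patch).
 *
 * Returns 1 if the mapping file may be removed, 0 if it must be kept, and
 * -1 if the name does not look like a mapping file.  A cutoff of 0 stands
 * in for InvalidXLogRecPtr, in which case every mapping is removable.
 */
int
mapping_is_removable(const char *name, uint64_t cutoff)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;
    uint64_t    lsn;

    assert(name != NULL);

    /* Skip over files that cannot be ours. */
    if (strncmp(name, "map-", 4) != 0)
        return -1;

    if (sscanf(name, LOGICAL_REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return -1;

    /* The LSN is stored in the name as two 32-bit halves. */
    lsn = ((uint64_t) hi) << 32 | lo;

    /* Keep mappings at or beyond a valid cutoff; remove everything else. */
    if (lsn >= cutoff && cutoff != 0)
        return 0;
    return 1;
}
```

For example, "map-1-2-0_10-3-4" encodes LSN 0x10, so it is removable with a cutoff of 0x20 but must be kept with a cutoff of 0x10.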
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 197f06b5ec..6de1fb19de 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1182,7 +1184,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1213,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1248,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1286,61 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 353cbb2924..965372b5ff 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
On Wed, Jul 06, 2022 at 09:51:10AM -0700, Nathan Bossart wrote:
Here's a new revision where I've attempted to address all the feedback I've
received thus far. Notably, the custodian now uses a queue for registering
tasks and determining which tasks to execute. Other changes include
splitting the temporary file functions apart to avoid consecutive boolean
flags, using a timestamp instead of an integer for the staging name for
temporary directories, moving temporary directories to a dedicated
directory so that the custodian doesn't need to scan relation files,
ERROR-ing when something goes wrong when cleaning up temporary files,
executing requested tasks immediately in single-user mode, and more.
Here is a rebased patch set for cfbot. There are no other differences
between v7 and v8.
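For anyone skimming the thread, the de-duplicating task queue mentioned above can be sketched in isolation as follows. This is a simplified, single-process illustration with no locking; the Demo* and demo_queue_* names are hypothetical, and the dequeue side in particular is my own guess at a matching counterpart rather than a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    DEMO_TASK_A,
    DEMO_TASK_B,
    DEMO_TASK_C,
    DEMO_NUM_TASKS,             /* new tasks go above */
    DEMO_INVALID_TASK
} DemoTask;

/* One slot per task type is enough because duplicates are never stored. */
typedef struct
{
    DemoTask    elems[DEMO_NUM_TASKS];
    int         head;
} DemoQueue;

void
demo_queue_init(DemoQueue *q)
{
    for (int i = 0; i < DEMO_NUM_TASKS; i++)
        q->elems[i] = DEMO_INVALID_TASK;
    q->head = 0;
}

/*
 * Scan from the head: stop at the first empty slot or at a slot that already
 * holds this task.  Queued entries are contiguous from the head, so an
 * already-queued task is always found before the first empty slot.
 */
bool
demo_queue_enqueue(DemoQueue *q, DemoTask task)
{
    assert((int) task >= 0 && (int) task < DEMO_NUM_TASKS);

    for (int i = 0; i < DEMO_NUM_TASKS; i++)
    {
        int         idx = (q->head + i) % DEMO_NUM_TASKS;

        if (q->elems[idx] == DEMO_INVALID_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return true;
        }
    }
    return false;               /* unreachable while the invariant holds */
}

/* Pop the task at the head, or return DEMO_INVALID_TASK if the queue is empty. */
DemoTask
demo_queue_dequeue(DemoQueue *q)
{
    DemoTask    task = q->elems[q->head];

    if (task != DEMO_INVALID_TASK)
    {
        q->elems[q->head] = DEMO_INVALID_TASK;
        q->head = (q->head + 1) % DEMO_NUM_TASKS;
    }
    return task;
}
```

Requesting the same task twice before it runs leaves a single queue entry, so a burst of requests cannot overflow the fixed-size array.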
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From a0b28421a7a170598f6e60b2c17a8d49fb0ffd55 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v8 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
12 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ elog(ERROR, "could not enqueue custodian task %d", task);
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 81cb585891..d705ff6bf0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,6 +251,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -547,6 +548,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1826,13 +1828,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2755,6 +2760,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3075,6 +3082,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3168,6 +3177,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3625,6 +3648,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3802,6 +3837,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3839,6 +3877,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3928,6 +3967,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4122,6 +4162,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1a6f527051..b19d743cab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -129,6 +130,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -277,6 +279,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9b..d8e6ea45bc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd973ba613..22037f0d99 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -273,6 +273,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 067b729d5a..ffd59616b7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -322,6 +322,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
@@ -425,6 +426,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -437,6 +439,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..467421e371 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -394,6 +394,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -411,11 +413,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e0..58455dc016 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
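
For readers who want to see the queueing logic of CustodianEnqueueTask() and CustodianGetNextTask() in isolation, here is a minimal standalone sketch. It is not the patch's code: the names Task, TaskQueue, queue_init, queue_enqueue and queue_next are hypothetical, and the spinlock and shared memory are left out on the assumption of a single process. It only illustrates the idea that each task occupies at most one slot, so a ring with one slot per task type cannot fill up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task list; the patch's enum is CustodianTask. */
typedef enum
{
	TASK_A,
	TASK_B,
	TASK_C,
	NUM_TASKS,					/* new tasks go above */
	INVALID_TASK
} Task;

/* One slot per task type, plus the index of the next slot to dequeue. */
typedef struct
{
	Task		elems[NUM_TASKS];
	int			head;
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
	for (int i = 0; i < NUM_TASKS; i++)
		q->elems[i] = INVALID_TASK;
	q->head = 0;
}

/*
 * Scan from the head for either an empty slot or a slot that already holds
 * this task.  Returns false only if neither exists, which cannot happen as
 * long as duplicates are collapsed.  No locking in this sketch.
 */
static bool
queue_enqueue(TaskQueue *q, Task task)
{
	assert(task >= 0 && task < NUM_TASKS);

	for (int i = 0; i < NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_TASKS;

		if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return true;
		}
	}
	return false;
}

/* Pop the head slot; returns INVALID_TASK when nothing is queued. */
static Task
queue_next(TaskQueue *q)
{
	Task		next = q->elems[q->head];

	q->elems[q->head] = INVALID_TASK;
	q->head = (q->head + 1) % NUM_TASKS;
	return next;
}
```

In this sketch, enqueueing TASK_B, TASK_A and then TASK_B again leaves two queued entries, which come back in the order TASK_B, TASK_A, followed by INVALID_TASK once the queue is drained.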
v8-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
From 2494ee62efd35fa8cd3e09a208af76288ab2a851 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v8 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directories themselves. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d705ff6bf0..c8c8df2bc8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1129,7 +1129,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index efb34d4dcb..c8a4a26385 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3085,7 +3085,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3101,7 +3101,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3134,7 +3134,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3172,13 +3172,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3196,6 +3190,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..079176b153 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
v8-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
From 969bf1e227ceca4d158b9195911520787f71f29a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v8 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also modifies several ereport(LOG, ...) calls within
the temporary file cleanup code to ERROR instead. While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 214 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 181 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c8c8df2bc8..c3a466552e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1404,6 +1404,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4039,7 +4040,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index c8a4a26385..a9312b83aa 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -76,6 +76,7 @@
#include <sys/file.h>
#include <sys/param.h>
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -90,6 +91,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -112,6 +114,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -338,6 +342,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3053,29 +3059,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3085,7 +3086,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3093,7 +3095,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3101,7 +3103,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3109,21 +3111,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
+
+ /*
+ * Cycle through temp directories for all non-default tablespaces.
+ */
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
+}
+
+/*
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
*/
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3145,7 +3286,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3162,12 +3303,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
struct stat statbuf;
if (lstat(rm_path, &statbuf) < 0)
- {
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", rm_path)));
- continue;
- }
if (S_ISDIR(statbuf.st_mode))
{
@@ -3177,14 +3315,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3192,7 +3330,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3208,7 +3346,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3236,7 +3374,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3245,7 +3383,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 079176b153..2efe3d236d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,6 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
v8-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 87f0b1462ad7bb3e7ed1c221619b1422575fa3ea Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v8 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c3a466552e..151375be03 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -112,6 +112,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1403,9 +1404,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4038,12 +4042,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4056,6 +4062,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a9312b83aa..c705a77e46 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -99,6 +99,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1567,9 +1568,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1767,9 +1768,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
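The stage-then-remove idea in 0004 can be tried outside the server. The following is only a minimal sketch under stated assumptions, not the patch's code: the function names and the ".staged" suffix are made up for illustration, and it uses plain POSIX calls instead of AllocateDir()/ReadDir(). The first function is the cheap step that startup would perform (a single rename), and the second is the O(n) step that would be left to a background process.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Stage a directory for later removal by renaming it.  The ".staged" suffix
 * is a hypothetical convention for this sketch only.  Returns 0 on success
 * and -1 on failure with errno set.
 */
static int
stage_dir_for_removal(const char *dir, char *staged, size_t stagedlen)
{
	int			n;

	assert(dir != NULL && staged != NULL);

	n = snprintf(staged, stagedlen, "%s.staged", dir);
	if (n < 0 || (size_t) n >= stagedlen)
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* One rename, no matter how many files the directory holds. */
	return rename(dir, staged);
}

/*
 * Remove every regular file in a staged directory, then the directory
 * itself.  Returns the number of files removed, or -1 on the first error.
 * Anything that is not a regular file is treated as an error.
 */
static int
remove_staged_dir(const char *staged)
{
	DIR		   *d;
	struct dirent *de;
	char		path[1024];
	int			removed = 0;

	assert(staged != NULL);

	d = opendir(staged);
	if (d == NULL)
		return -1;

	while ((de = readdir(d)) != NULL)
	{
		struct stat statbuf;
		int			n;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		n = snprintf(path, sizeof(path), "%s/%s", staged, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(path) ||
			lstat(path, &statbuf) < 0 ||
			!S_ISREG(statbuf.st_mode) ||
			unlink(path) < 0)
		{
			closedir(d);
			return -1;
		}
		removed++;
	}
	closedir(d);

	if (rmdir(staged) < 0)
		return -1;
	return removed;
}
```

A caller would invoke stage_dir_for_removal() on the startup path and hand the staged name to whatever runs remove_staged_dir() later. The sketch only covers the rename-then-delete ordering. It does not attempt the crash-safety or tablespace handling that the real patches have to deal with.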
v8-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From e6b00d5df4954a00bd47cfc64ba02b2c1888937b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v8 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9cedd6876f..72645f1fe6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6846,10 +6846,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1ff2c12240..abafdb52b2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2014,14 +2014,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index e6adea24f2..e1de013ece 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
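Since 0005 makes CheckPointGuts() issue a request on every checkpoint, it matters that the custodian's queue ignores a task that is already enqueued. Here is a standalone sketch of that behavior, assuming a single process and no locking: the type and function names are hypothetical, and the real code keeps the array in shared memory under a spinlock. It has one slot per task type, scans from the head, and stores the task in the first slot that is empty or already holds the same task.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the CustodianTask values. */
typedef enum
{
	TASK_REMOVE_TEMP_FILES,
	TASK_REMOVE_SERIALIZED_SNAPSHOTS,
	NUM_TASKS,					/* new tasks go above */
	INVALID_TASK
} Task;

typedef struct
{
	Task		elems[NUM_TASKS];	/* one slot per task type */
	int			head;				/* index of the next task to run */
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
	q->head = 0;
	for (int i = 0; i < NUM_TASKS; i++)
		q->elems[i] = INVALID_TASK;
}

/*
 * Enqueue a task unless it is already queued.  Returns false only if no slot
 * could be found, which should not happen with one slot per task type.
 */
static bool
queue_enqueue(TaskQueue *q, Task task)
{
	assert((int) task >= 0 && task < NUM_TASKS);

	for (int i = 0; i < NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_TASKS;

		if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return true;
		}
	}
	return false;
}

/*
 * Dequeue the task at the head, or return INVALID_TASK if the head slot is
 * empty.  The head advances in both cases.
 */
static Task
queue_dequeue(TaskQueue *q)
{
	Task		next = q->elems[q->head];

	q->elems[q->head] = INVALID_TASK;
	q->head = (q->head + 1) % NUM_TASKS;
	return next;
}
```

With this layout, requesting the snapshot cleanup twice before the custodian wakes up still yields a single run of the task, so frequent checkpoints cannot overflow the queue.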
v8-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From 9ee271e0a6dcfc3d61ebcf83e046f555397f5196 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v8 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9dd885d936..a08dd4a524 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1182,7 +1184,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1213,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1248,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1286,61 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 353cbb2924..965372b5ff 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
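The removal test in 0006 comes down to extracting an LSN from each mapping file name and comparing it with the cutoff saved by the checkpointer. The following sketch isolates just that step so it can be checked without a server. The helper names are hypothetical, plain unsigned int and uint64_t stand in for Oid, TransactionId, and XLogRecPtr, and 0 stands in for InvalidXLogRecPtr. The format string is the LOGICAL_REWRITE_FORMAT shown in rewriteheap.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define MAPPING_NAME_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr in this sketch. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Extract the LSN embedded in a mapping file name.  Returns false if the
 * name does not start with "map-" or does not match the expected format.
 */
static bool
parse_mapping_lsn(const char *name, uint64_t *lsn)
{
	unsigned int dboid,
				relid,
				hi,
				lo,
				rewrite_xid,
				create_xid;

	assert(name != NULL && lsn != NULL);

	/* Skip over files that cannot be ours. */
	if (strncmp(name, "map-", 4) != 0)
		return false;

	if (sscanf(name, MAPPING_NAME_FORMAT,
			   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
		return false;

	/* The LSN is stored as two 32-bit halves separated by '_'. */
	*lsn = ((uint64_t) hi) << 32 | lo;
	return true;
}

/*
 * A mapping can be removed if it is older than the cutoff, or if there is no
 * valid cutoff at all.  This is the inverse of the "keep" condition
 * (lsn >= cutoff && cutoff != InvalidXLogRecPtr) used in the patch.
 */
static bool
mapping_is_removable(uint64_t lsn, uint64_t cutoff)
{
	return lsn < cutoff || cutoff == INVALID_LSN;
}
```

Note that the patch raises an ERROR for a "map-" file whose name does not parse, while this sketch just returns false to keep it free of server infrastructure.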
On Thu, Aug 11, 2022 at 04:09:21PM -0700, Nathan Bossart wrote:
> Here is a rebased patch set for cfbot. There are no other differences
> between v7 and v8.

Another rebase for cfbot.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v9-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 6810355cb3d1a03326b152aebe3c907f7544be4f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v9 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
12 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..b25c180886 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -248,6 +248,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -544,6 +545,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1821,13 +1823,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2750,6 +2755,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3070,6 +3077,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3163,6 +3172,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3620,6 +3643,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3797,6 +3832,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3834,6 +3872,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3923,6 +3962,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4117,6 +4157,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1a6f527051..b19d743cab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -129,6 +130,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -277,6 +279,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9b..d8e6ea45bc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 683f616b1a..0131862973 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 65cf4ba50f..36a83018e2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -426,6 +427,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -438,6 +440,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..467421e371 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -394,6 +394,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -411,11 +413,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e0..58455dc016 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
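For readers skimming the patch, the queue behavior in CustodianEnqueueTask() and CustodianGetNextTask() can be sketched outside the server as follows. This is only an illustration under stated assumptions: the Sketch* and sketch_* names and the two TASK_* values are hypothetical, the spinlock is omitted, and a full queue is reported with a return value instead of elog(ERROR).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's CustodianTask values. */
typedef enum SketchTask
{
	TASK_A,
	TASK_B,
	NUM_TASKS,					/* number of real tasks, also queue capacity */
	INVALID_TASK
} SketchTask;

/* Fixed-size circular queue with one slot per task type. */
typedef struct SketchQueue
{
	SketchTask	elems[NUM_TASKS];
	int			head;
} SketchQueue;

void
sketch_queue_init(SketchQueue *q)
{
	q->head = 0;
	for (int i = 0; i < NUM_TASKS; i++)
		q->elems[i] = INVALID_TASK;
}

/*
 * Scan from the head and stop at the first slot that is either empty or
 * already holds this task.  Because each task can occupy at most one slot,
 * a queue with NUM_TASKS slots should never fill up.
 */
bool
sketch_enqueue(SketchQueue *q, SketchTask task)
{
	assert((int) task >= 0 && task < NUM_TASKS);

	for (int i = 0; i < NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_TASKS;

		if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return true;
		}
	}
	return false;				/* the patch raises an ERROR here */
}

/* Pop the slot at the head; returns INVALID_TASK if that slot is empty. */
SketchTask
sketch_get_next(SketchQueue *q)
{
	SketchTask	next = q->elems[q->head];

	q->elems[q->head] = INVALID_TASK;
	q->head = (q->head + 1) % NUM_TASKS;
	return next;
}
```

Requesting the same task twice leaves a single queue entry, which is why RequestCustodian() can be called repeatedly without the queue growing.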
v9-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
From f1385ed846d21b0a544894c737eb06a99d0dab4b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v9 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directories themselves. This
commit changes that to prepare for future commits that will move
temporary file cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b25c180886..180c9a0400 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1126,7 +1126,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e3b19ca1ed..790fcb3a34 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3083,7 +3083,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3099,7 +3099,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3132,7 +3132,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3170,13 +3170,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (S_ISDIR(statbuf.st_mode))
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3194,6 +3188,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..079176b153 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
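The behavioral change in this patch (remove the contents first, then the directory itself) can be sketched with plain POSIX calls. This is a simplified illustration, not the server code: sketch_remove_dir() is a hypothetical name, it removes every entry it finds rather than applying the temporary-file prefix check controlled by unlink_all, and it reports problems through its return value instead of ereport().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Recursively remove everything under dirname, then dirname itself.
 * Returns 0 on success (or if dirname does not exist), -1 if anything
 * could not be removed.
 */
int
sketch_remove_dir(const char *dirname)
{
	DIR		   *dir;
	struct dirent *de;
	char		path[4096];
	int			rc = 0;

	assert(dirname != NULL);

	dir = opendir(dirname);
	if (dir == NULL)
		return (errno == ENOENT) ? 0 : -1;

	while ((de = readdir(dir)) != NULL)
	{
		struct stat st;
		int			n;

		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;

		n = snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(path) || lstat(path, &st) < 0)
		{
			rc = -1;
			continue;
		}

		if (S_ISDIR(st.st_mode))
		{
			/* the recursive call also removes the subdirectory itself */
			if (sketch_remove_dir(path) < 0)
				rc = -1;
		}
		else if (unlink(path) < 0)
			rc = -1;
	}
	closedir(dir);

	/* new in this patch: the directory is removed by the callee */
	if (rmdir(dirname) < 0)
		rc = -1;

	return rc;
}
```

Moving the rmdir() into the function means the top-level caller and the recursive case behave the same way, which is what lets later patches hand a whole staged directory to another process.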
v9-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
From 07e0c67c8042f429451af75f704ae8c4648c4194 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v9 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also modifies several ereport(LOG, ...) calls within
the temporary file cleanup code to ERROR instead. While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 214 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 181 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 180c9a0400..6edae456f1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1399,6 +1399,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4034,7 +4035,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 790fcb3a34..a687bd05d7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,7 @@
#include <sys/param.h>
#include <sys/resource.h> /* for getrlimit */
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -88,6 +89,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -110,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -336,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3051,29 +3057,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3083,7 +3084,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3091,7 +3093,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3099,7 +3101,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3107,21 +3109,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
+
+ /*
+ * Cycle through temp directories for all non-default tablespaces.
+ */
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
+}
+
+/*
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
*/
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3143,7 +3284,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3160,12 +3301,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
struct stat statbuf;
if (lstat(rm_path, &statbuf) < 0)
- {
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", rm_path)));
- continue;
- }
if (S_ISDIR(statbuf.st_mode))
{
@@ -3175,14 +3313,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3190,7 +3328,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3206,7 +3344,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3234,7 +3372,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3243,7 +3381,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 079176b153..2efe3d236d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,6 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
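The staging step is cheap because it is a single rename() per tablespace, regardless of how many temporary files exist. A rough standalone sketch of that idea follows. The function name, the SKETCH_* macros, the 0700 mode, and the return-code convention are assumptions made for illustration; the timestamp suffix here is microseconds since the epoch, and the overflow checks from the patch are left out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>

#define SKETCH_TMP_NAME		"pgsql_tmp"
#define SKETCH_STAGE_NAME	SKETCH_TMP_NAME "_staged_for_removal"

/*
 * Move <parent>/pgsql_tmp to
 * <parent>/pgsql_tmp_staged_for_removal/pgsql_tmp.<timestamp>.
 *
 * Returns 1 if a directory was staged (its new path is written to staged),
 * 0 if there was nothing to stage, and -1 on error.
 */
int
sketch_stage_tmp_dir(const char *parent, char *staged, size_t stagedlen)
{
	char		src[1024];
	char		stage_dir[1024];
	struct stat st;
	struct timeval tv;
	uint64_t	stamp;
	int			n;

	assert(parent != NULL && staged != NULL);

	n = snprintf(src, sizeof(src), "%s/%s", parent, SKETCH_TMP_NAME);
	if (n < 0 || (size_t) n >= sizeof(src))
		return -1;

	/* If the temporary directory doesn't exist, there is nothing to do. */
	if (stat(src, &st) != 0)
		return (errno == ENOENT) ? 0 : -1;
	if (!S_ISDIR(st.st_mode))
		return -1;

	/* Make sure the staging directory exists. */
	n = snprintf(stage_dir, sizeof(stage_dir), "%s/%s", parent,
				 SKETCH_STAGE_NAME);
	if (n < 0 || (size_t) n >= sizeof(stage_dir))
		return -1;
	if (mkdir(stage_dir, 0700) != 0 && errno != EEXIST)
		return -1;

	/* Append a timestamp so earlier staged directories are not clobbered. */
	gettimeofday(&tv, NULL);
	stamp = (uint64_t) tv.tv_sec * 1000000u + (uint64_t) tv.tv_usec;

	n = snprintf(staged, stagedlen, "%s/%s.%" PRIu64, stage_dir,
				 SKETCH_TMP_NAME, stamp);
	if (n < 0 || (size_t) n >= stagedlen)
		return -1;

	/* One rename, independent of the number of files inside. */
	if (rename(src, staged) != 0)
		return -1;

	return 1;
}
```

Since the source and destination share a parent directory, the rename stays within one filesystem, so it should not degrade into a copy.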
v9-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
From 81b71b5d7e2aea1cec2b3116414dd9e2fb1dbc7c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v9 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6edae456f1..c0500fe4df 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1398,9 +1399,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4033,12 +4037,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4051,6 +4057,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a687bd05d7..067e5920d6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,6 +97,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1565,9 +1566,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1765,9 +1766,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
v9-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (text/x-diff; charset=us-ascii)
From 11ab136fc7c05993c2e2a7aaef6e5e686faba329 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v9 5/6] Move removal of old serialized snapshots to custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87b243e0d4..88d10874e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6842,10 +6842,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1ff2c12240..abafdb52b2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2014,14 +2014,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index e6adea24f2..e1de013ece 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
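Since 0005 has the checkpointer call RequestCustodian() at every checkpoint, the deduplicating behavior of the custodian's task queue (introduced in 0001) matters: repeated requests for the same task should collapse into one queued entry. Below is a minimal standalone sketch of that queue logic, assuming no shared memory or spinlock. The names Task, TaskQueue, queue_init, queue_enqueue and queue_dequeue are hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task identifiers; the real enum is CustodianTask. */
typedef enum
{
	TASK_A,
	TASK_B,
	TASK_C,
	NUM_TASKS,					/* new tasks go above */
	INVALID_TASK
} Task;

/* Fixed-size ring with one slot per task type, so it can never overflow. */
typedef struct
{
	Task		elems[NUM_TASKS];
	int			head;
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
	q->head = 0;
	for (int i = 0; i < NUM_TASKS; i++)
		q->elems[i] = INVALID_TASK;
}

/*
 * Enqueue a task unless it is already queued.  Scanning starts at the head,
 * so an existing entry for the task is found before any empty slot.
 * Returns true if the task is queued on return.
 */
static bool
queue_enqueue(TaskQueue *q, Task task)
{
	assert(task >= 0 && task < NUM_TASKS);

	for (int i = 0; i < NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_TASKS;

		if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return true;
		}
	}

	return false;				/* not expected with one slot per task */
}

/* Remove and return the task at the head, or INVALID_TASK if empty. */
static Task
queue_dequeue(TaskQueue *q)
{
	Task		next = q->elems[q->head];

	q->elems[q->head] = INVALID_TASK;
	q->head = (q->head + 1) % NUM_TASKS;
	return next;
}
```

In this sketch, enqueueing TASK_B, then TASK_A, then TASK_B again leaves two entries, and they are dequeued in first-request order (TASK_B, then TASK_A). The patch additionally takes a spinlock around these operations, which is omitted here.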
v9-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch (text/x-diff; charset=us-ascii)
From 20927c6a9d245a781e476d626ee5f88ade1b7a7d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v9 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9dd885d936..a08dd4a524 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1182,7 +1184,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1210,6 +1213,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1240,15 +1248,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1286,3 +1286,61 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 353cbb2924..965372b5ff 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
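For readers following 0006, the removal decision in RemoveOldLogicalRewriteMappings() comes down to parsing the LSN out of each mapping file name and comparing it with the cutoff saved by the checkpointer. The following is a standalone sketch of just that step, assuming plain C types in place of Oid, TransactionId and XLogRecPtr. The helper names parse_mapping_lsn and mapping_is_removable are hypothetical, and INVALID_LSN stands in for InvalidXLogRecPtr (zero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same pattern as LOGICAL_REWRITE_FORMAT in access/rewriteheap.h. */
#define REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Extract the LSN encoded in a mapping file name.  Returns false if the
 * name does not match the expected format; *lsn is left untouched then.
 */
static bool
parse_mapping_lsn(const char *name, uint64_t *lsn)
{
	unsigned int dboid,
				relid,
				hi,
				lo,
				rewrite_xid,
				create_xid;

	if (sscanf(name, REWRITE_FORMAT,
			   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
		return false;

	/* The LSN is stored as two 32-bit halves separated by '_'. */
	*lsn = ((uint64_t) hi) << 32 | lo;
	return true;
}

/*
 * Mirror of the patch's test: a file is kept only when a valid cutoff
 * exists and the file's LSN is at or beyond it.
 */
static bool
mapping_is_removable(uint64_t lsn, uint64_t cutoff)
{
	return cutoff == INVALID_LSN || lsn < cutoff;
}
```

As an illustration with a made-up name, "map-4000-4eb-1_60DE1E08-5c5-5c6" parses to LSN 0x160DE1E08, which would be removable against a cutoff of 0x200000000 but kept against a cutoff of 0x100000000. Note that an invalid cutoff makes every mapping removable, which matches the condition in the patch.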
On Wed, Aug 24, 2022 at 09:46:24AM -0700, Nathan Bossart wrote:
> Another rebase for cfbot.
And another.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 63a470be1ac8af3b12684f136f70b2d7b6f87b81 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v10 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
12 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..b25c180886 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -248,6 +248,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -544,6 +545,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1821,13 +1823,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2750,6 +2755,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3070,6 +3077,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3163,6 +3172,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3620,6 +3643,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3797,6 +3832,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3834,6 +3872,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3923,6 +3962,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4117,6 +4157,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1a6f527051..b19d743cab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -129,6 +130,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -277,6 +279,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9b..d8e6ea45bc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 683f616b1a..0131862973 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 65cf4ba50f..36a83018e2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -426,6 +427,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -438,6 +440,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..467421e371 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -394,6 +394,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -411,11 +413,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e0..58455dc016 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
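
For readers who want to experiment with the queue logic outside the server, here is a minimal standalone sketch of the fixed-size, de-duplicating ring that the enqueue and CustodianGetNextTask() code above appears to implement. It is not the patch's code: the Sketch* names are hypothetical, and the spinlock and shared memory are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's CustodianTask enum. */
typedef enum SketchTask
{
    SKETCH_TASK_A,
    SKETCH_TASK_B,
    NUM_SKETCH_TASKS,           /* number of real tasks; also queue capacity */
    INVALID_SKETCH_TASK
} SketchTask;

typedef struct SketchQueue
{
    SketchTask  elems[NUM_SKETCH_TASKS];
    int         head;           /* index of the next slot to dequeue */
} SketchQueue;

static void
sketch_queue_init(SketchQueue *q)
{
    for (int i = 0; i < NUM_SKETCH_TASKS; i++)
        q->elems[i] = INVALID_SKETCH_TASK;
    q->head = 0;
}

/*
 * Scan from the head.  Queued tasks are contiguous from the head, so the
 * first slot that is empty or already holds this task is where it belongs.
 * Returns false only if no such slot exists.
 */
static bool
sketch_enqueue(SketchQueue *q, SketchTask task)
{
    assert((int) task >= 0 && task < NUM_SKETCH_TASKS);

    for (int i = 0; i < NUM_SKETCH_TASKS; i++)
    {
        int         idx = (q->head + i) % NUM_SKETCH_TASKS;

        if (q->elems[idx] == INVALID_SKETCH_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return true;
        }
    }
    return false;
}

/* Pop the head slot; returns INVALID_SKETCH_TASK if the queue is empty. */
static SketchTask
sketch_dequeue(SketchQueue *q)
{
    SketchTask  next = q->elems[q->head];

    q->elems[q->head] = INVALID_SKETCH_TASK;
    q->head = (q->head + 1) % NUM_SKETCH_TASKS;
    return next;
}
```

Because each task can occupy at most one slot and there is one slot per task, the queue cannot fill up as long as callers pass only valid task values, which is presumably why the patch treats running out of space as a "should never happen" error.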
v10-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From eaeca8a96c641bdee5e421ebbfb375e92219803f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v10 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directories themselves. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b25c180886..180c9a0400 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1126,7 +1126,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 20c3741aa1..87eafdd78a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3082,7 +3082,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3098,7 +3098,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3131,7 +3131,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3163,13 +3163,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3187,6 +3181,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..079176b153 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
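
As a rough illustration of the "remove the contents, then the directory itself" ordering that 0002 moves into RemovePgTempDir(), here is a standalone POSIX sketch. It is only an approximation under stated assumptions: sketch_remove_dir() is a hypothetical name, it stops at the first failure instead of logging, and it does not do the PG_TEMP_FILE_PREFIX filtering or missing_ok handling that the real function has.

```c
#define _XOPEN_SOURCE 700       /* for lstat() and mkdtemp() in strict modes */

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Remove everything below "path", then "path" itself.
 * Returns 0 on success and -1 on the first failure.
 */
static int
sketch_remove_dir(const char *path)
{
    DIR        *dir;
    struct dirent *de;
    int         rc = 0;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char        child[1024];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(child, sizeof(child), "%s/%s", path, de->d_name);

        if (lstat(child, &st) != 0)
            rc = -1;
        else if (S_ISDIR(st.st_mode))
            rc = sketch_remove_dir(child);  /* recurse; removes child too */
        else
            rc = unlink(child);
    }
    closedir(dir);

    /* The directory goes last, once it is empty. */
    if (rc == 0)
        rc = rmdir(path);
    return rc;
}
```

Doing the rmdir() at the end of the function, and not in the caller after the recursive call, is what lets the top-level directory be removed by the same code path as the nested ones.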
v10-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 88d992388e383c1c59e2166e91e0ff961a18e6fc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v10 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also changes several ereport(LOG, ...) calls within
the temporary file cleanup code to report ERROR instead. While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 215 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 182 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 180c9a0400..6edae456f1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1399,6 +1399,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4034,7 +4035,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 87eafdd78a..64c844ab87 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,7 @@
#include <sys/param.h>
#include <sys/resource.h> /* for getrlimit */
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -88,6 +89,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -110,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -336,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3050,29 +3056,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3082,7 +3083,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3090,7 +3092,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3098,7 +3100,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3106,21 +3108,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * Cycle through temp directories for all non-default tablespaces.
*/
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
+ */
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3142,7 +3283,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3156,11 +3297,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
PG_TEMP_FILE_PREFIX,
strlen(PG_TEMP_FILE_PREFIX)) == 0)
{
- PGFileType type = get_dirent_type(rm_path, temp_de, false, LOG);
+ PGFileType type = get_dirent_type(rm_path, temp_de, false, ERROR);
- if (type == PGFILETYPE_ERROR)
- continue;
- else if (type == PGFILETYPE_DIR)
+ if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
RemovePgTempDir(rm_path, false, true);
@@ -3168,14 +3307,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3183,7 +3322,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3199,7 +3338,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3227,7 +3366,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3236,7 +3375,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 079176b153..2efe3d236d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,6 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
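
To make the first stage concrete, here is a small standalone sketch of the rename step described in the 0003 commit message: the pgsql_tmp directory is moved under a sibling staging directory and given a timestamp suffix, so that startup only pays for one rename() per tablespace. The function name, buffer sizes and return convention are assumptions made for illustration, and the suffix here is microseconds since the epoch; the real StagePgTempDirForRemoval() reports failures with ereport() and uses MakePGDirectory().

```c
#define _XOPEN_SOURCE 700       /* for gettimeofday() and mkdtemp() */

#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>

#define SKETCH_TMP_DIR      "pgsql_tmp"
#define SKETCH_STAGING_DIR  SKETCH_TMP_DIR "_staged_for_removal"

/*
 * Move parent/pgsql_tmp to
 * parent/pgsql_tmp_staged_for_removal/pgsql_tmp.<usecs>.
 *
 * Returns 0 on success or if there was nothing to stage, -1 on error.
 * staged_path is set to the new name, or to "" if nothing was staged.
 */
static int
sketch_stage_tmp_dir(const char *parent, char *staged_path, size_t len)
{
    char        src[1024];
    char        staging[1024];
    struct stat st;
    struct timeval tv;
    uint64_t    usecs;

    assert(len > 0);
    staged_path[0] = '\0';

    snprintf(src, sizeof(src), "%s/%s", parent, SKETCH_TMP_DIR);
    if (stat(src, &st) != 0)
        return (errno == ENOENT) ? 0 : -1;

    /* The staging directory may be left over from an earlier run. */
    snprintf(staging, sizeof(staging), "%s/%s", parent, SKETCH_STAGING_DIR);
    if (mkdir(staging, 0700) != 0 && errno != EEXIST)
        return -1;

    /* Timestamp suffix keeps names distinct from earlier staged dirs. */
    gettimeofday(&tv, NULL);
    usecs = (uint64_t) tv.tv_sec * UINT64_C(1000000) + (uint64_t) tv.tv_usec;
    snprintf(staged_path, len, "%s/%s.%" PRIu64,
             staging, SKETCH_TMP_DIR, usecs);

    return rename(src, staged_path);
}
```

The second stage can then walk the staging directory at leisure and delete each pgsql_tmp.* entry, which is the part that 0004 hands to the custodian.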
v10-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 4d048c3e6829ef826c4efa020d64ca514a0b5c75 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v10 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6edae456f1..c0500fe4df 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1398,9 +1399,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4033,12 +4037,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4051,6 +4057,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 64c844ab87..2475b35c1e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,6 +97,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1565,9 +1566,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1765,9 +1766,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
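For readers following the series, here is a minimal standalone sketch of how a table entry such as the one added to cust_task_functions above gets found: a linear scan that stops at the INVALID_CUSTODIAN_TASK sentinel. It is modeled on LookupCustodianFunctions() from patch 1/6, not copied from it. The names task_entry, task_table, lookup_task, fake_remove_temp_files and temp_file_runs are hypothetical, and the sketch returns NULL where the patch raises elog(ERROR).

```c
#include <stddef.h>

/* Task identifiers, mirroring custodian.h at this point in the series. */
typedef enum CustodianTask
{
	CUSTODIAN_REMOVE_TEMP_FILES,
	NUM_CUSTODIAN_TASKS,		/* new tasks go above */
	INVALID_CUSTODIAN_TASK
} CustodianTask;

typedef void (*TaskFunction) (void);

/* Hypothetical, simplified version of struct cust_task_funcs_entry. */
struct task_entry
{
	CustodianTask task;
	TaskFunction task_func;		/* performs task */
};

/* Hypothetical stand-in for RemovePgTempFiles(); it only counts calls. */
static int	temp_file_runs = 0;

static void
fake_remove_temp_files(void)
{
	temp_file_runs++;
}

static const struct task_entry task_table[] = {
	{CUSTODIAN_REMOVE_TEMP_FILES, fake_remove_temp_files},
	{INVALID_CUSTODIAN_TASK, NULL}	/* must be last */
};

/*
 * Scan the table until the sentinel entry.  Returns NULL when the task has
 * no entry (the patch reports an error in that case instead).
 */
static const struct task_entry *
lookup_task(CustodianTask task)
{
	const struct task_entry *entry;

	for (entry = task_table; entry->task != INVALID_CUSTODIAN_TASK; entry++)
	{
		if (entry->task == task)
			return entry;
	}

	return NULL;
}
```

A caller would look up the entry for CUSTODIAN_REMOVE_TEMP_FILES and invoke its task_func. Keeping the sentinel as the last array element means a new task needs only one new row.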
v10-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From c78157b63fd9cdaacc7d1bec314b5eb553c36e4c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v10 5/6] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7a710e6490..cbe86c6822 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6848,10 +6848,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1d8ebb4c0d..d3bbc59389 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2027,14 +2027,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index e6adea24f2..e1de013ece 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
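Since CheckPointGuts() now calls RequestCustodian() on every checkpoint, it matters that repeated requests for the same task collapse into a single queue entry. The following standalone sketch illustrates that behavior. It is modeled on CustodianEnqueueTask() and CustodianGetNextTask() from patch 1/6, with the spinlock and shared memory left out. TaskQueue, queue_init, queue_enqueue and queue_next are hypothetical names, and queue_enqueue returns 0 where the patch raises elog(ERROR).

```c
#include <assert.h>

typedef enum CustodianTask
{
	CUSTODIAN_REMOVE_TEMP_FILES,
	CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
	NUM_CUSTODIAN_TASKS,		/* new tasks go above */
	INVALID_CUSTODIAN_TASK
} CustodianTask;

/* One slot per task type, so the queue can never overflow. */
typedef struct
{
	CustodianTask elems[NUM_CUSTODIAN_TASKS];
	int			head;
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
	q->head = 0;
	for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
		q->elems[i] = INVALID_CUSTODIAN_TASK;
}

/*
 * Walk the ring starting at head.  The first slot that is empty or already
 * holds this task receives it, so a duplicate request changes nothing.
 * Returns 1 on success and 0 if no slot was found.
 */
static int
queue_enqueue(TaskQueue *q, CustodianTask task)
{
	assert((int) task >= 0 && task < NUM_CUSTODIAN_TASKS);

	for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_CUSTODIAN_TASKS;

		if (q->elems[idx] == INVALID_CUSTODIAN_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return 1;
		}
	}

	return 0;
}

/*
 * Take the task at head, clear its slot, and advance head.  Returns
 * INVALID_CUSTODIAN_TASK when the slot at head is empty.
 */
static CustodianTask
queue_next(TaskQueue *q)
{
	CustodianTask next = q->elems[q->head];

	q->elems[q->head] = INVALID_CUSTODIAN_TASK;
	q->head = (q->head + 1) % NUM_CUSTODIAN_TASKS;

	return next;
}
```

With this layout, enqueueing the snapshot task twice and the temp-file task once yields exactly two dequeued tasks, in request order, followed by INVALID_CUSTODIAN_TASK.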
v10-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 86209771fde99a3485ff063321edb8620aa473f6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v10 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2f08fbe8d3..a01edf8a1f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -124,6 +125,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1183,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1211,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1243,15 +1251,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1289,3 +1289,61 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 353cbb2924..965372b5ff 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
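To make the removal rule in patch 6/6 easier to check by hand, here is a standalone sketch of the per-file decision. It parses a mapping file name with LOGICAL_REWRITE_FORMAT, rebuilds the LSN from its two 32-bit halves, and compares it with the saved cutoff, as RemoveOldLogicalRewriteMappings() does. The helper name mapping_is_removable, the INVALID_LSN macro (standing in for InvalidXLogRecPtr) and the -1 return value are assumptions made for illustration. The patch itself silently skips names that do not start with "map-" and raises an error when sscanf() fails.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same format string as in access/rewriteheap.h. */
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Decide what to do with one entry of pg_logical/mappings.
 *
 * Returns 1 if the file may be removed, 0 if it must be kept, and -1 if
 * the name is not a parseable mapping file name.
 */
static int
mapping_is_removable(const char *name, uint64_t cutoff)
{
	unsigned int dboid,
				relid,
				hi,
				lo,
				rewrite_xid,
				create_xid;
	uint64_t	lsn;

	/* Files that cannot be ours. */
	if (strncmp(name, "map-", 4) != 0)
		return -1;

	if (sscanf(name, LOGICAL_REWRITE_FORMAT,
			   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
		return -1;

	/* The LSN is stored as two 32-bit hex fields separated by '_'. */
	lsn = ((uint64_t) hi) << 32 | lo;

	/* Keep files at or past the cutoff, unless no cutoff is set. */
	if (lsn >= cutoff && cutoff != INVALID_LSN)
		return 0;

	return 1;
}
```

For a name such as "map-4000-4eb-1_60DE1E08-5ab-5af", the parsed LSN is 0x160DE1E08. The file is removable with a cutoff of 0x200000000 and kept with a cutoff of 0x100000000. With an invalid (zero) cutoff, every mapping file is removable, which matches the condition the checkpoint code used before this patch.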
On Fri, Sep 02, 2022 at 03:07:44PM -0700, Nathan Bossart wrote:
And another.
v11 adds support for building with meson.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v11-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 56c9ff2bf1a6524518b62193c0da02372f9674a1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v11 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 489 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 383bc4776e..b1b249cc90 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -248,6 +248,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -544,6 +545,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1821,13 +1823,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2746,6 +2751,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3066,6 +3073,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3159,6 +3168,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3616,6 +3639,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3793,6 +3828,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3830,6 +3868,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3919,6 +3958,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4113,6 +4153,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..f297f489c9 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9b..d8e6ea45bc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 683f616b1a..0131862973 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ee48e392ed..c2e9bb3a75 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -426,6 +427,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -438,6 +440,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 91824b4691..86acd3a5b9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -396,6 +396,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -413,11 +415,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e0..58455dc016 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
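For readers skimming the thread, the lookup at the end of custodian.c in the patch above walks a table that ends in a sentinel entry. Here is a minimal standalone sketch of that pattern. All names (DemoTask, demo_task_table, demo_lookup) are made up for illustration and are not part of the patch, and it returns NULL where the patch raises an ERROR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task identifiers, shaped like CustodianTask in the patch. */
typedef enum DemoTask
{
	DEMO_TASK_A,
	DEMO_TASK_B,

	NUM_DEMO_TASKS,				/* new tasks go above */
	INVALID_DEMO_TASK
} DemoTask;

typedef void (*demo_task_fn) (void);

struct demo_task_entry
{
	DemoTask	task;
	demo_task_fn handler;
};

/* Counters so that callers can observe which handler ran. */
static int	demo_a_runs = 0;
static int	demo_b_runs = 0;

static void demo_a(void) { demo_a_runs++; }
static void demo_b(void) { demo_b_runs++; }

/* The table ends with a sentinel, so no separate length is needed. */
static const struct demo_task_entry demo_task_table[] = {
	{DEMO_TASK_A, demo_a},
	{DEMO_TASK_B, demo_b},
	{INVALID_DEMO_TASK, NULL}	/* must be last */
};

/*
 * Return the table entry for the given task, or NULL if the table has no
 * entry for it.
 */
static const struct demo_task_entry *
demo_lookup(DemoTask task)
{
	const struct demo_task_entry *entry;

	assert((int) task >= 0 && task < NUM_DEMO_TASKS);

	for (entry = demo_task_table;
		 entry->task != INVALID_DEMO_TASK;
		 entry++)
	{
		if (entry->task == task)
			return entry;
	}

	return NULL;
}
```

A caller would do something like demo_lookup(DEMO_TASK_B)->handler(). Keeping the sentinel last means a new task only needs a new enum value plus a new table row.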
v11-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From 2f8871e908bdd5f16f207376882d1200ec3ea253 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v11 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directories themselves. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b1b249cc90..6c823407b5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1126,7 +1126,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 073dab2be5..b92b08c94a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3082,7 +3082,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3098,7 +3098,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3131,7 +3131,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3163,13 +3163,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3187,6 +3181,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 5a48fccd9c..7a21085ad4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -169,8 +169,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
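The behavior change in the patch above (remove the contents, then the directory itself, at every level) can be sketched outside of PostgreSQL with plain POSIX calls. This is only an illustration under assumptions: demo_remove_tree is a made-up name, it removes everything it finds rather than checking the pgsql_tmp name prefix, and it stops at the first failure instead of using ereport().

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Remove everything under "path", then "path" itself.  A missing directory
 * is not an error.  Returns 0 on success, -1 on the first failure with errno
 * set.
 */
static int
demo_remove_tree(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	char		child[4096];
	int			rc = 0;

	assert(path != NULL);

	dir = opendir(path);
	if (dir == NULL)
		return (errno == ENOENT) ? 0 : -1;

	while (rc == 0 && (de = readdir(dir)) != NULL)
	{
		struct stat st;
		int			n;

		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;

		n = snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(child))
		{
			errno = ENAMETOOLONG;
			rc = -1;
		}
		else if (lstat(child, &st) != 0)
			rc = -1;
		else if (S_ISDIR(st.st_mode))
			rc = demo_remove_tree(child);	/* contents, then the subdirectory */
		else
			rc = unlink(child);
	}

	if (rc != 0)
	{
		int			saved_errno = errno;

		closedir(dir);
		errno = saved_errno;
		return -1;
	}

	closedir(dir);

	/* The directory is empty now, so remove the directory itself too. */
	return rmdir(path);
}
```

Because the final rmdir() lives in the function itself, the recursive call no longer needs its own rmdir() afterwards, which is the same simplification the patch makes in RemovePgTempDir().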
v11-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 341b24353c9f2897346a007b2ecfb2fc6d7e54eb Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v11 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also converts several ereport(LOG, ...) calls within
the temporary file cleanup code to ereport(ERROR, ...). While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 215 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 182 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6c823407b5..e13bc11daf 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1399,6 +1399,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4030,7 +4031,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b92b08c94a..c8ffb53b2c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,7 @@
#include <sys/param.h>
#include <sys/resource.h> /* for getrlimit */
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -88,6 +89,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -110,6 +112,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -336,6 +340,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3050,29 +3056,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3082,7 +3083,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3090,7 +3092,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3098,7 +3100,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3106,21 +3108,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * Cycle through temp directories for all non-default tablespaces.
*/
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
+ */
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp, in milliseconds, to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) (tv.tv_usec / 1000), &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3142,7 +3283,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3156,11 +3297,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
PG_TEMP_FILE_PREFIX,
strlen(PG_TEMP_FILE_PREFIX)) == 0)
{
- PGFileType type = get_dirent_type(rm_path, temp_de, false, LOG);
+ PGFileType type = get_dirent_type(rm_path, temp_de, false, ERROR);
- if (type == PGFILETYPE_ERROR)
- continue;
- else if (type == PGFILETYPE_DIR)
+ if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
RemovePgTempDir(rm_path, false, true);
@@ -3168,14 +3307,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3183,7 +3322,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3199,7 +3338,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3227,7 +3366,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3236,7 +3375,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7a21085ad4..4d2b3afda8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -168,6 +168,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
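The staging step in the patch above boils down to "make a sibling staging directory, then rename() the temporary directory into it under a timestamped name". Below is a rough standalone sketch of that idea using only POSIX calls. The names demo_now_ms and demo_stage_dir, the "_staged" suffix, and the return-code convention are assumptions for illustration; the patch itself uses PG_TEMP_TO_REMOVE_DIR, MakePGDirectory() and ereport().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* Milliseconds since the Unix epoch. */
static uint64_t
demo_now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t) tv.tv_sec * 1000 + (uint64_t) tv.tv_usec / 1000;
}

/*
 * Move "parent/name" to "parent/name_staged/name.<milliseconds>" and write
 * the new path into "staged".
 *
 * Returns 0 if the directory was staged, 1 if there was nothing to stage,
 * and -1 on error with errno set.
 */
static int
demo_stage_dir(const char *parent, const char *name,
			   char *staged, size_t stagedlen)
{
	char		src[1024];
	char		stage_root[1024];
	struct stat st;
	int			n;

	assert(parent != NULL && name != NULL && staged != NULL);

	n = snprintf(src, sizeof(src), "%s/%s", parent, name);
	if (n < 0 || (size_t) n >= sizeof(src))
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* If the directory does not exist, there is nothing to stage. */
	if (stat(src, &st) != 0)
		return (errno == ENOENT) ? 1 : -1;
	if (!S_ISDIR(st.st_mode))
	{
		errno = ENOTDIR;
		return -1;
	}

	/* Make sure the staging directory exists next to the source. */
	n = snprintf(stage_root, sizeof(stage_root), "%s/%s_staged", parent, name);
	if (n < 0 || (size_t) n >= sizeof(stage_root))
	{
		errno = ENAMETOOLONG;
		return -1;
	}
	if (mkdir(stage_root, 0700) != 0 && errno != EEXIST)
		return -1;

	/* A timestamp suffix keeps earlier, not-yet-removed stagings distinct. */
	n = snprintf(staged, stagedlen, "%s/%s.%llu", stage_root, name,
				 (unsigned long long) demo_now_ms());
	if (n < 0 || (size_t) n >= stagedlen)
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	return (rename(src, staged) == 0) ? 0 : -1;
}
```

Since the staging directory is created beside the source, the rename() stays within one filesystem, so the cost of staging does not depend on how many files the temporary directory contains.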
v11-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 41697ef58e797428555c43d57fb0e01fee9a895a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v11 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e13bc11daf..44479eec60 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1398,9 +1399,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4029,12 +4033,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4047,6 +4053,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index c8ffb53b2c..64546ca738 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -97,6 +97,7 @@
#include "pgstat.h"
#include "port/pg_iovec.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1565,9 +1566,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1765,9 +1766,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
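
For readers following along: the patch above can request CUSTODIAN_REMOVE_TEMP_FILES from more than one place, which is harmless because the custodian's queue holds each task at most once. Below is a minimal standalone sketch of that de-duplicating ring queue. It assumes no shared memory and no spinlock, and every Demo* / demo_* name is hypothetical; it is not the patch's code, only an illustration of the same idea.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the CustodianTask enum in the patch. */
typedef enum DemoTask
{
	DEMO_REMOVE_TEMP_FILES,
	DEMO_REMOVE_SNAPSHOTS,
	DEMO_NUM_TASKS,				/* new tasks go above */
	DEMO_INVALID_TASK
} DemoTask;

/* One slot per task type, so the queue can never overflow. */
typedef struct DemoQueue
{
	DemoTask	elems[DEMO_NUM_TASKS];
	int			head;
} DemoQueue;

void
demo_queue_init(DemoQueue *q)
{
	q->head = 0;
	for (int i = 0; i < DEMO_NUM_TASKS; i++)
		q->elems[i] = DEMO_INVALID_TASK;
}

/*
 * Scan from the head for either an empty slot or a slot that already holds
 * this task.  Returns true if the task is queued when we return.
 */
bool
demo_enqueue(DemoQueue *q, DemoTask task)
{
	assert(task >= 0 && task < DEMO_NUM_TASKS);

	for (int i = 0; i < DEMO_NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % DEMO_NUM_TASKS;

		if (q->elems[idx] == DEMO_INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return true;
		}
	}

	return false;				/* not reachable with one slot per task */
}

/* Pop the head slot; returns DEMO_INVALID_TASK if nothing is queued there. */
DemoTask
demo_dequeue(DemoQueue *q)
{
	DemoTask	next = q->elems[q->head];

	q->elems[q->head] = DEMO_INVALID_TASK;
	q->head = (q->head + 1) % DEMO_NUM_TASKS;
	return next;
}
```

Because the array has exactly one slot per task type, enqueueing never needs to allocate or block, which is presumably why a fixed-size array suffices here.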
v11-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From d672abc22253484a179cbbcae679cd63366ebf35 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v11 5/6] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f32b2124e6..677cead44e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7028,10 +7028,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1d8ebb4c0d..d3bbc59389 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2027,14 +2027,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old
+ * slots at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index f126ff2e08..4877afb1bd 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
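
As an aside on the task table that this patch extends: each row pairs a task ID with the function that performs it and an optional function that handles the request's extra argument, and a sentinel row terminates the array. The following is a rough standalone sketch of that lookup-and-dispatch pattern, shown only for the "immediate" (synchronous) case. All Sketch* / sketch_* names are hypothetical, the argument is a plain uint64_t rather than a Datum, and the second task is invented purely to show the optional argument handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task IDs; the patch's enum is CustodianTask. */
typedef enum SketchTask
{
	SKETCH_REMOVE_SNAPSHOTS,
	SKETCH_TASK_WITH_ARG,		/* invented task that takes an argument */
	SKETCH_NUM_TASKS,			/* new tasks go above */
	SKETCH_INVALID_TASK
} SketchTask;

typedef void (*SketchTaskFunction) (void);
typedef void (*SketchHandleArg) (uint64_t arg);

struct sketch_entry
{
	SketchTask	task;
	SketchTaskFunction task_func;	/* performs task */
	SketchHandleArg handle_arg_func;	/* optional; may be NULL */
};

/* Observable state so the effect of each callback can be checked. */
int			sketch_snapshot_runs = 0;
int			sketch_arg_task_runs = 0;
uint64_t	sketch_saved_arg = 0;

void sketch_remove_snapshots(void) { sketch_snapshot_runs++; }
void sketch_run_arg_task(void) { sketch_arg_task_runs++; }
void sketch_save_arg(uint64_t arg) { sketch_saved_arg = arg; }

const struct sketch_entry sketch_table[] = {
	{SKETCH_REMOVE_SNAPSHOTS, sketch_remove_snapshots, NULL},
	{SKETCH_TASK_WITH_ARG, sketch_run_arg_task, sketch_save_arg},
	{SKETCH_INVALID_TASK, NULL, NULL}	/* must be last */
};

/* Walk the table until the sentinel; NULL means "no such task". */
const struct sketch_entry *
sketch_lookup(SketchTask task)
{
	for (const struct sketch_entry *e = sketch_table;
		 e->task != SKETCH_INVALID_TASK;
		 e++)
	{
		if (e->task == task)
			return e;
	}
	return NULL;
}

/*
 * Immediate-mode request: process the argument first (if the task has a
 * handler), then run the task in the calling process.  Returns 0 on
 * success, -1 if the task has no table entry.
 */
int
sketch_request_immediate(SketchTask task, uint64_t arg)
{
	const struct sketch_entry *e = sketch_lookup(task);

	if (e == NULL)
		return -1;
	if (e->handle_arg_func)
		e->handle_arg_func(arg);
	e->task_func();
	return 0;
}
```

Unlike this sketch, the patch reports a missing table entry with elog(ERROR) instead of returning an error code, and the non-immediate path only enqueues the task and sets the custodian's latch.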
v11-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 596b7a766222d8cce8f6b3eba0b1996dea5ba2e4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v11 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2f08fbe8d3..a01edf8a1f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -124,6 +125,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1183,7 +1185,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1211,6 +1214,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1243,15 +1251,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1289,3 +1289,61 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ struct stat statbuf;
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
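
To make the removal rule in the patch above concrete: a mapping file's name encodes an LSN as two 32-bit hex halves, and the file is kept only if that LSN is at or past a valid cutoff. Here is a small standalone sketch of that decision. The demo_* names are hypothetical, uint64_t stands in for XLogRecPtr, 0 stands in for InvalidXLogRecPtr, and unparsable names yield -1 here whereas the patch raises an ERROR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same pattern as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define DEMO_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
#define DEMO_INVALID_LSN ((uint64_t) 0)

/*
 * Returns 1 if the named mapping file may be removed, 0 if it must be kept,
 * and -1 if the name is not a mapping file or cannot be parsed.
 */
int
demo_mapping_is_removable(const char *name, uint64_t cutoff)
{
	unsigned int dboid;
	unsigned int relid;
	unsigned int hi;
	unsigned int lo;
	unsigned int rewrite_xid;
	unsigned int create_xid;
	uint64_t	lsn;

	/* Skip over files that cannot be ours. */
	if (strncmp(name, "map-", 4) != 0)
		return -1;

	if (sscanf(name, DEMO_REWRITE_FORMAT,
			   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
		return -1;

	/* Reassemble the 64-bit LSN from its two halves. */
	lsn = ((uint64_t) hi) << 32 | lo;

	/* Keep only files at or past a valid cutoff. */
	if (lsn >= cutoff && cutoff != DEMO_INVALID_LSN)
		return 0;

	return 1;
}
```

Note that an invalid cutoff makes every mapping file removable, which matches the condition used in both CheckPointLogicalRewriteHeap() and RemoveOldLogicalRewriteMappings() above.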
On Fri, Sep 23, 2022 at 10:41:54AM -0700, Nathan Bossart wrote:
> v11 adds support for building with meson.

Rebased.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v12-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 367c5f3863457cfbd0fe8add0e8df3e630aaaea9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v12 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 44 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 489 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0b637ba6a2..3706eec25e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -248,6 +248,7 @@ bool remove_temp_files_after_crash = true;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -544,6 +545,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1821,13 +1823,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2746,6 +2751,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3066,6 +3073,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3159,6 +3168,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3616,6 +3639,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) CustodianPID)));
+ signal_child(CustodianPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3793,6 +3828,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3830,6 +3868,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3919,6 +3958,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4113,6 +4153,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff..1bae34d1ee 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9b..d8e6ea45bc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb..448dde0161 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e0..58455dc016 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
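
The reaper hunk above treats a clean custodian exit as harmless and any other exit as a crash. A minimal standalone sketch of that classification, using only the standard POSIX wait macros rather than PostgreSQL's own helpers, might look like the following. The names ChildExitKind, classify_child_exit and run_child_and_classify are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification, mirroring "normal exit" vs. "crash". */
typedef enum
{
	CHILD_EXIT_NORMAL,
	CHILD_EXIT_CRASH
} ChildExitKind;

/* A child that called exit(0) is normal; anything else counts as a crash. */
ChildExitKind
classify_child_exit(int status)
{
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return CHILD_EXIT_NORMAL;
	return CHILD_EXIT_CRASH;
}

/* Fork a child that exits with exit_code, reap it, and classify the result. */
ChildExitKind
run_child_and_classify(int exit_code)
{
	int			status = 0;
	pid_t		pid;
	pid_t		reaped;

	pid = fork();
	assert(pid >= 0);
	if (pid == 0)
		_exit(exit_code);		/* child: terminate immediately */

	/* Retry if a signal interrupts the wait. */
	do
	{
		reaped = waitpid(pid, &status, 0);
	} while (reaped < 0 && errno == EINTR);
	assert(reaped == pid);

	return classify_child_exit(status);
}
```

In the patch itself the "crash" branch calls HandleChildCrash(), while the normal branch only clears CustodianPID so that ServerLoop() can start a new custodian on its next iteration.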
Attachment: v12-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From cff96525c02259394a483e6edc1879a774b6d2ce Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v12 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3706eec25e..faaf13d537 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1126,7 +1126,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec5..0e1398d07c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3081,7 +3081,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3097,7 +3097,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3130,7 +3130,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3162,13 +3162,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3186,6 +3180,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d..790b9a9a14 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -167,8 +167,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
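
The behavioural change in 0002 is that the directory itself is removed once its contents are gone, instead of only the contents. A rough standalone sketch of that "empty it, then rmdir it" shape, written against plain POSIX calls, is shown below. It is not the fd.c code: remove_dir_recursively is a hypothetical helper, it does not apply the PG_TEMP_FILE_PREFIX filter, and it returns an error code instead of calling ereport().

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove everything below "path", then "path" itself.
 * Returns 0 on success and -1 on the first failure, with errno set.
 */
int
remove_dir_recursively(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	struct stat st;
	char		child[4096];
	int			rc = 0;
	int			saved_errno;

	assert(path != NULL);

	dir = opendir(path);
	if (dir == NULL)
		return -1;

	while (rc == 0 && (de = readdir(dir)) != NULL)
	{
		int			n;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		n = snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(child))
		{
			errno = ENAMETOOLONG;
			rc = -1;
		}
		else if (lstat(child, &st) != 0)
			rc = -1;
		else if (S_ISDIR(st.st_mode))
			rc = remove_dir_recursively(child);	/* subdirectory: recurse */
		else if (unlink(child) != 0)
			rc = -1;
	}

	/* Preserve the errno of the first failure across closedir(). */
	saved_errno = errno;
	closedir(dir);
	errno = saved_errno;

	if (rc != 0)
		return -1;

	/* The directory is empty now, so it can be removed as well. */
	return rmdir(path);
}
```

Because the final rmdir() lives in the function itself, the recursive call sites no longer need their own rmdir(), which is the same simplification the hunk in RemovePgTempDir() makes.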
Attachment: v12-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 5354b935e2927de5d78cdb9a94b700855ba3f350 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v12 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also raises several ereport(LOG, ...) calls within the
temporary file cleanup code to ERROR. While temporary file cleanup
is typically not urgent enough to prevent startup, excessive
lenience might mask bugs.
---
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 215 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 182 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index faaf13d537..54548e28e9 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1399,6 +1399,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -4030,7 +4031,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e1398d07c..9610850d45 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,7 @@
#include <sys/param.h>
#include <sys/resource.h> /* for getrlimit */
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -88,6 +89,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -109,6 +111,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -335,6 +339,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3049,29 +3055,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3081,7 +3082,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3089,7 +3091,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3097,7 +3099,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3105,21 +3107,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * Cycle through temp directories for all non-default tablespaces.
*/
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
+ */
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3141,7 +3282,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3155,11 +3296,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
PG_TEMP_FILE_PREFIX,
strlen(PG_TEMP_FILE_PREFIX)) == 0)
{
- PGFileType type = get_dirent_type(rm_path, temp_de, false, LOG);
+ PGFileType type = get_dirent_type(rm_path, temp_de, false, ERROR);
- if (type == PGFILETYPE_ERROR)
- continue;
- else if (type == PGFILETYPE_DIR)
+ if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
RemovePgTempDir(rm_path, false, true);
@@ -3167,14 +3306,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3182,7 +3321,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3198,7 +3337,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3226,7 +3365,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3235,7 +3374,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 790b9a9a14..cf27e90aea 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -166,6 +166,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
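
The point of the staging step in 0003 is that a single rename() within one filesystem moves the whole pgsql_tmp tree regardless of how many files it holds, so the expensive unlink loop can run later. A minimal sketch of that step, assuming plain POSIX calls and no PostgreSQL infrastructure, follows. stage_dir_for_removal and its parameters are hypothetical names, and the suffix here is microseconds since the epoch (seconds scaled by 1,000,000 plus tv_usec), which is this sketch's choice of "sufficiently unique" name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: move "dir" to "<staging>/<label>.<microseconds>".
 * The new path is written to "out". Returns 0 on success, -1 on failure.
 */
int
stage_dir_for_removal(const char *dir, const char *staging, const char *label,
					  char *out, size_t outlen)
{
	struct timeval tv;
	uint64_t	stamp;
	int			n;

	assert(dir != NULL && staging != NULL && label != NULL && out != NULL);

	/* Create the staging directory; an already existing one is fine. */
	if (mkdir(staging, 0700) != 0 && errno != EEXIST)
		return -1;

	/* Microseconds since the epoch keep repeated stagings apart. */
	if (gettimeofday(&tv, NULL) != 0)
		return -1;
	stamp = (uint64_t) tv.tv_sec * 1000000u + (uint64_t) tv.tv_usec;

	n = snprintf(out, outlen, "%s/%s.%llu", staging, label,
				 (unsigned long long) stamp);
	if (n < 0 || (size_t) n >= outlen)
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* One rename moves the whole tree, however many files it contains. */
	return rename(dir, out);
}
```

A later pass can then walk the staging directory and delete each staged entry at leisure, which is the role RemoveStagedPgTempDirs() plays in the patch.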
Attachment: v12-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From d9acb6e06b0a8ea8c29da93b2336d19ecb958c15 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v12 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54548e28e9..6dc33724f4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1398,9 +1399,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -4029,12 +4033,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -4047,6 +4053,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9610850d45..625355c56a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -96,6 +96,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1564,9 +1565,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1764,9 +1765,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
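
Reading aid for the queue semantics behind RequestCustodian(..., false, ...): the custodian patch (0001) keeps one slot per task in a small ring, so requesting a task that is already queued does nothing. The following is a minimal standalone sketch of that behavior, not the patch's code. The names Task, TaskQueue, queue_init, queue_enqueue and queue_next are hypothetical, and the spinlock that protects the real queue in shared memory is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the patch's CustodianTask enum. */
typedef enum
{
    TASK_REMOVE_TEMP_FILES,
    TASK_REMOVE_SNAPSHOTS,
    NUM_TASKS,                  /* new tasks go above */
    INVALID_TASK
} Task;

typedef struct
{
    Task        elems[NUM_TASKS];   /* one slot per possible task */
    int         head;               /* slot the consumer reads next */
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
    for (int i = 0; i < NUM_TASKS; i++)
        q->elems[i] = INVALID_TASK;
    q->head = 0;
}

/*
 * Scan from head for an empty slot or a slot that already holds this task.
 * Returns 1 once the task is queued, 0 if no slot was usable (which should
 * not happen, since there is one slot per task).
 */
static int
queue_enqueue(TaskQueue *q, Task task)
{
    assert((int) task >= 0 && task < NUM_TASKS);

    for (int i = 0; i < NUM_TASKS; i++)
    {
        int         idx = (q->head + i) % NUM_TASKS;

        if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return 1;
        }
    }
    return 0;
}

/* Pop the slot at head; INVALID_TASK means the queue was empty. */
static Task
queue_next(TaskQueue *q)
{
    Task        next = q->elems[q->head];

    q->elems[q->head] = INVALID_TASK;
    q->head = (q->head + 1) % NUM_TASKS;
    return next;
}
```

Because a repeated request lands in the slot that already holds the task, a burst of checkpoints can queue each cleanup at most once, and the queue never needs more than NUM_TASKS slots.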
v12-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From 1106caf8a697be7f7d59aefe9dad5db2596f6f04 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v12 5/6] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index be54c23187..03fdcf2c07 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7024,10 +7024,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5006a5c464..a161cf3995 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2030,14 +2030,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
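
Reading aid for the two pieces this patch touches: the sentinel-terminated cust_task_functions table, and the immediate argument passed as !IsUnderPostmaster. The following is a minimal standalone sketch under simplifying assumptions, not the patch's code. The names task_table, lookup_task, request_task and run_pending are hypothetical, pending work is tracked with a plain flag array instead of the shared-memory queue, and the latch wakeup is omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    TASK_REMOVE_TEMP_FILES,
    TASK_REMOVE_SNAPSHOTS,
    NUM_TASKS,                  /* new tasks go above */
    INVALID_TASK
} Task;

typedef void (*TaskFunc) (void);

struct task_entry
{
    Task        task;
    TaskFunc    func;
};

/* Counters standing in for the real cleanup work. */
static int  temp_runs;
static int  snap_runs;

static void remove_temp_files(void) { temp_runs++; }
static void remove_snapshots(void) { snap_runs++; }

static const struct task_entry task_table[] = {
    {TASK_REMOVE_TEMP_FILES, remove_temp_files},
    {TASK_REMOVE_SNAPSHOTS, remove_snapshots},
    {INVALID_TASK, NULL}        /* must be last */
};

static int  pending[NUM_TASKS];

/* Walk the table until the sentinel; NULL means the task has no entry. */
static const struct task_entry *
lookup_task(Task task)
{
    for (const struct task_entry *e = task_table; e->task != INVALID_TASK; e++)
    {
        if (e->task == task)
            return e;
    }
    return NULL;
}

/* Run every pending task in the calling process; returns how many ran. */
static int
run_pending(void)
{
    int         ran = 0;

    for (int t = 0; t < NUM_TASKS; t++)
    {
        if (!pending[t])
            continue;
        pending[t] = 0;
        lookup_task((Task) t)->func();
        ran++;
    }
    return ran;
}

/*
 * Mark the task pending.  With immediate set (as in single-user mode, where
 * no custodian process exists), drain the pending work before returning;
 * otherwise return at once and leave the work for a later run_pending().
 */
static int
request_task(Task task, int immediate)
{
    assert(lookup_task(task) != NULL);
    pending[task] = 1;
    return immediate ? run_pending() : 0;
}
```

As in the patch, an immediate request drains everything that is queued, not only the task just requested. The sketch runs tasks in enum order, whereas the patch runs them in queue order.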
v12-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From f444c6b61013d3a67959b2319c45419a385403e0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v12 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 80 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 ++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 118 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..07976504cc 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1247,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1285,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
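
Reading aid for the removal rule in RemoveOldLogicalRewriteMappings(): the LSN is rebuilt from the two hex fields in the file name and compared against the cutoff saved by the checkpointer. The following is a minimal standalone sketch of that rule, not the patch's code. The function name mapping_is_removable and its return convention are hypothetical. The format string is copied from LOGICAL_REWRITE_FORMAT, and INVALID_LSN stands in for InvalidXLogRecPtr (0).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

#define REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
#define INVALID_LSN ((uint64_t) 0)

/*
 * Returns 1 if the mapping file may be removed, 0 if it must be kept, and
 * -1 if the name does not parse.  (The patch instead skips names without
 * the "map-" prefix and raises an ERROR on a parse failure.)
 */
static int
mapping_is_removable(const char *name, uint64_t cutoff)
{
    unsigned int dboid;
    unsigned int relid;
    unsigned int hi;
    unsigned int lo;
    unsigned int rewrite_xid;
    unsigned int create_xid;
    uint64_t    lsn;

    assert(name != NULL);

    if (sscanf(name, REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return -1;

    /* The LSN is stored as two 32-bit halves separated by '_'. */
    lsn = ((uint64_t) hi) << 32 | lo;

    /* Keep only files at or past a valid cutoff. */
    if (lsn >= cutoff && cutoff != INVALID_LSN)
        return 0;

    return 1;
}
```

An invalid cutoff makes every mapping removable, which matches the "lsn < cutoff || cutoff == InvalidXLogRecPtr" test that the patch moves out of CheckPointLogicalRewriteHeap().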
On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote:
> rebased

another rebase for cfbot
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v13-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From b2c36a6d0d8ca5cde374b1c8b34aafaabbd7f6c2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v13 1/6] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 383 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 483 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks(true);
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If retry is true, the custodian will re-enqueue the currently running task if
+ * an exception is encountered.
+ */
+static void
+DoCustodianTasks(bool retry)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (retry)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * If immediate is true, the task is performed immediately in the current
+ * process, and this function will not return until it completes. This is
+ * mostly useful for single-user mode. If immediate is false, the task is added
+ * to the custodian's queue if it is not already enqueued, and this function
+ * returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, bool immediate, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (immediate)
+ DoCustodianTasks(false);
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c83cc8cc6c..00d18ee761 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..170ca61a21
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
Attachment: v13-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (text/x-diff; charset=us-ascii)
From 020e50cf89b7e87c2073c2684b63c1c4e844e6b3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 19:38:20 -0800
Subject: [PATCH v13 2/6] Also remove pgsql_tmp directories during startup.
Presently, the server only removes the contents of the temporary
directories during startup, not the directory itself. This changes
that to prepare for future commits that will move temporary file
cleanup to a separate auxiliary process.
---
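For readers skimming the diff, the behavioral change described above (remove the contents and then the directory itself) can be sketched outside of PostgreSQL roughly as follows. This is only an illustration under my own assumptions: remove_dir_tree() is a hypothetical name, it uses plain POSIX calls rather than AllocateDir()/ReadDir()/ereport(), and it does not apply the PG_TEMP_FILE_PREFIX check that RemovePgTempDir() performs at the top level.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: remove everything below
 * "path" and then "path" itself.  Returns 0 on success and -1 on the first
 * failure (the patch reports failures with ereport() instead).
 */
static int
remove_dir_tree(const char *path)
{
    DIR        *dir;
    struct dirent *de;
    int         rc = 0;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        char        child[4096];
        struct stat st;
        int         n;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        n = snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
        if (n < 0 || (size_t) n >= sizeof(child))
            rc = -1;            /* path too long for this sketch */
        else if (lstat(child, &st) != 0)
            rc = -1;
        else if (S_ISDIR(st.st_mode))
            rc = remove_dir_tree(child);    /* contents first, then the dir */
        else if (unlink(child) != 0)
            rc = -1;
    }
    closedir(dir);

    /* The step this patch adds: the directory itself goes away, too. */
    if (rc == 0 && rmdir(path) != 0)
        rc = -1;

    return rc;
}
```

Calling remove_dir_tree("some/dir") on a populated directory should leave no trace of it afterwards, which is what the adjusted checks in 022_crash_temp_files.pl look for (pgsql_tmp no longer listed under base).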
src/backend/postmaster/postmaster.c | 2 +-
src/backend/storage/file/fd.c | 20 ++++++++++----------
src/include/storage/fd.h | 4 ++--
src/test/recovery/t/022_crash_temp_files.pl | 6 ++++--
4 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 00d18ee761..bda8b7f532 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1113,7 +1113,7 @@ PostmasterMain(int argc, char *argv[])
* safe to do so now, because we verified earlier that there are no
* conflicting Postgres processes in this data directory.
*/
- RemovePgTempFilesInDir(PG_TEMP_FILES_DIR, true, false);
+ RemovePgTempDir(PG_TEMP_FILES_DIR, true, false);
#endif
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec5..0e1398d07c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3081,7 +3081,7 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
RemovePgTempRelationFiles("base");
/*
@@ -3097,7 +3097,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempFilesInDir(temp_path, true, false);
+ RemovePgTempDir(temp_path, true, false);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3130,7 +3130,7 @@ RemovePgTempFiles(void)
* them separate.)
*/
void
-RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
+RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
{
DIR *temp_dir;
struct dirent *temp_de;
@@ -3162,13 +3162,7 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
- RemovePgTempFilesInDir(rm_path, false, true);
-
- if (rmdir(rm_path) < 0)
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove directory \"%s\": %m",
- rm_path)));
+ RemovePgTempDir(rm_path, false, true);
}
else
{
@@ -3186,6 +3180,12 @@ RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
}
FreeDir(temp_dir);
+
+ if (rmdir(tmpdirname) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not remove directory \"%s\": %m",
+ tmpdirname)));
}
/* Process one tablespace directory, look for per-DB subdirectories */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d..790b9a9a14 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -167,8 +167,8 @@ extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
-extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
- bool unlink_all);
+extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
+ bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 53a55c7a8a..8ed8afeadd 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -152,7 +152,8 @@ $node->poll_query_until('postgres', undef, '');
# Check for temporary files
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'no temporary files');
@@ -268,7 +269,8 @@ $node->restart();
# Check the temporary files -- should be gone
is( $node->safe_psql(
- 'postgres', 'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+ 'postgres',
+ 'SELECT COUNT(1) FROM pg_ls_dir($$base$$) WHERE pg_ls_dir = \'pgsql_tmp\''),
qq(0),
'temporary file was removed');
--
2.25.1
Attachment: v13-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (text/x-diff; charset=us-ascii)
From 38c0101d73a36281a1a5b96c13ee9a39fdd9346e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:16:44 -0800
Subject: [PATCH v13 3/6] Split pgsql_tmp cleanup into two stages.
First, pgsql_tmp directories will be moved to a staging directory
and renamed to prepare them for removal. Then, all files in these
directories are removed before removing the directories themselves.
This change is being made in preparation for a follow-up change to
offload most temporary file cleanup to the new custodian process.
Note that temporary relation files cannot be cleaned up via the
aforementioned strategy and will not be offloaded to the custodian.
This change also modifies several ereport(LOG, ...) calls within
the temporary file cleanup code to ERROR instead. While temporary
file cleanup is typically not urgent enough to prevent startup,
excessive lenience might mask bugs.
---
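To make the first stage described above concrete, here is a rough standalone sketch of "create a staging directory, pick a timestamp-based name, rename the directory into it". It is not the patch's code: stage_dir_for_removal(), its parameters, and the "staged." name prefix are hypothetical, it uses plain POSIX calls instead of MakePGDirectory()/ereport(), and it assumes the source and staging directories live on the same filesystem so that rename() does not need to copy anything.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: move "tmp_dir" into
 * "staging_dir" under a timestamp-suffixed name and report the new path in
 * "staged_path".  Returns 0 on success, -1 on failure.
 */
static int
stage_dir_for_removal(const char *tmp_dir, const char *staging_dir,
                      char *staged_path, size_t staged_len)
{
    struct timeval tv;
    uint64_t    usecs;
    int         n;

    assert(tmp_dir != NULL && staging_dir != NULL && staged_path != NULL);

    /* The staging directory may already exist from an earlier run. */
    if (mkdir(staging_dir, 0700) != 0 && errno != EEXIST)
        return -1;

    /* Microseconds since the epoch serve as a reasonably unique suffix. */
    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    usecs = (uint64_t) tv.tv_sec * 1000000 + (uint64_t) tv.tv_usec;

    n = snprintf(staged_path, staged_len, "%s/staged.%" PRIu64,
                 staging_dir, usecs);
    if (n < 0 || (size_t) n >= staged_len)
        return -1;

    /* One rename moves the directory, however many files it contains. */
    return rename(tmp_dir, staged_path);
}
```

The point of the design is that the caller pays for a single rename() rather than one unlink() per temporary file; the per-file work is left for whoever later walks the staging directory.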
src/backend/postmaster/postmaster.c | 4 +
src/backend/storage/file/fd.c | 215 +++++++++++++++++++++++-----
src/include/storage/fd.h | 1 +
3 files changed, 182 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bda8b7f532..3840da94ce 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1386,6 +1386,7 @@ PostmasterMain(int argc, char *argv[])
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
*/
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
/*
@@ -3920,7 +3921,10 @@ PostmasterStateMachine(void)
/* remove leftover temporary files after a crash */
if (remove_temp_files_after_crash)
+ {
+ StagePgTempFilesForRemoval();
RemovePgTempFiles();
+ }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e1398d07c..9610850d45 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,7 @@
#include <sys/param.h>
#include <sys/resource.h> /* for getrlimit */
#include <sys/stat.h>
+#include <sys/time.h>
#include <sys/types.h>
#ifndef WIN32
#include <sys/mman.h>
@@ -88,6 +89,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_tablespace.h"
+#include "common/int.h"
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/pg_prng.h"
@@ -109,6 +111,8 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+#define PG_TEMP_TO_REMOVE_DIR (PG_TEMP_FILES_DIR "_staged_for_removal")
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -335,6 +339,8 @@ static void BeforeShmemExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isCommit, bool isProcExit);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
+static void StagePgTempDirForRemoval(const char *tmp_dir);
+static void RemoveStagedPgTempDirs(const char *spc_dir);
static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
@@ -3049,29 +3055,24 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
FreeDesc(&allocatedDescs[0]);
}
-
/*
- * Remove temporary and temporary relation files left over from a prior
- * postmaster session
+ * Stage temporary files left over from a prior postmaster session for removal.
*
- * This should be called during postmaster startup. It will forcibly
- * remove any leftover files created by OpenTemporaryFile and any leftover
- * temporary relation files created by mdcreate.
+ * This function also removes any leftover temporary relation files. Unlike
+ * temporary files stored in pgsql_tmp directories, temporary relation files do
+ * not live in their own directory, so there isn't a tremendously beneficial way
+ * to stage them for removal at a later time.
*
- * During post-backend-crash restart cycle, this routine is called when
- * remove_temp_files_after_crash GUC is enabled. Multiple crashes while
- * queries are using temp files could result in useless storage usage that can
- * only be reclaimed by a service restart. The argument against enabling it is
- * that someone might want to examine the temporary files for debugging
- * purposes. This does however mean that OpenTemporaryFile had better allow for
- * collision with an existing temp file name.
+ * RemovePgTempFiles() should be called at some point after this function in
+ * order to remove the staged temporary directories.
*
- * NOTE: this function and its subroutines generally report syscall failures
- * with ereport(LOG) and keep going. Removing temp files is not so critical
- * that we should fail to start the database when we can't do it.
+ * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
+ * DataDir as well. However, that is *not* cleaned here because doing so would
+ * create a race condition. It's done separately, earlier in postmaster
+ * startup.
*/
void
-RemovePgTempFiles(void)
+StagePgTempFilesForRemoval(void)
{
char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
DIR *spc_dir;
@@ -3081,7 +3082,8 @@ RemovePgTempFiles(void)
* First process temp files in pg_default ($PGDATA/base)
*/
snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
+
RemovePgTempRelationFiles("base");
/*
@@ -3089,7 +3091,7 @@ RemovePgTempFiles(void)
*/
spc_dir = AllocateDir("pg_tblspc");
- while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
@@ -3097,7 +3099,7 @@ RemovePgTempFiles(void)
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
- RemovePgTempDir(temp_path, true, false);
+ StagePgTempDirForRemoval(temp_path);
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
@@ -3105,21 +3107,160 @@ RemovePgTempFiles(void)
}
FreeDir(spc_dir);
+}
+
+/*
+ * Remove temporary files that have been previously staged for removal by
+ * StagePgTempFilesForRemoval().
+ */
+void
+RemovePgTempFiles(void)
+{
+ char temp_path[MAXPGPATH + 10 + sizeof(TABLESPACE_VERSION_DIRECTORY) + sizeof(PG_TEMP_FILES_DIR)];
+ DIR *spc_dir;
+ struct dirent *spc_de;
+
+ /*
+ * First process temp files in pg_default ($PGDATA/base)
+ */
+ RemoveStagedPgTempDirs("base");
/*
- * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
- * DataDir as well. However, that is *not* cleaned here because doing so
- * would create a race condition. It's done separately, earlier in
- * postmaster startup.
+ * Cycle through temp directories for all non-default tablespaces.
*/
+ spc_dir = AllocateDir("pg_tblspc");
+
+ while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+ {
+ if (strcmp(spc_de->d_name, ".") == 0 ||
+ strcmp(spc_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
+ spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
+ RemoveStagedPgTempDirs(temp_path);
+ }
+
+ FreeDir(spc_dir);
}
/*
- * Process one pgsql_tmp directory for RemovePgTempFiles.
+ * StagePgTempDirForRemoval
+ *
+ * This function moves the given directory to a staging directory and renames
+ * it in preparation for removal by a later call to RemoveStagedPgTempDirs().
+ * The current timestamp is appended to the end of the new directory name in
+ * case previously staged pgsql_tmp directories have not yet been removed.
+ */
+static void
+StagePgTempDirForRemoval(const char *tmp_dir)
+{
+ struct stat st;
+ char stage_path[MAXPGPATH * 2];
+ char parent_path[MAXPGPATH * 2];
+ char to_remove_path[MAXPGPATH * 2];
+ struct timeval tv;
+ uint64 epoch;
+
+ /*
+ * If tmp_dir doesn't exist, there is nothing to stage.
+ */
+ if (stat(tmp_dir, &st) != 0)
+ {
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", tmp_dir)));
+ return;
+ }
+ else if (!S_ISDIR(st.st_mode))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("\"%s\" is not a directory", tmp_dir)));
+
+ strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
+ get_parent_directory(parent_path);
+
+ /*
+ * get_parent_directory() returns an empty string if the input argument is
+ * just a file name (see comments in path.c), so handle that as being the
+ * current directory.
+ */
+ if (strlen(parent_path) == 0)
+ strlcpy(parent_path, ".", MAXPGPATH * 2);
+
+ /*
+ * Make sure the pgsql_tmp_staged_for_removal directory exists.
+ */
+ snprintf(to_remove_path, sizeof(to_remove_path), "%s/%s", parent_path,
+ PG_TEMP_TO_REMOVE_DIR);
+ if (MakePGDirectory(to_remove_path) != 0 && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ to_remove_path)));
+
+ /*
+ * Pick a sufficiently unique name for the stage directory. We just append
+ * the current timestamp to the end of the name.
+ */
+ gettimeofday(&tv, NULL);
+ if (pg_mul_u64_overflow((uint64) 1000000, (uint64) tv.tv_sec, &epoch) ||
+ pg_add_u64_overflow(epoch, (uint64) tv.tv_usec, &epoch))
+ elog(ERROR, "could not stage temporary file directory for removal");
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s." UINT64_FORMAT,
+ to_remove_path, PG_TEMP_FILES_DIR, epoch);
+
+ /*
+ * Rename the temporary directory.
+ */
+ if (rename(tmp_dir, stage_path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename directory \"%s\" to \"%s\": %m",
+ tmp_dir, stage_path)));
+}
+
+/*
+ * RemoveStagedPgTempDirs
+ *
+ * This function removes all pgsql_tmp directories that have been staged for
+ * removal by StagePgTempDirForRemoval() in the given tablespace directory.
+ */
+static void
+RemoveStagedPgTempDirs(const char *spc_dir)
+{
+ char stage_path[MAXPGPATH * 2];
+ char temp_path[MAXPGPATH * 2];
+ DIR *dir;
+ struct dirent *de;
+
+ snprintf(stage_path, sizeof(stage_path), "%s/%s", spc_dir,
+ PG_TEMP_TO_REMOVE_DIR);
+
+ dir = AllocateDir(stage_path);
+ if (dir == NULL && errno == ENOENT)
+ return;
+
+ while ((de = ReadDir(dir, stage_path)) != NULL)
+ {
+ if (strncmp(de->d_name, PG_TEMP_FILES_DIR,
+ strlen(PG_TEMP_FILES_DIR)) != 0)
+ continue;
+
+ snprintf(temp_path, sizeof(temp_path), "%s/%s", stage_path, de->d_name);
+ RemovePgTempDir(temp_path, true, false);
+ }
+ FreeDir(dir);
+}
+
+/*
+ * Process one pgsql_tmp directory for RemoveStagedPgTempDirs.
*
* If missing_ok is true, it's all right for the named directory to not exist.
- * Any other problem results in a LOG message. (missing_ok should be true at
- * the top level, since pgsql_tmp directories are not created until needed.)
+ * Any other problem results in an ERROR. (missing_ok should be true at the
+ * top level, since pgsql_tmp directories are not created until needed.)
*
* At the top level, this should be called with unlink_all = false, so that
* only files matching the temporary name prefix will be unlinked. When
@@ -3141,7 +3282,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
if (temp_dir == NULL && errno == ENOENT && missing_ok)
return;
- while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
+ while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
{
if (strcmp(temp_de->d_name, ".") == 0 ||
strcmp(temp_de->d_name, "..") == 0)
@@ -3155,11 +3296,9 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
PG_TEMP_FILE_PREFIX,
strlen(PG_TEMP_FILE_PREFIX)) == 0)
{
- PGFileType type = get_dirent_type(rm_path, temp_de, false, LOG);
+ PGFileType type = get_dirent_type(rm_path, temp_de, false, ERROR);
- if (type == PGFILETYPE_ERROR)
- continue;
- else if (type == PGFILETYPE_DIR)
+ if (type == PGFILETYPE_DIR)
{
/* recursively remove contents, then directory itself */
RemovePgTempDir(rm_path, false, true);
@@ -3167,14 +3306,14 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
else
{
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
}
}
else
- ereport(LOG,
+ ereport(ERROR,
(errmsg("unexpected file found in temporary-files directory: \"%s\"",
rm_path)));
}
@@ -3182,7 +3321,7 @@ RemovePgTempDir(const char *tmpdirname, bool missing_ok, bool unlink_all)
FreeDir(temp_dir);
if (rmdir(tmpdirname) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove directory \"%s\": %m",
tmpdirname)));
@@ -3198,7 +3337,7 @@ RemovePgTempRelationFiles(const char *tsdirname)
ts_dir = AllocateDir(tsdirname);
- while ((de = ReadDirExtended(ts_dir, tsdirname, LOG)) != NULL)
+ while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
/*
* We're only interested in the per-database directories, which have
@@ -3226,7 +3365,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDirExtended(dbspace_dir, dbspacedirname, LOG)) != NULL)
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
if (!looks_like_temp_rel_name(de->d_name))
continue;
@@ -3235,7 +3374,7 @@ RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
- ereport(LOG,
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
rm_path)));
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 790b9a9a14..cf27e90aea 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -166,6 +166,7 @@ extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void StagePgTempFilesForRemoval(void);
extern void RemovePgTempFiles(void);
extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
bool unlink_all);
--
2.25.1
Attachment: v13-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (text/x-diff; charset=us-ascii)
From 01f01a2304e8bee652c429a601fa25783bfe967f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 21:42:52 -0800
Subject: [PATCH v13 4/6] Move pgsql_tmp file removal to custodian process.
With this change, startup (and restart after a crash) simply
renames the pgsql_tmp directories, and the custodian process
actually removes all the files in the staged directories as well as
the staged directories themselves. This should help avoid long
startup delays due to many leftover temporary files.
---
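The registration pattern this patch relies on (an enum value per task plus a function table terminated by an INVALID_* sentinel) can be sketched in isolation as below. Everything here is hypothetical and simplified: the Demo* names are mine, the table has two fields rather than the three used by cust_task_functions (the role of the third is not visible in this part of the series), and the task function only bumps a counter instead of removing files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the CustodianTask enum. */
typedef enum DemoTask
{
    DEMO_REMOVE_TEMP_FILES,

    NUM_DEMO_TASKS,             /* new tasks go above */
    INVALID_DEMO_TASK
} DemoTask;

typedef void (*demo_task_fn) (void);

struct demo_task_entry
{
    DemoTask    task;
    demo_task_fn fn;
};

static int  demo_runs = 0;

/* Stand-in for the real work; it only counts invocations. */
static void
demo_remove_temp_files(void)
{
    demo_runs++;
}

static const struct demo_task_entry demo_tasks[] = {
    {DEMO_REMOVE_TEMP_FILES, demo_remove_temp_files},
    {INVALID_DEMO_TASK, NULL}   /* must be last */
};

/* Walk the table until the sentinel; NULL means "no such task". */
static demo_task_fn
lookup_demo_task(DemoTask task)
{
    const struct demo_task_entry *entry;

    assert(task <= INVALID_DEMO_TASK);

    for (entry = demo_tasks; entry->task != INVALID_DEMO_TASK; entry++)
    {
        if (entry->task == task)
            return entry->fn;
    }
    return NULL;
}

/* Run the function registered for "task"; -1 if nothing is registered. */
static int
run_demo_task(DemoTask task)
{
    demo_task_fn fn = lookup_demo_task(task);

    if (fn == NULL)
        return -1;
    fn();
    return 0;
}
```

With a table like this, adding a task means adding one enum value and one table row, which matches the reminder in custodian.h to keep the two in sync.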
src/backend/postmaster/custodian.c | 1 +
src/backend/postmaster/postmaster.c | 24 +++++++++++++++++++-----
src/backend/storage/file/fd.c | 13 +++++++------
src/include/postmaster/custodian.h | 2 +-
4 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e90f5d0d1f..fe1f48844e 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -70,6 +70,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3840da94ce..86000935fd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -109,6 +109,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/custodian.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/pgarch.h"
@@ -1385,9 +1386,12 @@ PostmasterMain(int argc, char *argv[])
/*
* Remove old temporary files. At this point there can be no other
* Postgres processes running in this directory, so this should be safe.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion. The
+ * custodian process is responsible for actually removing the files.
*/
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
/*
* Initialize the autovacuum subsystem (again, no process start yet)
@@ -3919,12 +3923,14 @@ PostmasterStateMachine(void)
ereport(LOG,
(errmsg("all server processes terminated; reinitializing")));
- /* remove leftover temporary files after a crash */
+ /*
+ * Remove leftover temporary files after a crash.
+ *
+ * Note that this just stages the pgsql_tmp directories for deletion.
+ * The custodian process is responsible for actually removing the files.
+ */
if (remove_temp_files_after_crash)
- {
StagePgTempFilesForRemoval();
- RemovePgTempFiles();
- }
/* allow background workers to immediately restart */
ResetBackgroundWorkerCrashTimes();
@@ -3937,6 +3943,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ /*
+ * Now that shared memory is initialized, notify the custodian to clean
+ * up the staged pgsql_tmp directories. We do this even if
+ * remove_temp_files_after_crash is false so that any previously staged
+ * directories are eventually cleaned up.
+ */
+ RequestCustodian(CUSTODIAN_REMOVE_TEMP_FILES, false, (Datum) 0);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9610850d45..625355c56a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -96,6 +96,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "portability/mem.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -1564,9 +1565,9 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
*
* Directories created within the top-level temporary directory should begin
* with PG_TEMP_FILE_PREFIX, so that they can be identified as temporary and
- * deleted at startup by RemovePgTempFiles(). Further subdirectories below
- * that do not need any particular prefix.
-*/
+ * deleted by RemovePgTempFiles(). Further subdirectories below that do not
+ * need any particular prefix.
+ */
void
PathNameCreateTemporaryDir(const char *basedir, const char *directory)
{
@@ -1764,9 +1765,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
*
* If the file is inside the top-level temporary directory, its name should
* begin with PG_TEMP_FILE_PREFIX so that it can be identified as temporary
- * and deleted at startup by RemovePgTempFiles(). Alternatively, it can be
- * inside a directory created with PathNameCreateTemporaryDir(), in which case
- * the prefix isn't needed.
+ * and deleted by RemovePgTempFiles(). Alternatively, it can be inside a
+ * directory created with PathNameCreateTemporaryDir(), in which case the prefix
+ * isn't needed.
*/
File
PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 170ca61a21..80890ceadd 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_TEMP_FILES,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
--
2.25.1
Attachment: v13-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From 674f54c720bfd7d56de04370e5a015fe929c6b65 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v13 5/6] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 8 ++++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 1 +
src/include/replication/snapbuild.h | 2 +-
5 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..4991c10f86 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,14 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ !IsUnderPostmaster,
+ (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index fe1f48844e..855a756ca0 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -71,6 +72,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a1fd1d92d6..f957b9aa49 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2037,14 +2037,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 80890ceadd..37334941cc 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -19,6 +19,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
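The cust_task_functions table touched by this patch ends with an INVALID_CUSTODIAN_TASK entry marked "must be last", which suggests a lookup that walks the array until it reaches that sentinel. The following is a minimal standalone sketch of that pattern, not the patch's code: every demo_* name is hypothetical, no PostgreSQL headers are used, and a miss returns NULL here whereas LookupCustodianFunctions() raises an ERROR.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CustodianTask; all demo_* names are hypothetical. */
typedef enum
{
    DEMO_REMOVE_TEMP_FILES,
    DEMO_REMOVE_SNAPSHOTS,
    DEMO_NUM_TASKS,             /* new tasks go above */
    DEMO_INVALID_TASK
} demo_task;

typedef void (*demo_task_func) (void);

struct demo_entry
{
    demo_task   task;
    demo_task_func func;        /* performs the task */
};

/* Per-task call counters so the effect of each function is observable. */
static int  demo_calls[DEMO_NUM_TASKS];

static void demo_remove_temp_files(void) { demo_calls[DEMO_REMOVE_TEMP_FILES]++; }
static void demo_remove_snapshots(void) { demo_calls[DEMO_REMOVE_SNAPSHOTS]++; }

static const struct demo_entry demo_table[] = {
    {DEMO_REMOVE_TEMP_FILES, demo_remove_temp_files},
    {DEMO_REMOVE_SNAPSHOTS, demo_remove_snapshots},
    {DEMO_INVALID_TASK, NULL}   /* sentinel, must be last */
};

/* Walk the table until the sentinel; return NULL if the task is unknown. */
static const struct demo_entry *
demo_lookup(demo_task task)
{
    const struct demo_entry *e;

    /* callers are not expected to ask for the sentinel itself */
    assert(task != DEMO_INVALID_TASK);

    for (e = demo_table; e->task != DEMO_INVALID_TASK; e++)
    {
        if (e->task == task)
            return e;
    }
    return NULL;
}
```

With this layout, adding a task means one new enum value above DEMO_NUM_TASKS and one new table row above the sentinel, which is the same two-step change this patch makes for CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS.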
v13-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 1490ce000e87bdca7edcb9e3e952a04ffdea8335 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v13 6/6] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 80 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 ++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 118 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..07976504cc 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,11 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
+ !IsUnderPostmaster,
+ LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1247,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1285,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 855a756ca0..d4be19e5de 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(bool retry);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -73,6 +78,7 @@ struct cust_task_funcs_entry
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_TEMP_FILES, RemovePgTempFiles, NULL},
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -384,3 +390,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 37334941cc..f177d55159 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -20,6 +22,7 @@ typedef enum CustodianTask
{
CUSTODIAN_REMOVE_TEMP_FILES,
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -29,5 +32,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, bool immediate, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
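The removal test in RemoveOldLogicalRewriteMappings() above depends on recovering an LSN from each file name via LOGICAL_REWRITE_FORMAT and comparing it with the saved cutoff. A small standalone sketch of that arithmetic follows. It assumes plain C types in place of Oid, TransactionId and XLogRecPtr, treats 0 as the invalid LSN, and uses hypothetical demo_* names; it is an illustration of the logic, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same pattern as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define DEMO_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Assumption for this sketch: 0 plays the role of InvalidXLogRecPtr. */
#define DEMO_INVALID_LSN ((uint64_t) 0)

/*
 * Extract the LSN encoded in a mapping file name.  Returns false when the
 * name does not match the expected pattern.
 */
static bool
demo_parse_mapping_lsn(const char *name, uint64_t *lsn)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;

    assert(name != NULL && lsn != NULL);

    if (sscanf(name, DEMO_REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* the two 32-bit halves are combined exactly as in the patch */
    *lsn = ((uint64_t) hi) << 32 | lo;
    return true;
}

/*
 * Mirror of the patch's condition: a file is kept only when a valid cutoff
 * exists and the file's LSN is at or beyond it; everything else is removed.
 */
static bool
demo_should_remove(uint64_t lsn, uint64_t cutoff)
{
    return !(lsn >= cutoff && cutoff != DEMO_INVALID_LSN);
}
```

For a made-up file name such as map-4000-4eb-1_60DE1E08-5c5-5c6, the high half is 0x1 and the low half is 0x60DE1E08, so the decoded LSN is 0x160DE1E08. Note that an invalid cutoff makes every mapping file removable, which matches the behavior of the code this patch moves out of CheckPointLogicalRewriteHeap().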
On Thu, 24 Nov 2022 at 00:19, Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote:
> > rebased
>
> another rebase for cfbot
0001 seems good to me
* I like that it sleeps forever until requested
* not sure I believe that everything it does can always be aborted out
of and shut down - to achieve that you will need
CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
* not sure why you want immediate execution of custodian tasks - I
feel supporting two modes will be a lot harder. For me, I would run
locally when !IsUnderPostmaster and also in an Assert build, so we can
test it works right - i.e. running in its own process is just a
production optimization for performance (which is the stated reason
for having this)
0005 seems good from what I know
* There is no check to see if it worked in any sane time
* It seems possible that "Old" might change meaning - will that make
it break/fail?
0006 seems good also
* same comments as for 5
Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta
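A minimal sketch of the single-macro idea described above, outside of PostgreSQL. The level constants, the threshold variable and demo_elog() are all hypothetical stand-ins (their numeric values are arbitrary and only their ordering matters here); the point is only that every custodian message refers to one macro, so its verbosity can be changed in one place.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for two message levels; higher is more important. */
#define DEMO_DEBUG1 1
#define DEMO_LOG    2

/* The one line to change later, e.g. to DEMO_DEBUG1. */
#define DEMO_CUSTODIAN_LOG_LEVEL DEMO_LOG

/* Hypothetical threshold below which messages are dropped. */
static int  demo_min_level = DEMO_LOG;

/* Returns 1 if the message was emitted, 0 if it was filtered out. */
static int
demo_elog(int level, const char *msg)
{
    assert(msg != NULL);

    if (level < demo_min_level)
        return 0;
    fprintf(stderr, "custodian: %s\n", msg);
    return 1;
}
```

Call sites would then pass DEMO_CUSTODIAN_LOG_LEVEL instead of a hard-coded level, so messages are visible while the macro is set to the higher level and disappear at the default threshold once it is lowered.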
I can't really comment with knowledge on sub-patches 0002 to 0004.
Perhaps you should aim to get 1, 5, 6 committed first and then return
to the others in a later CF/separate thread?
--
Simon Riggs http://www.EnterpriseDB.com/
Thanks for taking a look!
On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
> * not sure I believe that everything it does can always be aborted out
> of and shut down - to achieve that you will need
> CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
I did something like this earlier, but was advised to simply let the
functions finish as usual during shutdown [0]. I think this is what the
checkpointer process does today, anyway.
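For readers weighing the two options, here is a standalone sketch of the alternative that was suggested in review, namely checking for a pending shutdown request at the top of each loop iteration. It is not what the posted patches do (they let the task run to completion), and it does not use CHECK_FOR_INTERRUPTS() itself: the flag and the demo_* names are hypothetical, and the "signal" is simulated inside the loop so the behavior is deterministic.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag that a signal handler would set on a shutdown request. */
static volatile sig_atomic_t demo_shutdown_requested = 0;

/*
 * Handle up to nitems entries and return how many were processed.  To keep
 * the sketch deterministic, the shutdown request is simulated: it "arrives"
 * right after request_shutdown_at entries have been handled (0 = never).
 */
static size_t
demo_cleanup_loop(size_t nitems, size_t request_shutdown_at)
{
    size_t      done = 0;

    assert(request_shutdown_at <= nitems);

    for (size_t i = 0; i < nitems; i++)
    {
        /* bail out between entries, never in the middle of one */
        if (demo_shutdown_requested)
            break;

        /* per-entry work (e.g. one unlink) would go here */
        done++;

        if (done == request_shutdown_at)
            demo_shutdown_requested = 1;
    }
    return done;
}
```

The trade-off under discussion is visible here: with the check, shutdown latency is bounded by one entry's worth of work, whereas without it the loop always processes all nitems entries before returning.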
> * not sure why you want immediate execution of custodian tasks - I
> feel supporting two modes will be a lot harder. For me, I would run
> locally when !IsUnderPostmaster and also in an Assert build, so we can
> test it works right - i.e. running in its own process is just a
> production optimization for performance (which is the stated reason
> for having this)
I added this because 0004 involves requesting a task from the postmaster,
so checking for IsUnderPostmaster doesn't work. Those tasks would always
run immediately. However, we could use IsPostmasterEnvironment instead,
which would allow us to remove the "immediate" argument. I did it this way
in v14.
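As a rough illustration of the v14 behavior described above, the request path can decide by itself whether to run a task inline or hand it off, so no "immediate" argument is needed. This is a sketch under assumptions: the boolean below merely stands in for IsPostmasterEnvironment, the counters replace the real task queue and latch, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for IsPostmasterEnvironment; false models single-user mode. */
static bool demo_postmaster_environment = false;

static int  demo_ran_inline = 0;    /* tasks executed by the requester */
static int  demo_enqueued = 0;      /* tasks handed off for later */

static void
demo_task(void)
{
    demo_ran_inline++;
}

/*
 * Request a task.  Without a postmaster there is no separate process to
 * hand the work to, so it runs synchronously; otherwise it is only queued.
 */
static void
demo_request(void (*task) (void))
{
    assert(task != NULL);

    if (!demo_postmaster_environment)
    {
        task();
        return;
    }

    /* the real code would enqueue the task and wake the custodian here */
    demo_enqueued++;
}
```

Because the check looks at the environment rather than at whether the caller is a child of the postmaster, a request made from the postmaster itself would take the queued path in this sketch.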
I'm not sure about running locally in Assert builds. It's true that would
help ensure there's test coverage for the task logic, but it would also
reduce coverage for the custodian logic. And in general, I'm worried about
having Assert builds use a different code path than production builds.
> 0005 seems good from what I know
> * There is no check to see if it worked in any sane time
What did you have in mind? Should the custodian begin emitting WARNINGs
after a while?
> * It seems possible that "Old" might change meaning - will that make
> it break/fail?
I don't believe so.
> Rather than explicitly use DEBUG1 everywhere I would have an
> #define CUSTODIAN_LOG_LEVEL LOG
> so we can run with it in LOG mode and then set it to DEBUG1 with a one
> line change in a later phase of Beta
I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before. Is the idea just to help with debugging during
the development phase?
> I can't really comment with knowledge on sub-patches 0002 to 0004.
> Perhaps you should aim to get 1, 5, 6 committed first and then return
> to the others in a later CF/separate thread?
That seems like a good idea since those are all relatively self-contained.
I removed 0002-0004 in v14.
[0]: /messages/by-id/20220217065938.x2esfdppzypegn5j@alap3.anarazel.de
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 91f87579f81c9b7cae5d48a118368ba6a69f4dc8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v14 3/3] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..ff4cd8cef9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index d0fd955d4b..c4d0a22451 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -382,3 +388,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
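The cutoff handoff in this patch is a single shared value guarded by a spinlock: the requester overwrites it, and the custodian reads whatever is current when it runs. Below is a standalone sketch of that pattern. It assumes a C11 atomic_flag as a stand-in for slock_t and a static struct in place of shared memory; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the relevant part of CustodianShmemStruct. */
typedef struct
{
    atomic_flag lck;            /* plays the role of cust_lck */
    uint64_t    cutoff;         /* mappings older than this can be removed */
} demo_shmem;

static demo_shmem demo_state = {ATOMIC_FLAG_INIT, 0};

static void
demo_lock(atomic_flag *l)
{
    /* spin until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void
demo_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Requester side: overwrite the saved cutoff. */
static void
demo_set_cutoff(uint64_t cutoff)
{
    demo_lock(&demo_state.lck);
    demo_state.cutoff = cutoff;
    demo_unlock(&demo_state.lck);
}

/* Custodian side: read the most recently saved cutoff. */
static uint64_t
demo_get_cutoff(void)
{
    uint64_t    cutoff;

    demo_lock(&demo_state.lck);
    cutoff = demo_state.cutoff;
    demo_unlock(&demo_state.lck);

    return cutoff;
}
```

As the comment on CustodianSetLogicalRewriteCutoff() notes, a second update that lands before the first one has been consumed is harmless: the reader simply sees the newer value on its next run.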
v14-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 443c3f842785554476b1a353bcb1af13f426116b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v14 1/3] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 382 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 482 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..a94381bc21
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a8a246921f..6a74423172 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
v14-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From bf26ec2eb0b1a26ce98cd68717ea6f6491b81493 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v14 2/3] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
5 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..c153c32a77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a94381bc21..d0fd955d4b 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a1fd1d92d6..f957b9aa49 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2037,14 +2037,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
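For readers skimming the patches, the queue behaviour of CustodianEnqueueTask() and CustodianGetNextTask() in 0001 can be modelled outside the server roughly as follows. This is a standalone sketch, not the patch itself: the spinlock and shared memory are left out, and DemoTask, demo_queue_init(), demo_enqueue(), and demo_get_next() are names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task IDs standing in for CustodianTask. */
typedef enum DemoTask
{
	DEMO_TASK_A,
	DEMO_TASK_B,
	NUM_DEMO_TASKS,				/* new tasks go above */
	INVALID_DEMO_TASK
} DemoTask;

/* One slot per task type is enough because duplicates are never stored. */
static DemoTask demo_queue[NUM_DEMO_TASKS];
static int	demo_head = 0;

/* Mark every slot empty, as CustodianShmemInit() does. */
void
demo_queue_init(void)
{
	for (int i = 0; i < NUM_DEMO_TASKS; i++)
		demo_queue[i] = INVALID_DEMO_TASK;
	demo_head = 0;
}

/*
 * Scan from the head for the first slot that is empty or already holds this
 * task.  Returns false only if no such slot exists, which should not happen.
 */
bool
demo_enqueue(DemoTask task)
{
	assert(task >= 0 && task < NUM_DEMO_TASKS);

	for (int i = 0; i < NUM_DEMO_TASKS; i++)
	{
		int			idx = (demo_head + i) % NUM_DEMO_TASKS;

		if (demo_queue[idx] == INVALID_DEMO_TASK || demo_queue[idx] == task)
		{
			demo_queue[idx] = task;
			return true;
		}
	}
	return false;
}

/* Pop the head slot; an empty queue yields INVALID_DEMO_TASK. */
DemoTask
demo_get_next(void)
{
	DemoTask	next = demo_queue[demo_head];

	demo_queue[demo_head] = INVALID_DEMO_TASK;
	demo_head = (demo_head + 1) % NUM_DEMO_TASKS;
	return next;
}
```

In this model, requesting a task that is already queued leaves the queue unchanged, and a task re-enqueued after being popped lands behind whatever is still waiting.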
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
Thanks for taking a look!
On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
* not sure I believe that everything it does can always be aborted out
of and shutdown - to achieve that you will need a
CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
I did something like this earlier, but was advised to simply let the
functions finish as usual during shutdown [0]. I think this is what the
checkpointer process does today, anyway.
If we say "The custodian is not an essential process and can shutdown
quickly when requested.", and yet we know it's not true in all cases,
then that will lead to misunderstandings and bugs.
If we perform a restart and the custodian is performing extra work
that delays shutdown, then it also delays restart. Given the title of
the thread, we should be looking to improve that, or at least know it
occurred.
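One common way to keep a long cleanup loop responsive to shutdown is to test a flag on every iteration. The sketch below is a standalone model of that idea and is not taken from any patch in this thread; demo_shutdown_requested, demo_signal_at, demo_remove_entry(), and demo_cleanup_loop() are hypothetical names, and the "signal" is simulated rather than delivered by a real handler.

```c
#include <signal.h>
#include <stddef.h>

/* Stand-in for a flag that a SIGTERM handler would set. */
volatile sig_atomic_t demo_shutdown_requested = 0;

/* Demo only: pretend a shutdown request arrives while this entry is handled. */
size_t		demo_signal_at = (size_t) -1;

static void
demo_remove_entry(size_t idx)
{
	/* Real code would remove one file here. */
	if (idx == demo_signal_at)
		demo_shutdown_requested = 1;
}

/*
 * Handle up to nitems entries, stopping early once shutdown has been
 * requested.  Returns the number of entries actually handled, so the caller
 * can tell whether the work finished.
 */
size_t
demo_cleanup_loop(size_t nitems)
{
	size_t		done = 0;

	for (size_t i = 0; i < nitems; i++)
	{
		if (demo_shutdown_requested)
			break;
		demo_remove_entry(i);
		done++;
	}
	return done;
}
```

Returning the count of handled entries is one way to "at least know it occurred": the caller can compare it with nitems and report unfinished work.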
* not sure why you want immediate execution of custodian tasks - I
feel supporting two modes will be a lot harder. For me, I would run
locally when !IsUnderPostmaster and also in an Assert build, so we can
test it works right - i.e. running in its own process is just a
production optimization for performance (which is the stated reason
for having this)
I added this because 0004 involves requesting a task from the postmaster,
so checking for IsUnderPostmaster doesn't work. Those tasks would always
run immediately. However, we could use IsPostmasterEnvironment instead,
which would allow us to remove the "immediate" argument. I did it this way
in v14.
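To spell out the distinction being relied on: IsUnderPostmaster is true only in children of the postmaster, whereas IsPostmasterEnvironment is true in the postmaster and all of its children and false only in standalone mode. The following standalone sketch models that with plain functions; ProcKind, model_is_under_postmaster(), model_is_postmaster_environment(), and should_run_inline() are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical process kinds, for illustration only. */
typedef enum ProcKind
{
	PROC_STANDALONE_BACKEND,	/* single-user mode, no postmaster */
	PROC_POSTMASTER,			/* the postmaster itself */
	PROC_POSTMASTER_CHILD		/* regular backend or auxiliary process */
} ProcKind;

/* Models IsUnderPostmaster: true only in children of the postmaster. */
bool
model_is_under_postmaster(ProcKind kind)
{
	return kind == PROC_POSTMASTER_CHILD;
}

/* Models IsPostmasterEnvironment: true in the postmaster and its children. */
bool
model_is_postmaster_environment(ProcKind kind)
{
	return kind != PROC_STANDALONE_BACKEND;
}

/*
 * Decide whether a requested task should run immediately in the caller,
 * mirroring the !IsPostmasterEnvironment test in RequestCustodian().
 */
bool
should_run_inline(ProcKind kind)
{
	return !model_is_postmaster_environment(kind);
}
```

With a test on IsUnderPostmaster instead, a request made by the postmaster itself would be treated like one from a standalone backend and run inline, which is the problem described above.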
Thanks
0005 seems good from what I know
* There is no check to see if it worked in any sane time
What did you have in mind? Should the custodian begin emitting WARNINGs
after a while?
I think it might be useful if it logged anything that took an
"extended period", TBD.
Maybe that is already covered by startup process logging. Please tell
me that still works?
Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta
I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before.
Much of recovery is coded that way, for the same reason.
Is the idea just to help with debugging during
the development phase?
"Just", yes. Tests would be desirable also, under src/test/modules.
--
Simon Riggs http://www.EnterpriseDB.com/
On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta
I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before.
Much of recovery is coded that way, for the same reason.
I think that's not a good thing to copy without a lot more justification than
"some old code also does it that way". It's sometimes justified, but also
makes code harder to read (one doesn't know what it does without looking up
the #define, line length).
On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta
I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before.
Much of recovery is coded that way, for the same reason.
I think that's not a good thing to copy without a lot more justification than
"some old code also does it that way". It's sometimes justified, but also
makes code harder to read (one doesn't know what it does without looking up
the #define, line length).
Yeah. If people need some of the log messages at a higher level during
development, they can patch their own copies.
I think there might be some argument for having a facility that lets
you pick subsystems or even individual messages that you want to trace
and pump up the log level for just those call sites. But I don't know
exactly what that would look like, and I don't think inventing one-off
mechanisms for particular cases is a good idea.
--
Robert Haas
EDB: http://www.enterprisedb.com
Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks
from delaying shutdown.
I haven't added any logging for long-running tasks yet. Tasks might
ordinarily take a while, so such logs wouldn't necessarily indicate
something is wrong. Perhaps we could add a GUC for the amount of time to
wait before logging. This feature would be off by default. Another option
could be to create a log_custodian GUC that causes tasks to be logged when
completed, similar to log_checkpoints. Thoughts?
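As a rough illustration of the first option (a time threshold after which a task is reported), the decision could look something like the sketch below. Nothing here exists in the patch set: demo_log_min_duration_ms is a hypothetical setting where -1 means "off", matching the proposal that the feature be disabled by default, and the message format is invented.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical setting: -1 disables logging (the proposed default). */
int			demo_log_min_duration_ms = -1;

/* Report a task only if logging is enabled and the threshold was reached. */
bool
demo_should_log_task(long elapsed_ms)
{
	return demo_log_min_duration_ms >= 0 &&
		elapsed_ms >= demo_log_min_duration_ms;
}

/* Emit a one-line report for a finished task when the threshold applies. */
void
demo_report_task(const char *task_name, long elapsed_ms)
{
	if (demo_should_log_task(elapsed_ms))
		fprintf(stderr, "LOG:  custodian task \"%s\" took %ld ms\n",
				task_name, elapsed_ms);
}
```

Setting the threshold to 0 would report every completed task, which is close to the log_custodian alternative.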
On Mon, Nov 28, 2022 at 01:37:01PM -0500, Robert Haas wrote:
On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta
I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before.
Much of recovery is coded that way, for the same reason.
I think that's not a good thing to copy without a lot more justification than
"some old code also does it that way". It's sometimes justified, but also
makes code harder to read (one doesn't know what it does without looking up
the #define, line length).
Yeah. If people need some of the log messages at a higher level during
development, they can patch their own copies.
I think there might be some argument for having a facility that lets
you pick subsystems or even individual messages that you want to trace
and pump up the log level for just those call sites. But I don't know
exactly what that would look like, and I don't think inventing one-off
mechanisms for particular cases is a good idea.
Given this discussion, I haven't made any changes to the logging in the new
patch set.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 7fa5c047781dddedb1f9c5a4e96622a23c0c0835 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v15 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 382 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 482 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..a94381bc21
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a8a246921f..6a74423172 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
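The task queue added in custodian.c above is a fixed-size ring in which each task occupies at most one slot, so it can never hold more than NUM_CUSTODIAN_TASKS entries. The following is a minimal single-process sketch of that idea only. It omits the spinlock and shared memory, and every name in it (SketchTask, SketchQueue, sketch_enqueue, and so on) is hypothetical and not part of the patch.

```c
#include <assert.h>

/* Hypothetical task IDs; the two trailing sentinels mirror the patch's enum layout. */
typedef enum SketchTask
{
	SKETCH_TASK_A,
	SKETCH_TASK_B,
	SKETCH_TASK_C,
	NUM_SKETCH_TASKS,			/* new tasks go above */
	INVALID_SKETCH_TASK
} SketchTask;

typedef struct SketchQueue
{
	SketchTask	elems[NUM_SKETCH_TASKS];
	int			head;			/* index of the next slot to dequeue */
} SketchQueue;

/* Mark every slot empty and start at slot 0. */
static void
sketch_queue_init(SketchQueue *q)
{
	q->head = 0;
	for (int i = 0; i < NUM_SKETCH_TASKS; i++)
		q->elems[i] = INVALID_SKETCH_TASK;
}

/*
 * Scan from the head.  The first slot that is empty or already holds this
 * task receives it, so a task is never queued twice.  Returns 1 if the task
 * is queued afterwards, 0 if no slot was found.
 */
static int
sketch_enqueue(SketchQueue *q, SketchTask task)
{
	assert(task >= 0 && task < NUM_SKETCH_TASKS);

	for (int i = 0; i < NUM_SKETCH_TASKS; i++)
	{
		int			idx = (q->head + i) % NUM_SKETCH_TASKS;

		if (q->elems[idx] == INVALID_SKETCH_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return 1;
		}
	}
	return 0;
}

/*
 * Take whatever is in the head slot (possibly INVALID_SKETCH_TASK), clear
 * the slot, and advance the head with wraparound.
 */
static SketchTask
sketch_dequeue(SketchQueue *q)
{
	SketchTask	next = q->elems[q->head];

	q->elems[q->head] = INVALID_SKETCH_TASK;
	q->head = (q->head + 1) % NUM_SKETCH_TASKS;
	return next;
}
```

Enqueueing B, then A, then B again leaves two entries; dequeueing then yields B, A, and finally INVALID_SKETCH_TASK. Because a duplicate request collapses into the existing slot, a requester that calls in a loop cannot overflow the queue.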
Attachment: v15-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From 3a433ba60d7c9da685b117dd4d51dbf189760687 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v15 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
5 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..c153c32a77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a94381bc21..d0fd955d4b 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a1fd1d92d6..f957b9aa49 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2037,14 +2037,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
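Registering a task, as the patch above does for snapshot removal, amounts to adding one row to a sentinel-terminated table of function pointers. The sketch below illustrates that lookup-and-dispatch pattern in isolation, following the standalone-backend path where the task runs immediately in the caller. It is an assumption-laden illustration: all identifiers (DemoTask, demo_table, demo_request, and the callbacks) are hypothetical, and an unsigned long stands in for the Datum argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task IDs, laid out like the patch's enum with two sentinels. */
typedef enum DemoTask
{
	DEMO_TASK_PLAIN,			/* task without an argument handler */
	DEMO_TASK_WITH_ARG,			/* task that records extra information */
	NUM_DEMO_TASKS,
	INVALID_DEMO_TASK
} DemoTask;

typedef void (*DemoTaskFunction) (void);
typedef void (*DemoTaskHandleArg) (unsigned long arg);

typedef struct DemoEntry
{
	DemoTask	task;
	DemoTaskFunction task_func;			/* performs the task */
	DemoTaskHandleArg handle_arg_func;	/* optional; may be NULL */
} DemoEntry;

/* Observable state so the effect of each callback can be checked. */
static int	demo_plain_runs = 0;
static int	demo_with_arg_runs = 0;
static unsigned long demo_saved_arg = 0;

static void demo_plain(void) { demo_plain_runs++; }
static void demo_with_arg(void) { demo_with_arg_runs++; }
static void demo_save_arg(unsigned long arg) { demo_saved_arg = arg; }

static const DemoEntry demo_table[] = {
	{DEMO_TASK_PLAIN, demo_plain, NULL},
	{DEMO_TASK_WITH_ARG, demo_with_arg, demo_save_arg},
	{INVALID_DEMO_TASK, NULL, NULL}		/* must be last */
};

/* Walk the table until the sentinel; NULL means the task is not registered. */
static const DemoEntry *
demo_lookup(DemoTask task)
{
	for (const DemoEntry *e = demo_table; e->task != INVALID_DEMO_TASK; e++)
	{
		if (e->task == task)
			return e;
	}
	return NULL;
}

/*
 * Handle the argument first, then run the task right away.  Returns 1 on
 * success and 0 if the task has no table entry.
 */
static int
demo_request(DemoTask task, unsigned long arg)
{
	const DemoEntry *e = demo_lookup(task);

	if (e == NULL)
		return 0;
	if (e->handle_arg_func)
		e->handle_arg_func(arg);
	e->task_func();
	return 1;
}
```

Calling demo_request(DEMO_TASK_WITH_ARG, 42UL) stores 42 before the task body runs, which reflects the ordering described for handle_arg_func in the custodian.c comments. Unlike this sketch, the patch raises an error when a task has no entry rather than returning a status code.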
Attachment: v15-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From f89191c37f626eafddfebc3829c01f7efa64978d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v15 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++++++----
src/backend/postmaster/custodian.c | 43 +++++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
4 files changed, 116 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..ff4cd8cef9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index d0fd955d4b..c4d0a22451 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -382,3 +388,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
v15-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch (text/x-diff; charset=us-ascii)
From aad4b8e66c49e9d1a06286407e84e9740a2a9f7f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v15 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ff4cd8cef9..a098060d76 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index c4d0a22451..394b7047af 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -231,6 +231,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f957b9aa49..2a3d5ccf73 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2073,6 +2074,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
On Mon, 28 Nov 2022 at 23:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks
from delaying shutdown.
That all seems good, thanks.
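For readers following along, the approach in 0004 (check a shutdown flag between units of work and stop early, leaving the rest for a later run) can be sketched in standalone C. This is only an illustration under stated assumptions, not code from the patch: shutdown_requested, handle_sigterm, process_items and demo_work are hypothetical names, and the flag merely stands in for PostgreSQL's ShutdownRequestPending.

```c
#include <signal.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for PostgreSQL's ShutdownRequestPending flag. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Signal handler: only records that a shutdown was requested. */
static void
handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Process up to nitems work items, checking the flag before each one.
 * Returns the number of items actually processed; anything left over is
 * simply retried on a later call.
 */
static size_t
process_items(size_t nitems, void (*work) (size_t))
{
    size_t      done = 0;

    for (size_t i = 0; i < nitems; i++)
    {
        /* Non-essential work: bail out instead of delaying shutdown. */
        if (shutdown_requested)
            break;

        work(i);
        done++;
    }

    return done;
}

/* Example work function: a shutdown request arrives while handling item 2. */
static void
demo_work(size_t i)
{
    printf("processing item %zu\n", i);
    if (i == 2)
        raise(SIGTERM);
}
```

After installing the handler with signal(SIGTERM, handle_sigterm), calling process_items(10, demo_work) handles items 0 through 2 and then stops, since the item in progress is allowed to finish and the flag is only consulted between items. The patch applies the same check at two levels: between tasks in DoCustodianTasks() and between directory entries inside each task.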
The last important point for me is tests, in src/test/modules
probably. It might be possible to reuse the final state of other
modules' tests to test cleanup, or at least integrate a custodian test
into each module.
--
Simon Riggs http://www.EnterpriseDB.com/
On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
The last important point for me is tests, in src/test/modules
probably. It might be possible to reuse the final state of other
modules' tests to test cleanup, or at least integrate a custodian test
into each module.
Of course. I found some existing tests for the test_decoding plugin that
appear to reliably generate the files we want the custodian to clean up, so
I added them there.
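The added tests wait for the custodian by polling with a bounded number of retries: the spill.sql hunk in 0002 sleeps 0.01 s per loop and gives up after 120 * 100 loops, i.e. after about two minutes. The same pattern is sketched below in standalone C. It is a rough illustration only; poll_until and files_gone are hypothetical names, and the counter is a stand-in for "files still present in the directory".

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/*
 * Call check() until it returns true or max_loops sleeps have elapsed.
 * Returns whether the condition was eventually observed.
 */
static bool
poll_until(bool (*check) (void *), void *arg, int max_loops, long sleep_ms)
{
    struct timespec ts;

    ts.tv_sec = sleep_ms / 1000;
    ts.tv_nsec = (sleep_ms % 1000) * 1000000L;

    for (int loops = 0; loops < max_loops; loops++)
    {
        if (check(arg))
            return true;

        /* Back off briefly before looking again. */
        nanosleep(&ts, NULL);
    }

    /* One last look after the final sleep. */
    return check(arg);
}

/*
 * Example condition: a counter of "remaining files" that shrinks by one on
 * every check, as if a background process were removing them.
 */
static bool
files_gone(void *arg)
{
    int        *remaining = arg;

    if (*remaining > 0)
        (*remaining)--;

    return *remaining == 0;
}
```

With a counter starting at 3, poll_until(files_gone, &remaining, 10, 1) succeeds on the third check; with a counter starting at 100 and only 5 loops allowed, it gives up and returns false. Bounding the loop keeps a test from hanging forever if the cleanup never happens, and the unconditional check after the loop is what turns a timeout into a visible failure.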
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v16-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From d8342d121d39d04e995986b4244abf369b833730 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v16 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 382 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 482 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..a94381bc21
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a8a246921f..6a74423172 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
v16-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From 71407bf47926c707401278d6274db7641549d975 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v16 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
contrib/test_decoding/expected/spill.out | 21 +++++++++++++++++++++
contrib/test_decoding/sql/spill.sql | 17 +++++++++++++++++
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
7 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/contrib/test_decoding/expected/spill.out b/contrib/test_decoding/expected/spill.out
index 10734bdb6a..75acbd5d5c 100644
--- a/contrib/test_decoding/expected/spill.out
+++ b/contrib/test_decoding/expected/spill.out
@@ -248,6 +248,27 @@ GROUP BY 1 ORDER BY 1;
(2 rows)
DROP TABLE spill_test;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
diff --git a/contrib/test_decoding/sql/spill.sql b/contrib/test_decoding/sql/spill.sql
index e638cacd3f..94d522f548 100644
--- a/contrib/test_decoding/sql/spill.sql
+++ b/contrib/test_decoding/sql/spill.sql
@@ -176,4 +176,21 @@ GROUP BY 1 ORDER BY 1;
DROP TABLE spill_test;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+
SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..c153c32a77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a94381bc21..d0fd955d4b 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index beddcbcdea..e7c4f69b42 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2036,14 +2036,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we clean up old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
v16-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From f3c7ff4ee56a66bd94d43bfabcf866f57d9eb829 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v16 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
contrib/test_decoding/expected/rewrite.out | 21 ++++++
contrib/test_decoding/sql/rewrite.sql | 17 +++++
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++---
src/backend/postmaster/custodian.c | 43 ++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
6 files changed, 154 insertions(+), 10 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index b30999c436..00b505ef67 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -152,6 +152,27 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
COMMIT
(6 rows)
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index 62dead3a9b..767eccbed4 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -100,6 +100,23 @@ VACUUM FULL pg_proc; VACUUM FULL pg_description; VACUUM FULL pg_shdescription; V
INSERT INTO replication_example(somedata, testcolumn1, testcolumn3) VALUES (9, 7, 1);
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+
SELECT pg_drop_replication_slot('regression_slot');
DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..ff4cd8cef9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index d0fd955d4b..c4d0a22451 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -382,3 +388,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
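For readers skimming 0003, the removal rule in RemoveOldLogicalRewriteMappings() can be sketched outside the backend roughly as follows. This is a minimal standalone sketch, not code from the patch: the sketch_* functions and SKETCH_* macros are hypothetical names, uint64_t stands in for XLogRecPtr, and 0 stands in for InvalidXLogRecPtr. It only parses a file name and applies the cutoff test; it does not touch the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define SKETCH_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr (hypothetical name). */
#define SKETCH_INVALID_LSN ((uint64_t) 0)

/*
 * Extract the LSN encoded in a mapping file name.  Returns false if the
 * name does not start with "map-" or does not match the format.
 */
static bool
sketch_parse_mapping_lsn(const char *name, uint64_t *lsn)
{
	unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;

	/* Skip over files that cannot be ours. */
	if (strncmp(name, "map-", 4) != 0)
		return false;

	if (sscanf(name, SKETCH_REWRITE_FORMAT,
			   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
		return false;

	/* The LSN is stored as two 32-bit hex halves: high_low. */
	*lsn = ((uint64_t) hi) << 32 | lo;
	return true;
}

/*
 * A file is kept only if its LSN is at or past a valid cutoff; otherwise
 * it is removable.  This is the negation of the "continue" test in the
 * patch's loop.
 */
static bool
sketch_mapping_is_removable(uint64_t lsn, uint64_t cutoff)
{
	return cutoff == SKETCH_INVALID_LSN || lsn < cutoff;
}
```

Note that an invalid cutoff makes every mapping removable, which matches the behavior the checkpointer had before this patch.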
v16-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch (text/x-diff; charset=us-ascii)
From 6282487d43a83590edd749c77618839a1291e36e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v16 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ff4cd8cef9..a098060d76 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index c4d0a22451..394b7047af 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -231,6 +231,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e7c4f69b42..939ad4c4ab 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2072,6 +2073,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
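To make the queueing behavior in 0001 easier to reason about, here is a minimal standalone sketch of the task queue used by CustodianEnqueueTask() and CustodianGetNextTask(). It is an illustration under stated assumptions rather than the patch's code: the Sketch* types and sketch_* functions are hypothetical names, the two task values are placeholders, and the spinlock that protects the shared-memory struct in the patch is omitted because the sketch is single-threaded.

```c
#include <assert.h>

typedef enum SketchTask
{
	SKETCH_TASK_A,				/* placeholder task */
	SKETCH_TASK_B,				/* placeholder task */
	SKETCH_NUM_TASKS,			/* new tasks go above */
	SKETCH_INVALID_TASK
} SketchTask;

/* One slot per task type, so the queue can never overflow. */
typedef struct SketchQueue
{
	SketchTask	elems[SKETCH_NUM_TASKS];
	int			head;
} SketchQueue;

static void
sketch_queue_init(SketchQueue *q)
{
	q->head = 0;
	for (int i = 0; i < SKETCH_NUM_TASKS; i++)
		q->elems[i] = SKETCH_INVALID_TASK;
}

/*
 * Enqueue a task unless it is already queued.  Returns 1 on success and 0
 * if no slot was found (which should not happen, since each task occupies
 * at most one slot).
 */
static int
sketch_enqueue(SketchQueue *q, SketchTask task)
{
	assert(task >= 0 && task < SKETCH_NUM_TASKS);

	for (int i = 0; i < SKETCH_NUM_TASKS; i++)
	{
		int			idx = (q->head + i) % SKETCH_NUM_TASKS;

		/* Reuse the slot if it is empty or already holds this task. */
		if (q->elems[idx] == SKETCH_INVALID_TASK || q->elems[idx] == task)
		{
			q->elems[idx] = task;
			return 1;
		}
	}
	return 0;
}

/* Dequeue the task at the head, or SKETCH_INVALID_TASK if the queue is empty. */
static SketchTask
sketch_dequeue(SketchQueue *q)
{
	SketchTask	next = q->elems[q->head];

	q->elems[q->head] = SKETCH_INVALID_TASK;
	q->head = (q->head + 1) % SKETCH_NUM_TASKS;
	return next;
}
```

Because a repeated request collapses into the existing slot, a burst of checkpoints requests each cleanup at most once, while tasks are still handed out in the order they were first requested.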
On Tue, Nov 29, 2022 at 07:56:53PM -0800, Nathan Bossart wrote:
> On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
>> The last important point for me is tests, in src/test/modules
>> probably. It might be possible to reuse the final state of other
>> modules' tests to test cleanup, or at least integrate a custodian test
>> into each module.
>
> Of course. I found some existing tests for the test_decoding plugin that
> appear to reliably generate the files we want the custodian to clean up, so
> I added them there.
cfbot is not happy with v16. AFAICT this is just due to poor placement, so
here is another attempt with the tests moved to a new location. Apologies
for the noise.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From d8342d121d39d04e995986b4244abf369b833730 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v17 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 382 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 482 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..a94381bc21
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shut down quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a8a246921f..6a74423172 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
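For readers skimming the 0001 patch, the queue behaviour in CustodianEnqueueTask() and CustodianGetNextTask() can be sketched outside the server. This is only an illustrative model, not the patch itself: the Demo* names are hypothetical, the spinlock is omitted, and the enqueue function returns 0 where the patch would raise an ERROR. It shows why the queue cannot overflow: it has one slot per task type, and enqueueing a task that is already queued reuses its slot.

```c
#include <assert.h>

/* Hypothetical task list standing in for the patch's CustodianTask enum. */
typedef enum DemoTask
{
    DEMO_TASK_A,
    DEMO_TASK_B,
    NUM_DEMO_TASKS,             /* new tasks go above */
    INVALID_DEMO_TASK
} DemoTask;

/* One slot per task type, plus the index of the next slot to dequeue. */
typedef struct DemoQueue
{
    DemoTask    elems[NUM_DEMO_TASKS];
    int         head;
} DemoQueue;

/* Mark every slot empty, as CustodianShmemInit() does for the real queue. */
void
demo_queue_init(DemoQueue *q)
{
    q->head = 0;
    for (int i = 0; i < NUM_DEMO_TASKS; i++)
        q->elems[i] = INVALID_DEMO_TASK;
}

/*
 * Scan from the head for an empty slot or a slot already holding this task.
 * Returns 1 on success and 0 if no slot was found (the patch raises an
 * ERROR in that case instead).
 */
int
demo_enqueue(DemoQueue *q, DemoTask task)
{
    for (int i = 0; i < NUM_DEMO_TASKS; i++)
    {
        int         idx = (q->head + i) % NUM_DEMO_TASKS;

        if (q->elems[idx] == INVALID_DEMO_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return 1;
        }
    }
    return 0;
}

/*
 * Take whatever is in the head slot (possibly INVALID_DEMO_TASK), clear the
 * slot, and advance the head, wrapping around at the end of the array.
 */
DemoTask
demo_get_next(DemoQueue *q)
{
    DemoTask    next = q->elems[q->head];

    q->elems[q->head] = INVALID_DEMO_TASK;
    q->head = (q->head + 1) % NUM_DEMO_TASKS;
    return next;
}
```

Requesting the same task twice before the custodian wakes up therefore leaves a single queue entry, which matches the "if it is not already enqueued" wording in the RequestCustodian() comment.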
Attachment: v17-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From e6fea13aafab0a85402f1c048db461284a6068f5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v17 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
contrib/test_decoding/expected/rewrite.out | 21 +++++++++++++++++++++
contrib/test_decoding/sql/rewrite.sql | 17 +++++++++++++++++
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
7 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index b30999c436..8b97f15f6f 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -162,3 +162,24 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index 62dead3a9b..d268fa559a 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -105,3 +105,20 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..c153c32a77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index a94381bc21..d0fd955d4b 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index beddcbcdea..e7c4f69b42 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2036,14 +2036,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
Attachment: v17-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 13dcd5d51c82c460f762df91546eab6d3d105204 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v17 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mappings files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
contrib/test_decoding/expected/rewrite.out | 19 ++++++
contrib/test_decoding/sql/rewrite.sql | 14 ++++
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++---
src/backend/postmaster/custodian.c | 43 ++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
6 files changed, 149 insertions(+), 10 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 8b97f15f6f..214a514a0a 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -183,3 +183,22 @@ SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
t
(1 row)
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d268fa559a..d66f70f837 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -122,3 +122,17 @@ BEGIN
END
$$;
SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..ff4cd8cef9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index d0fd955d4b..c4d0a22451 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -382,3 +388,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
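The removal rule in RemoveOldLogicalRewriteMappings() from the 0003 patch can be tried in isolation. The following is a rough sketch under stated assumptions: the demo_* helpers and DEMO_* macros are hypothetical, the format string is copied from LOGICAL_REWRITE_FORMAT, and the invalid LSN is assumed to be 0 as with InvalidXLogRecPtr. A file is kept only when a valid cutoff exists and the file's LSN is at or beyond it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define DEMO_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Assumption: mirrors InvalidXLogRecPtr, which is 0. */
#define DEMO_INVALID_LSN ((uint64_t) 0)

/*
 * Extract the LSN from a mapping file name.  Returns 1 and sets *lsn on
 * success, or 0 if the name does not match the expected format.
 */
int
demo_parse_mapping_lsn(const char *name, uint64_t *lsn)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;

    if (sscanf(name, DEMO_REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return 0;

    /* The LSN is stored as two 32-bit hex halves separated by '_'. */
    *lsn = ((uint64_t) hi) << 32 | lo;
    return 1;
}

/*
 * A mapping may be removed if it is older than the cutoff, or if there is
 * no valid cutoff at all (no logical slot needs any mapping).
 */
int
demo_mapping_is_removable(uint64_t lsn, uint64_t cutoff)
{
    return lsn < cutoff || cutoff == DEMO_INVALID_LSN;
}
```

This is the logical negation of the `lsn >= cutoff && cutoff != InvalidXLogRecPtr` test that the patch leaves in CheckPointLogicalRewriteHeap() for the flush path, so for a given cutoff every file is either flushed by the checkpointer or eligible for removal by the custodian, never both.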
Attachment: v17-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch (text/x-diff; charset=us-ascii)
From b6e8831d1c1ceccc1e2c7c1749082987304f18a3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v17 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ff4cd8cef9..a098060d76 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index c4d0a22451..394b7047af 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -231,6 +231,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e7c4f69b42..939ad4c4ab 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2072,6 +2073,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
On Wed, 30 Nov 2022 at 03:56, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
The last important point for me is tests, in src/test/modules
probably. It might be possible to reuse the final state of other
modules' tests to test cleanup, or at least integrate a custodian test
into each module.
Of course. I found some existing tests for the test_decoding plugin that
appear to reliably generate the files we want the custodian to clean up, so
I added them there.
Thanks for adding the tests; I can see they run clean.
The only minor thing I would personally add is a note in each piece of
code to explain where the tests are for each one and/or something in
the main custodian file that says tests exist within src/test/module.
Otherwise, ready for committer.
--
Simon Riggs http://www.EnterpriseDB.com/
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
cfbot is not happy with v16. AFAICT this is just due to poor placement, so
here is another attempt with the tests moved to a new location. Apologies
for the noise.
Thanks for the patches. I spent some time on reviewing v17 patch set
and here are my comments:
0001:
1. I think the custodian process needs documentation - it needs a
definition in glossary.sgml and perhaps a dedicated page describing
what tasks it takes care of.
2.
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that
ShutdownAuxiliaryProcess() does enough for the custodian process?
AFAICS, the custodian process uses LWLocks (which the
ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
buffers and so on.
Having said that, I'm fine to keep them for future use and all of
those cleanup functions exit if nothing related occurs.
3.
+ * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our
4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/Or setting the custodian's status on
the ps display is also a good idea IMO.
0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.
2. I think the best way to test the custodian process code is by
adding a TAP test module to see that the custodian process actually kicks
in. Perhaps add elog(DEBUGX,...) messages to various custodian
process functions and see if we see the logs in server logs.
0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.
Few thoughts:
1. I think we can trivially extend the custodian process to remove any
future WAL files on the old timeline (something like the attached
0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
While this offloads the recovery a bit, the server may archive such
WAL files before the custodian removes them. We can do a bit more to
stop the server from archiving such WAL files, but that needs more
coding. I don't think we need to do all that now, perhaps, we can give
it a try once the basic custodian stuff gets in.
2. Moving RemovePgTempFiles() to the custodian can bring up the server
sooner. The idea is that the postmaster just renames the temp
directories and informs the custodian so that it can go delete such
temp files and directories. I have personally seen cases where the
server spent a good amount of time cleaning up temp files. We can park
it for later.
3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
more aggressive (pre-allocate more than 1 WAL file), perhaps letting
custodian do that is a good idea. Again, too many tasks for a single
process.
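
The rename-then-delete idea in item 2 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any attached patch: the helper names stage_dir_for_removal() and remove_staged_dir() and the ".stale" suffix are hypothetical, and the sketch only handles a flat directory of regular files. The point is that the startup path pays for a single rename(), while the per-file unlink() calls happen later in another process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>			/* mkdir(), used when trying the sketch out */
#include <unistd.h>

/*
 * Hypothetical helper: move a temp directory aside so that startup does not
 * have to wait for its contents to be deleted.  Writes the staged name into
 * "staged" and returns 0 on success, -1 on failure.
 */
static int
stage_dir_for_removal(const char *dir, char *staged, size_t stagedlen)
{
	int			n = snprintf(staged, stagedlen, "%s.stale", dir);

	if (n < 0 || (size_t) n >= stagedlen)
		return -1;				/* staged name would be truncated */

	/* One rename, regardless of how many files the directory holds. */
	return rename(dir, staged);
}

/*
 * Hypothetical helper: what a background process could do later.  Removes
 * every entry of a flat staged directory, then the directory itself.
 */
static int
remove_staged_dir(const char *staged)
{
	DIR		   *d = opendir(staged);
	struct dirent *de;
	char		path[1024];

	if (d == NULL)
		return -1;

	while ((de = readdir(d)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s", staged, de->d_name);
		if (unlink(path) != 0)
		{
			closedir(d);
			return -1;
		}
	}

	closedir(d);
	return rmdir(staged);
}
```

A caller would invoke stage_dir_for_removal() once during startup and pass the staged path to whichever process does the cleanup; how that hand-off would work in the server is left open here.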
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
cfbot is not happy with v16. AFAICT this is just due to poor placement, so
here is another attempt with the tests moved to a new location. Apologies
for the noise.
Thanks for the patches. I spent some time on reviewing v17 patch set
and here are my comments:
0001:
1. I think the custodian process needs documentation - it needs a
definition in glossary.sgml and perhaps a dedicated page describing
what tasks it takes care of.
2.
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that
ShutdownAuxiliaryProcess() does enough for the custodian process?
AFAICS, the custodian process uses LWLocks (which the
ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
buffers and so on.
Having said that, I'm fine to keep them for future use and all of
those cleanup functions exit if nothing related occurs.
3.
+ * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our
4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/Or setting the custodian's status on
the ps display is also a good idea IMO.
0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.
2. I think the best way to test the custodian process code is by
adding a TAP test module to see that the custodian process actually kicks
in. Perhaps add elog(DEBUGX,...) messages to various custodian
process functions and see if we see the logs in server logs.
0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.
Few thoughts:
1. I think we can trivially extend the custodian process to remove any
future WAL files on the old timeline (something like the attached
0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
While this offloads the recovery a bit, the server may archive such
WAL files before the custodian removes them. We can do a bit more to
stop the server from archiving such WAL files, but that needs more
coding. I don't think we need to do all that now, perhaps, we can give
it a try once the basic custodian stuff gets in.
2. Moving RemovePgTempFiles() to the custodian can bring up the server
sooner. The idea is that the postmaster just renames the temp
directories and informs the custodian so that it can go delete such
temp files and directories. I have personally seen cases where the
server spent a good amount of time cleaning up temp files. We can park
it for later.
3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
more aggressive (pre-allocate more than 1 WAL file), perhaps letting
custodian do that is a good idea. Again, too many tasks for a single
process.
Another comment:
IIUC, there's no custodian_delay GUC as we want to avoid unnecessary
wakeups for power savings (being discussed in the other thread).
However, can it happen that the custodian misses SetLatch
wakeups from other backends? In other words, can the custodian process
be sleeping when there's work to do?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 05:27:10PM +0530, Bharath Rupireddy wrote:
On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks for the patches. I spent some time on reviewing v17 patch set
and here are my comments:
Thanks for reviewing!
0001:
1. I think the custodian process needs documentation - it needs a
definition in glossary.sgml and perhaps a dedicated page describing
what tasks it takes care of.
Good catch. I added this in v18. I stopped short of adding a dedicated
page to describe the tasks because 1) there are no parameters for the
custodian and 2) AFAICT none of its tasks are described in the docs today.
2.
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ AbortBufferIO();
+ UnlockBuffers();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Buffers(false);
+ AtEOXact_SMgr();
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that
ShutdownAuxiliaryProcess() does enough for the custodian process?
AFAICS, the custodian process uses LWLocks (which the
ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
buffers and so on.
Having said that, I'm fine to keep them for future use and all of
those cleanup functions exit if nothing related occurs.
Yeah, I don't think we need a few of these. In v18, I've kept the
following:
* LWLockReleaseAll()
* ConditionVariableCancelSleep()
* ReleaseAuxProcessResources(false)
* AtEOXact_Files(false)
3.
+ * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our
fixed
4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/Or setting the custodian's status on
the ps display is also a good idea IMO.
I'd like to pick these up in a new thread if/when this initial patch set is
committed. The tasks already do some logging, and the checkpointer process
doesn't update the ps display for these tasks today.
0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.
2. I think the best way to test the custodian process code is by
adding a TAP test module to see that the custodian process actually kicks
in. Perhaps add elog(DEBUGX,...) messages to various custodian
process functions and see if we see the logs in server logs.
The test appears to reliably create snapshot and mapping files, so if the
directories are empty at some point after the checkpoint at the end, we can
be reasonably certain the custodian took action. I didn't add explicit
checks that there are files in the directories before the checkpoint
because a concurrent checkpoint could make such checks unreliable.
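
For illustration only, the "directory is empty" condition that the test relies on could be expressed in C along these lines. The function name count_dir_entries() is hypothetical and not part of the patch set; the actual test is written in SQL, and this sketch just mirrors how the cleanup loops skip "." and "..".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count the entries in a directory, ignoring "." and
 * "..".  Returns -1 if the directory cannot be opened.  A result of 0 after
 * the final checkpoint would suggest that the cleanup task ran.
 */
static int
count_dir_entries(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	int			nentries = 0;

	assert(path != NULL);

	dir = opendir(path);
	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		nentries++;
	}

	closedir(dir);
	return nentries;
}
```

A check of this shape is only meaningful after the final checkpoint. As noted above, asserting a nonzero count beforehand would race with concurrent checkpoints.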
0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.
I'm keeping this one separate because I've received conflicting feedback
about the idea.
1. I think we can trivially extend the custodian process to remove any
future WAL files on the old timeline (something like the attached
0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
While this offloads the recovery a bit, the server may archive such
WAL files before the custodian removes them. We can do a bit more to
stop the server from archiving such WAL files, but that needs more
coding. I don't think we need to do all that now, perhaps, we can give
it a try once the basic custodian stuff gets in.
2. Moving RemovePgTempFiles() to the custodian can bring up the server
sooner. The idea is that the postmaster just renames the temp
directories and informs the custodian so that it can go delete such
temp files and directories. I have personally seen cases where the
server spent a good amount of time cleaning up temp files. We can park
it for later.
3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
more aggressive (pre-allocate more than 1 WAL file), perhaps letting
custodian do that is a good idea. Again, too many tasks for a single
process.
I definitely want to do #2. I have some patches for that upthread, but I
removed them for now based on Simon's feedback. I intend to pick that up
in a new thread. I haven't thought too much about the others yet.
Another comment:
IIUC, there's no custodian_delay GUC as we want to avoid unnecessary
wakeups for power savings (being discussed in the other thread).
However, can it happen that the custodian misses SetLatch
wakeups from other backends? In other words, can the custodian process
be sleeping when there's work to do?
I'm not aware of any way this could happen, but if there is one, I think we
should treat it as a bug instead of relying on the custodian process to
periodically wake up and check for work to do.
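
To make the reasoning concrete, here is a toy, single-process model of the ordering used in the patch's main loop (ResetLatch(), then DoCustodianTasks(), then WaitLatch()) and in RequestCustodian() (enqueue, then SetLatch()). Everything named Toy* or toy_* is a hypothetical stand-in, not PostgreSQL's latch API, and the model ignores memory barriers and real concurrency. It only shows why a request that arrives after the queue is drained but before the wait should leave the latch set, so the wait would not block.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the latch and the task queue (not PostgreSQL types). */
typedef struct
{
	bool		is_set;
} ToyLatch;

typedef struct
{
	int			pending;		/* number of queued task requests */
	ToyLatch	latch;
} ToyCustodian;

/* Requester side: enqueue first, then set the latch. */
static void
toy_request(ToyCustodian *c)
{
	c->pending++;
	c->latch.is_set = true;
}

/*
 * Custodian side, one loop iteration up to the wait: reset the latch first,
 * then drain the queue.  Returns the number of tasks that were run.
 */
static int
toy_iteration_work(ToyCustodian *c)
{
	int			done;

	c->latch.is_set = false;	/* models ResetLatch() */
	done = c->pending;			/* models DoCustodianTasks() */
	c->pending = 0;
	return done;
}

/* Would the wait at the bottom of the loop go to sleep? */
static bool
toy_wait_would_block(const ToyCustodian *c)
{
	return !c->latch.is_set;
}
```

If the reset happened after the drain instead, a request landing between the two steps would be wiped out and the process could sleep with work queued, which is the case the question above is worried about.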
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v18-0001-Introduce-custodian.patchtext/x-diff; charset=us-asciiDownload
From b71edf8b0112e102bf580ac86ec3d1c29f3afa81 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v18 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
doc/src/sgml/glossary.sgml | 11 +
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 377 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
14 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..ad3f53e2a3 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -144,6 +144,7 @@
(but not the autovacuum workers),
the <glossterm linkend="glossary-background-writer">background writer</glossterm>,
the <glossterm linkend="glossary-checkpointer">checkpointer</glossterm>,
+ the <glossterm linkend="glossary-custodian">custodian</glossterm>,
the <glossterm linkend="glossary-logger">logger</glossterm>,
the <glossterm linkend="glossary-startup-process">startup process</glossterm>,
the <glossterm linkend="glossary-wal-archiver">WAL archiver</glossterm>,
@@ -484,6 +485,16 @@
</glossdef>
</glossentry>
+ <glossentry id="glossary-custodian">
+ <glossterm>Custodian (process)</glossterm>
+ <glossdef>
+ <para>
+ An <glossterm linkend="glossary-auxiliary-proc">auxiliary process</glossterm>
+ that is responsible for executing assorted cleanup tasks.
+ </para>
+ </glossdef>
+ </glossentry>
+
<glossentry>
<glossterm>Data area</glossterm>
<glosssee otherterm="glossary-data-directory" />
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..e5af958999
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,377 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 293a44ca29..ac72a8a07f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a8a246921f..6a74423172 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -240,6 +240,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -537,6 +538,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1808,13 +1810,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2728,6 +2733,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3025,6 +3032,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3118,6 +3127,20 @@ reaper(SIGNAL_ARGS)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3532,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3685,6 +3714,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3722,6 +3754,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3815,6 +3848,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4027,6 +4061,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b204ecdbc3..cf80e65779 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b1c35653fc..6a8485e865 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -180,6 +180,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b2abd75ddb..63fd242b1e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb1046450b..f19f4c3075 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -278,6 +278,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..59a95dd7c0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,6 +323,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -429,6 +430,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -441,6 +443,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index aa13e1d66e..8f0e696663 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 0b2100be4a..48602c8a16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_RECOVERY_WAL_STREAM,
--
2.25.1
v18-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch
From 71a8437e9a6285a29e1f46bc0e714c0c6d7ff2c7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v18 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
contrib/test_decoding/expected/rewrite.out | 21 +++++++++++++++++++++
contrib/test_decoding/sql/rewrite.sql | 17 +++++++++++++++++
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
7 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index b30999c436..8b97f15f6f 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -162,3 +162,24 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index 62dead3a9b..d268fa559a 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -105,3 +105,20 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..c153c32a77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -7001,10 +7001,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index e5af958999..9382d524a6 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index beddcbcdea..e7c4f69b42 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2036,14 +2036,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..9eba403e0c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
v18-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch
From 4175a957c77513861faa3043a7cc739869d3fa22 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v18 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mappings files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
contrib/test_decoding/expected/rewrite.out | 19 ++++++
contrib/test_decoding/sql/rewrite.sql | 14 ++++
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++---
src/backend/postmaster/custodian.c | 43 ++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
6 files changed, 149 insertions(+), 10 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 8b97f15f6f..214a514a0a 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -183,3 +183,22 @@ SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
t
(1 row)
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d268fa559a..d66f70f837 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -122,3 +122,17 @@ BEGIN
END
$$;
SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..ff4cd8cef9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 9382d524a6..33185e9913 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -377,3 +383,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5cc04756a5..bc875330d7 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
v18-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch
From 7c612cc6f3257f23fdc626a745e114ea2ef66308 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v18 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ff4cd8cef9..a098060d76 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 33185e9913..5c24c5aefe 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -226,6 +226,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e7c4f69b42..939ad4c4ab 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2072,6 +2073,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/or setting the custodian's status on
the ps display is also a good idea IMO.
I'd like to pick these up in a new thread if/when this initial patch set is
committed. The tasks already do some logging, and the checkpointer process
doesn't update the ps display for these tasks today.
It'll be good to have some kind of dedicated monitoring for the
custodian process as it can do a "good" amount of work at times and
users will have a way to know what it currently is doing - it can be
logs at debug level, progress reporting via
ereport_startup_progress()-sort of mechanism, ps display,
pg_stat_custodian, a special function that reports some details, or
some other mechanism. In any case, I agree to park this for later.
0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.
2. I think the best way to test the custodian process code is by
adding a TAP test module to see that the custodian process actually
kicks in. Perhaps add elog(DEBUGX,...) messages to various custodian
process functions and check that they show up in the server logs.
The test appears to reliably create snapshot and mapping files, so if the
directories are empty at some point after the checkpoint at the end, we can
be reasonably certain the custodian took action. I didn't add explicit
checks that there are files in the directories before the checkpoint
because a concurrent checkpoint could make such checks unreliable.
I think you're right. I added SQL queries to check that the snapshot
and mapping file counts are > 0, see [1], and the cirrus-ci members
are happy too -
https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
think we can consider adding these count > 0 checks to the tests.
0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.
I'm keeping this one separate because I've received conflicting feedback
about the idea.
If we classify custodian as a process doing non-critical tasks that
have nothing to do with regular server functioning, then processing
ShutdownRequestPending looks okay. However, delaying these
non-critical tasks, such as file removals that reclaim disk space,
might impact the server overall, especially when it's approaching 100%
disk usage and we want the custodian to do its job fully before we
shut down the server.
If we delay processing shutdown requests, that can impact the server
overall (might delay restarts, failovers etc.), because at times there
can be a lot of tasks with a good amount of work pending in the
custodian's task queue.
Having said that, I'm okay with processing ShutdownRequestPending as
early as possible. However, should we also add CHECK_FOR_INTERRUPTS()
alongside ShutdownRequestPending?
Also, I think it's enough to have the ShutdownRequestPending check
only in the main loop of DoCustodianTasks(), and we can let
RemoveOldSerializedSnapshots() and RemoveOldLogicalRewriteMappings()
do their jobs to the fullest as they do today.
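To make the trade-off concrete, here is a minimal sketch of how much work can still happen after a shutdown request, depending on where the check sits. It is a standalone simulation, not code from the patch set: the function name, its parameters, and the local `pending` flag standing in for ShutdownRequestPending are all hypothetical.

```c
#include <stdbool.h>

/*
 * Simulate a custodian run of ntasks tasks, each removing nfiles files.
 * A shutdown request "arrives" once request_after files have been removed.
 * Returns how many files are still removed after the request arrived.
 *
 * check_inside_task = false: only the task loop checks the flag.
 * check_inside_task = true:  the per-file loop checks it too (as in 0004).
 */
static int
files_removed_after_request(int ntasks, int nfiles, int request_after,
							bool check_inside_task)
{
	bool		pending = false;	/* stand-in for ShutdownRequestPending */
	int			removed = 0;
	int			late = 0;

	for (int t = 0; t < ntasks; t++)
	{
		/* check between tasks (present in both variants) */
		if (pending)
			break;

		for (int f = 0; f < nfiles; f++)
		{
			/* optional check between individual file removals */
			if (check_inside_task && pending)
				break;

			removed++;
			if (pending)
				late++;			/* work done after shutdown was requested */

			/* simulate the request arriving asynchronously */
			if (removed == request_after)
				pending = true;
		}
	}

	return late;
}
```

With two tasks of 1000 files each and a request arriving after the 10th removal, the task-loop-only variant still removes 990 files before it notices the request, while the per-file variant removes none. So the question is really whether finishing the current task is worth that delay.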
While thinking about this, one thing that really struck me is what
happens if we let the custodian exit, say after processing
ShutdownRequestPending immediately or after a restart, leaving other
queued tasks? The custodian will never get to work on those tasks
unless the requestors (say the checkpointer or some other process)
request it to do so after the restart. Maybe we don't need to worry
about it. Maybe we do. Maybe it's overkill to save the custodian's
task state to disk so that it can come up and do the leftover tasks
upon restart.
Another comment:
IIUC, there's no custodian_delay GUC as we want to avoid unnecessary
wakeups for power savings (being discussed in the other thread).
However, can it happen that the custodian misses SetLatch() wakeups
from other backends? In other words, can the custodian process
be sleeping when there's work to do?
I'm not aware of any way this could happen, but if there is one, I think we
should treat it as a bug instead of relying on the custodian process to
periodically wake up and check for work to do.
One possible scenario: the requestor adds its task details to the
queue and sets the latch while the custodian is in the midst of
processing a task, so the custodian can miss this SetLatch(). However,
the custodian still guarantees the requestor that it will process the
added task after it completes the current one. I don't know of any
other way the custodian can miss SetLatch().
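The scenario above can be modeled with a small single-threaded sketch. This is only an illustration under stated assumptions, not PostgreSQL's latch implementation: DemoLatch, demo_request(), demo_run_task(), and demo_loop_iteration() are hypothetical names, and a "concurrent" requestor is simulated by having a running task inject a new request. The loop follows the same order as CustodianMain() in the patch (reset the latch, drain the queue, then wait).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified latch: just a flag. */
typedef struct DemoLatch
{
	bool		is_set;
} DemoLatch;

static DemoLatch demo_latch;
static int	demo_pending = 0;	/* requests waiting in the queue */
static int	demo_done = 0;		/* requests processed so far */
static int	demo_arrivals_during_task = 0;	/* requests to inject mid-task */

/* Requestor side: queue the work first, then set the latch. */
static void
demo_request(void)
{
	demo_pending++;
	demo_latch.is_set = true;
}

/* Run one task; optionally simulate a request arriving while it runs. */
static void
demo_run_task(void)
{
	if (demo_arrivals_during_task > 0)
	{
		demo_arrivals_during_task--;
		demo_request();
	}
	demo_done++;
}

/*
 * One pass of the main loop: reset the latch, drain the queue, then "wait".
 * Returns true if the wait would actually block, i.e. the latch is still
 * clear after all queued work has been handled.
 */
static bool
demo_loop_iteration(void)
{
	demo_latch.is_set = false;	/* models ResetLatch() */

	while (demo_pending > 0)	/* models DoCustodianTasks() */
	{
		demo_pending--;
		demo_run_task();
	}

	assert(demo_pending == 0);
	return !demo_latch.is_set;	/* models WaitLatch() */
}
```

In this model a request that arrives mid-task is picked up by the same drain loop, and because the latch was reset before the queue was examined, the latch is still set afterwards. The next wait therefore returns immediately instead of sleeping with work outstanding.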
[1]
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 214a514a0a..0029e48852 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -163,6 +163,20 @@ DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
-- make sure custodian cleans up files
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
CHECKPOINT;
DO $$
DECLARE
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d66f70f837..c076809f37 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -107,6 +107,13 @@ DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
-- make sure custodian cleans up files
+
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+
CHECKPOINT;
DO $$
DECLARE
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
The test appears to reliably create snapshot and mapping files, so if the
directories are empty at some point after the checkpoint at the end, we can
be reasonably certain the custodian took action. I didn't add explicit
checks that there are files in the directories before the checkpoint
because a concurrent checkpoint could make such checks unreliable.
I think you're right. I added SQL queries to check that the snapshot and
mapping file counts are > 0 (see [1]), and the cirrus-ci members are happy too -
https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
think we can consider adding these count > 0 checks to the tests.
My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable. In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files to be cleaned up just
beforehand, they will fail.
Having said that, I'm okay with processing ShutdownRequestPending as early
as possible; however, should we also add CHECK_FOR_INTERRUPTS()
alongside ShutdownRequestPending?
I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one?
While thinking about this, one thing that really struck me is what
happens if we let the custodian exit, say after processing
ShutdownRequestPending immediately or after a restart, leaving other
queued tasks? The custodian will never get to work on those tasks
unless the requestors (say, a checkpoint or some other process) request
it to do so after restart. Maybe we don't need to worry about it.
Maybe we need to worry about it. Maybe it's overkill to save the
custodian's task state to disk so that it can come up and do the
leftover tasks upon restart.
Yes, tasks will need to be retried when the server starts again. The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well. Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
The test appears to reliably create snapshot and mapping files, so if the
directories are empty at some point after the checkpoint at the end, we can
be reasonably certain the custodian took action. I didn't add explicit
checks that there are files in the directories before the checkpoint
because a concurrent checkpoint could make such checks unreliable.
I think you're right. I added SQL queries to check that the snapshot and
mapping file counts are > 0 (see [1]), and the cirrus-ci members are happy too -
https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
think we can consider adding these count > 0 checks to the tests.
My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable. In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files to be cleaned up just
beforehand, they will fail.
Hm. A TAP test module would have been better for testing the
custodian code reliably. Anyway, that shouldn't stop the patch from
getting in. If required, we can park the TAP test module for later, IMO.
Others may have different thoughts here.
Having said that, I'm okay with processing ShutdownRequestPending as early
as possible; however, should we also add CHECK_FOR_INTERRUPTS()
alongside ShutdownRequestPending?
I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one?
Since the custodian uses SignalHandlerForShutdownRequest as its SIGINT
and SIGTERM handler, rather than StatementCancelHandler and die
respectively, there's no need for CFI, I guess. Also, none of the
signal handler flags that CFI checks apply to the custodian.
While thinking about this, one thing that really struck me is what
happens if we let the custodian exit, say after processing
ShutdownRequestPending immediately or after a restart, leaving other
queued tasks? The custodian will never get to work on those tasks
unless the requestors (say, a checkpoint or some other process) request
it to do so after restart. Maybe we don't need to worry about it.
Maybe we need to worry about it. Maybe it's overkill to save the
custodian's task state to disk so that it can come up and do the
leftover tasks upon restart.
Yes, tasks will need to be retried when the server starts again. The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well. Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.
Right.
The v18 patch set posted upthread
/messages/by-id/20221201214026.GA1799688@nathanxps13
looks good to me. I see the CF entry is marked RfC -
https://commitfest.postgresql.org/41/3448/.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
rebased for cfbot
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v19-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From abbd26a3bcfcc828e196187e9f6abf6af64f3393 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v19 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
doc/src/sgml/glossary.sgml | 11 +
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 377 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
14 files changed, 488 insertions(+), 5 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..ad3f53e2a3 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -144,6 +144,7 @@
(but not the autovacuum workers),
the <glossterm linkend="glossary-background-writer">background writer</glossterm>,
the <glossterm linkend="glossary-checkpointer">checkpointer</glossterm>,
+ the <glossterm linkend="glossary-custodian">custodian</glossterm>,
the <glossterm linkend="glossary-logger">logger</glossterm>,
the <glossterm linkend="glossary-startup-process">startup process</glossterm>,
the <glossterm linkend="glossary-wal-archiver">WAL archiver</glossterm>,
@@ -484,6 +485,16 @@
</glossdef>
</glossentry>
+ <glossentry id="glossary-custodian">
+ <glossterm>Custodian (process)</glossterm>
+ <glossdef>
+ <para>
+ An <glossterm linkend="glossary-auxiliary-proc">auxiliary process</glossterm>
+ that is responsible for executing assorted cleanup tasks.
+ </para>
+ </glossdef>
+ </glossentry>
+
<glossentry>
<glossterm>Data area</glossterm>
<glosssee otherterm="glossary-data-directory" />
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..a1f042f13a 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..98bb9efcfd
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,377 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 9079922de7..63f2abe3a1 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -6,6 +6,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..e3aef4081e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -249,6 +249,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -560,6 +561,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1795,13 +1797,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2732,6 +2737,8 @@ process_pm_reload_request(void)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3085,6 +3092,8 @@ process_pm_child_exit(void)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3178,6 +3187,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3590,6 +3613,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3744,6 +3773,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3781,6 +3813,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3877,6 +3910,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4092,6 +4126,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..4268a941ad 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..40a83636fe 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -178,6 +178,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6e4599278c..9348b441ba 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 59532bbd80..25c2ba97b8 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -283,6 +283,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..738e7f8fb0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -324,6 +324,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -430,6 +431,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -442,6 +444,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4258cd92c9..25e00a14ff 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6cacd6edaf..ea8ba623e9 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_MAIN,
--
2.25.1
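For readers skimming the 0001 patch: the custodian's queue (CustodianEnqueueTask() and CustodianGetNextTask() in custodian.c) is a ring with one slot per task type, so requesting a task that is already pending does not queue it twice. Below is a minimal standalone sketch of that behavior. It leaves out the spinlock and shared memory, and the names (Task, TaskQueue, queue_init, queue_enqueue, queue_next) are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task identifiers; the patch uses the CustodianTask enum. */
typedef enum
{
    TASK_A,
    TASK_B,
    NUM_TASKS,                  /* new tasks go above */
    INVALID_TASK
} Task;

typedef struct
{
    Task        elems[NUM_TASKS];   /* one slot per task type */
    int         head;               /* index of the next task to run */
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
    q->head = 0;
    for (int i = 0; i < NUM_TASKS; i++)
        q->elems[i] = INVALID_TASK;
}

/* Returns true if the task is queued on return (newly or already). */
static bool
queue_enqueue(TaskQueue *q, Task task)
{
    assert(task >= 0 && task < NUM_TASKS);

    for (int i = 0; i < NUM_TASKS; i++)
    {
        int         idx = (q->head + i) % NUM_TASKS;

        /* Reuse the slot already holding this task, or take the first free one. */
        if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return true;
        }
    }

    /* Not reachable while there is one slot per task type. */
    return false;
}

/* Dequeues and returns the next task, or INVALID_TASK if none is queued. */
static Task
queue_next(TaskQueue *q)
{
    Task        next = q->elems[q->head];

    q->elems[q->head] = INVALID_TASK;
    q->head = (q->head + 1) % NUM_TASKS;
    return next;
}
```

In this sketch, enqueueing TASK_B twice and then TASK_A leaves two pending entries, and queue_next() hands them back in request order before returning INVALID_TASK.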
v19-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From c9e0527d39f697408e01bc0ed1730e5f7eeeffd5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v19 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
contrib/test_decoding/expected/rewrite.out | 21 +++++++++++++++++++++
contrib/test_decoding/sql/rewrite.sql | 17 +++++++++++++++++
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
7 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index b30999c436..8b97f15f6f 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -162,3 +162,24 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index 62dead3a9b..d268fa559a 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -105,3 +105,20 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..382e59f723 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6997,10 +6997,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 98bb9efcfd..4e0ce1f7b3 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 829c568112..6b403a2bb4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2036,14 +2036,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index f49b941b53..5f1ba3842c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
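The 0002 patch registers the snapshot cleanup by adding one row to cust_task_functions, a table terminated by an INVALID_CUSTODIAN_TASK entry. A rough standalone sketch of that kind of sentinel-terminated lookup follows. The loop is my assumption about how such a table would be walked (the patch raises an ERROR on a failed lookup rather than returning NULL), and every name here is hypothetical.

```c
#include <stddef.h>

/* Hypothetical task identifiers standing in for the CustodianTask enum. */
typedef enum
{
    TASK_REMOVE_SNAPSHOTS,
    NUM_TASKS,                  /* new tasks go above */
    INVALID_TASK
} Task;

typedef void (*TaskFunction) (void);
typedef void (*TaskHandleArg) (long arg);

struct task_funcs_entry
{
    Task        task;
    TaskFunction task_func;         /* performs the task */
    TaskHandleArg handle_arg_func;  /* optional; stores extra request info */
};

/* Stub task body: only counts how often it ran. */
static int  snapshots_removed = 0;

static void
remove_snapshots(void)
{
    snapshots_removed++;
}

static const struct task_funcs_entry task_functions[] = {
    {TASK_REMOVE_SNAPSHOTS, remove_snapshots, NULL},
    {INVALID_TASK, NULL, NULL}  /* must be last */
};

/* Walks the table until the sentinel; returns NULL if the task is unknown. */
static const struct task_funcs_entry *
lookup_task_functions(Task task)
{
    for (const struct task_funcs_entry *e = task_functions;
         e->task != INVALID_TASK; e++)
    {
        if (e->task == task)
            return e;
    }
    return NULL;
}
```

With a table like this, adding a task means adding an enum value above NUM_TASKS plus one table row, which is the shape of the change 0002 makes.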
v19-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From 493d858dc4acfc80d201df6d9f7bb45d83881e10 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v19 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
contrib/test_decoding/expected/rewrite.out | 19 ++++++
contrib/test_decoding/sql/rewrite.sql | 14 ++++
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++---
src/backend/postmaster/custodian.c | 43 ++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
6 files changed, 149 insertions(+), 10 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 8b97f15f6f..214a514a0a 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -183,3 +183,22 @@ SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
t
(1 row)
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d268fa559a..d66f70f837 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -122,3 +122,17 @@ BEGIN
END
$$;
SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8993c1ed5a..9ea0f81ac3 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 4e0ce1f7b3..4cbd89fae9 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -377,3 +383,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 1125457053..dc3eb3e308 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
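To make the removal rule in 0003 concrete: RemoveOldLogicalRewriteMappings() rebuilds each file's LSN from the two hex halves in its name and removes the file when that LSN is below the saved cutoff, or when the cutoff is InvalidXLogRecPtr. Here is a small standalone sketch of just that decision. The helper names are hypothetical, plain unsigned int stands in for Oid and TransactionId, and the sample file name in the note below is made up.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same pattern as LOGICAL_REWRITE_FORMAT in rewriteheap.h. */
#define MAPPING_NAME_FORMAT "map-%x-%x-%X_%X-%x-%x"

/* Stand-in for InvalidXLogRecPtr, which is 0 in PostgreSQL. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Extracts the LSN embedded in a mapping file name.  Returns false if the
 * name does not match the expected pattern.
 */
static bool
mapping_file_lsn(const char *name, uint64_t *lsn)
{
    unsigned int dboid,
                relid,
                hi,
                lo,
                rewrite_xid,
                create_xid;

    if (sscanf(name, MAPPING_NAME_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return false;

    /* The LSN is stored as two 32-bit halves separated by '_'. */
    *lsn = ((uint64_t) hi) << 32 | lo;
    return true;
}

/* Mirrors the patch's test: keep only if lsn >= cutoff and cutoff is valid. */
static bool
mapping_file_is_removable(uint64_t lsn, uint64_t cutoff)
{
    return cutoff == INVALID_LSN || lsn < cutoff;
}
```

For a made-up name such as "map-4000-4eb-1_60DE1E08-5c8-5c6", the parsed LSN is 0x160DE1E08, so the file would be kept for a cutoff of 0x100000000 and removed for a cutoff of 0x200000000.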
v19-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch (text/x-diff; charset=us-ascii)
From 45e69ea5ab047b5d296de0f8869d359a17fdcf99 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v19 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9ea0f81ac3..3ee635fe77 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 4cbd89fae9..274b2d4a79 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -226,6 +226,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6b403a2bb4..0890825fb9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2072,6 +2073,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
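The pattern 0004 adds to each cleanup loop is a per-iteration check of a flag set by the shutdown signal handler, so a long directory scan stops at the next file instead of holding up shutdown. A minimal standalone sketch of that idea, assuming a plain sig_atomic_t flag rather than PostgreSQL's ShutdownRequestPending and interrupt machinery; all names are hypothetical.

```c
#include <signal.h>
#include <stddef.h>

/* Set from the signal handler, polled by the work loop. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Stub work item: only counts how many items were handled. */
static size_t items_processed = 0;

static void
count_item(size_t i)
{
    (void) i;
    items_processed++;
}

/*
 * Handles up to nitems work items, checking for a shutdown request before
 * each one.  Returns the number of items handled; leftover work is simply
 * skipped and can be retried on a later run.
 */
static size_t
process_items(size_t nitems, void (*process) (size_t))
{
    size_t      done = 0;

    for (size_t i = 0; i < nitems; i++)
    {
        if (shutdown_requested)
            break;

        process(i);
        done++;
    }
    return done;
}
```

A caller would install the handler with something like signal(SIGTERM, handle_shutdown) and treat a short count from process_items() as "try again next time", which is how the patch treats the leftover files.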
On Thu, Feb 02, 2023 at 09:48:08PM -0800, Nathan Bossart wrote:
> rebased for cfbot

another rebase for cfbot
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v20-0001-Introduce-custodian.patch (text/x-diff; charset=us-ascii)
From 1c9b95cae7adcc57b7544a44ff16a26e71c6c736 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Wed, 5 Jan 2022 19:24:22 +0000
Subject: [PATCH v20 1/4] Introduce custodian.
The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing. This commit simply adds the new process; it does
not yet do anything useful.
---
doc/src/sgml/glossary.sgml | 11 +
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/custodian.c | 377 ++++++++++++++++++++++++
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 38 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/init/miscinit.c | 3 +
src/include/miscadmin.h | 3 +
src/include/postmaster/custodian.h | 32 ++
src/include/storage/proc.h | 11 +-
src/include/utils/wait_event.h | 1 +
15 files changed, 491 insertions(+), 6 deletions(-)
create mode 100644 src/backend/postmaster/custodian.c
create mode 100644 src/include/postmaster/custodian.h
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..ad3f53e2a3 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -144,6 +144,7 @@
(but not the autovacuum workers),
the <glossterm linkend="glossary-background-writer">background writer</glossterm>,
the <glossterm linkend="glossary-checkpointer">checkpointer</glossterm>,
+ the <glossterm linkend="glossary-custodian">custodian</glossterm>,
the <glossterm linkend="glossary-logger">logger</glossterm>,
the <glossterm linkend="glossary-startup-process">startup process</glossterm>,
the <glossterm linkend="glossary-wal-archiver">WAL archiver</glossterm>,
@@ -484,6 +485,16 @@
</glossdef>
</glossentry>
+ <glossentry id="glossary-custodian">
+ <glossterm>Custodian (process)</glossterm>
+ <glossdef>
+ <para>
+ An <glossterm linkend="glossary-auxiliary-proc">auxiliary process</glossterm>
+ that is responsible for executing assorted cleanup tasks.
+ </para>
+ </glossdef>
+ </glossentry>
+
<glossentry>
<glossterm>Data area</glossterm>
<glosssee otherterm="glossary-data-directory" />
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..5f4dde85cf 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
bgworker.o \
bgwriter.o \
checkpointer.o \
+ custodian.o \
fork_process.o \
interrupt.o \
pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..a1f042f13a 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case CheckpointerProcess:
MyBackendType = B_CHECKPOINTER;
break;
+ case CustodianProcess:
+ MyBackendType = B_CUSTODIAN;
+ break;
case WalWriterProcess:
MyBackendType = B_WAL_WRITER;
break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
CheckpointerMain();
proc_exit(1);
+ case CustodianProcess:
+ CustodianMain();
+ proc_exit(1);
+
case WalWriterProcess:
WalWriterMain();
proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 0000000000..98bb9efcfd
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,377 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc. Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding). However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode). The custodian is not
+ * an essential process and can shutdown quickly when requested. The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/custodian.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+ slock_t cust_lck;
+
+ CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+ int task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+ CustodianTask task;
+ CustodianTaskFunction task_func; /* performs task */
+ CustodianTaskHandleArg handle_arg_func; /* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is requested via RequestCustodian(). handle_arg_func is an
+ * optional function for providing extra information for the next invocation of
+ * the task. Typically, the extra information should be stored in shared
+ * memory for access from the custodian process. handle_arg_func is invoked
+ * before enqueueing the task, and it will still be invoked regardless of
+ * whether the task is already enqueued.
+ */
+static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
+};
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext custodian_context;
+
+ /*
+ * Properly accept or ignore signals that might be sent to us.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks.
+ */
+ custodian_context = AllocSetContextCreate(TopMemoryContext,
+ "Custodian",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(custodian_context);
+
+ /*
+ * If an exception is encountered, processing resumes here. As with other
+ * auxiliary processes, we cannot use PG_TRY because this is the bottom of
+ * the exception stack.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about.
+ */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(custodian_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(custodian_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+
+ /*
+ * Close all open files after any error. This is helpful on Windows,
+ * where holding deleted files open causes various strange errors.
+ * It's not clear we need it elsewhere, but shouldn't hurt.
+ */
+ smgrcloseall();
+
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->custodianLatch = &MyProc->procLatch;
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleMainLoopInterrupts();
+
+ DoCustodianTasks();
+
+ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+ WAIT_EVENT_CUSTODIAN_MAIN);
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * DoCustodianTasks
+ * Perform requested custodian tasks
+ *
+ * If we are not in a standalone backend, the custodian will re-enqueue the
+ * currently running task if an exception is encountered.
+ */
+static void
+DoCustodianTasks(void)
+{
+ CustodianTask task;
+
+ while ((task = CustodianGetNextTask()) != INVALID_CUSTODIAN_TASK)
+ {
+ CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+
+ PG_TRY();
+ {
+ (*func) ();
+ }
+ PG_CATCH();
+ {
+ if (IsPostmasterEnvironment)
+ CustodianEnqueueTask(task);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+}
+
+Size
+CustodianShmemSize(void)
+{
+ return sizeof(CustodianShmemStruct);
+}
+
+void
+CustodianShmemInit(void)
+{
+ Size size = CustodianShmemSize();
+ bool found;
+
+ CustodianShmem = (CustodianShmemStruct *)
+ ShmemInitStruct("Custodian Data", size, &found);
+
+ if (!found)
+ {
+ memset(CustodianShmem, 0, size);
+ SpinLockInit(&CustodianShmem->cust_lck);
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ CustodianShmem->task_queue_elems[i] = INVALID_CUSTODIAN_TASK;
+ }
+}
+
+/*
+ * RequestCustodian
+ * Called to request a custodian task.
+ *
+ * In standalone backends, the task is performed immediately in the current
+ * process, and this function will not return until it completes. Otherwise,
+ * the task is added to the custodian's queue if it is not already enqueued,
+ * and this function returns without waiting for the task to complete.
+ *
+ * arg can be used to provide additional information to the custodian that is
+ * necessary for the task. Typically, the handling function should store this
+ * information in shared memory for later use by the custodian. Note that the
+ * task's handling function for arg is invoked before enqueueing the task, and
+ * it will still be invoked regardless of whether the task is already enqueued.
+ */
+void
+RequestCustodian(CustodianTask requested, Datum arg)
+{
+ CustodianTaskHandleArg arg_func = (LookupCustodianFunctions(requested))->handle_arg_func;
+
+ /* First process any extra information provided in the request. */
+ if (arg_func)
+ (*arg_func) (arg);
+
+ CustodianEnqueueTask(requested);
+
+ if (!IsPostmasterEnvironment)
+ DoCustodianTasks();
+ else if (ProcGlobal->custodianLatch)
+ SetLatch(ProcGlobal->custodianLatch);
+}
+
+/*
+ * CustodianEnqueueTask
+ * Add a task to the custodian's queue
+ *
+ * If the task is already in the queue, this function has no effect.
+ */
+static void
+CustodianEnqueueTask(CustodianTask task)
+{
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
+ {
+ int idx = (CustodianShmem->task_queue_head + i) % NUM_CUSTODIAN_TASKS;
+ CustodianTask *elem = &CustodianShmem->task_queue_elems[idx];
+
+ /*
+ * If the task is already queued in this slot or the slot is empty,
+ * enqueue the task here and return.
+ */
+ if (*elem == INVALID_CUSTODIAN_TASK || *elem == task)
+ {
+ *elem = task;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+ return;
+ }
+ }
+
+ /* We should never run out of space in the queue. */
+ elog(ERROR, "could not enqueue custodian task %d", task);
+ pg_unreachable();
+}
+
+/*
+ * CustodianGetNextTask
+ * Retrieve the next task that the custodian should execute
+ *
+ * The returned task is dequeued from the custodian's queue. If no tasks are
+ * queued, INVALID_CUSTODIAN_TASK is returned.
+ */
+static CustodianTask
+CustodianGetNextTask(void)
+{
+ CustodianTask next_task;
+ CustodianTask *elem;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+
+ elem = &CustodianShmem->task_queue_elems[CustodianShmem->task_queue_head];
+
+ next_task = *elem;
+ *elem = INVALID_CUSTODIAN_TASK;
+
+ CustodianShmem->task_queue_head++;
+ CustodianShmem->task_queue_head %= NUM_CUSTODIAN_TASKS;
+
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return next_task;
+}
+
+/*
+ * LookupCustodianFunctions
+ * Given a custodian task, look up its function pointers.
+ */
+static const struct cust_task_funcs_entry *
+LookupCustodianFunctions(CustodianTask task)
+{
+ const struct cust_task_funcs_entry *entry;
+
+ Assert(task >= 0 && task < NUM_CUSTODIAN_TASKS);
+
+ for (entry = cust_task_functions;
+ entry && entry->task != INVALID_CUSTODIAN_TASK;
+ entry++)
+ {
+ if (entry->task == task)
+ return entry;
+ }
+
+ /* All tasks must have an entry. */
+ elog(ERROR, "could not lookup functions for custodian task %d", task);
+ pg_unreachable();
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..faaaba6a21 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -6,6 +6,7 @@ backend_sources += files(
'bgworker.c',
'bgwriter.c',
'checkpointer.c',
+ 'custodian.c',
'fork_process.c',
'interrupt.c',
'pgarch.c',
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..e3aef4081e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -249,6 +249,7 @@ bool send_abort_for_kill = false;
static pid_t StartupPID = 0,
BgWriterPID = 0,
CheckpointerPID = 0,
+ CustodianPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
AutoVacPID = 0,
@@ -560,6 +561,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartArchiver() StartChildProcess(ArchiverProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartCustodian() StartChildProcess(CustodianProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1795,13 +1797,16 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and custodian.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
{
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
}
@@ -2732,6 +2737,8 @@ process_pm_reload_request(void)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -3085,6 +3092,8 @@ process_pm_child_exit(void)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (CustodianPID == 0)
+ CustodianPID = StartCustodian();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -3178,6 +3187,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the custodian? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == CustodianPID)
+ {
+ CustodianPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("custodian process"));
+ continue;
+ }
+
/*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
@@ -3590,6 +3613,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (CheckpointerPID != 0 && take_action)
sigquit_child(CheckpointerPID);
+ /* Take care of the custodian too */
+ if (pid == CustodianPID)
+ CustodianPID = 0;
+ else if (CustodianPID != 0 && take_action)
+ sigquit_child(CustodianPID);
+
/* Take care of the walwriter too */
if (pid == WalWriterPID)
WalWriterPID = 0;
@@ -3744,6 +3773,9 @@ PostmasterStateMachine(void)
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+ /* and the custodian too */
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
@@ -3781,6 +3813,7 @@ PostmasterStateMachine(void)
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
+ CustodianPID == 0 &&
WalWriterPID == 0 &&
AutoVacPID == 0)
{
@@ -3877,6 +3910,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(CustodianPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -4092,6 +4126,8 @@ TerminateChildren(int signal)
signal_child(BgWriterPID, signal);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, signal);
+ if (CustodianPID != 0)
+ signal_child(CustodianPID, signal);
if (WalWriterPID != 0)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..4268a941ad 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
@@ -130,6 +131,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
+ size = add_size(size, CustodianShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
@@ -278,6 +280,7 @@ CreateSharedMemoryAndSemaphores(void)
PMSignalShmemInit();
ProcSignalShmemInit();
CheckpointerShmemInit();
+ CustodianShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
ReplicationOriginShmemInit();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..40a83636fe 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -178,6 +178,7 @@ InitProcGlobal(void)
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->walwriterLatch = NULL;
ProcGlobal->checkpointerLatch = NULL;
+ ProcGlobal->custodianLatch = NULL;
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0e07e0848d..f05590f2c4 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -224,7 +224,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and Custodian IO is not tracked in pg_stat_io for
+* now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -243,6 +244,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
{
case B_INVALID:
case B_ARCHIVER:
+ case B_CUSTODIAN:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index cb99cc6339..6ca751dd1f 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -224,6 +224,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_CHECKPOINTER_MAIN:
event_name = "CheckpointerMain";
break;
+ case WAIT_EVENT_CUSTODIAN_MAIN:
+ event_name = "CustodianMain";
+ break;
case WAIT_EVENT_LOGICAL_APPLY_MAIN:
event_name = "LogicalApplyMain";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 59532bbd80..25c2ba97b8 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -283,6 +283,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_CUSTODIAN:
+ backendDesc = "custodian";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c309e0233d..50f407a0bf 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -324,6 +324,7 @@ typedef enum BackendType
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_CUSTODIAN,
B_LOGGER,
B_STANDALONE_BACKEND,
B_STARTUP,
@@ -432,6 +433,7 @@ typedef enum
BgWriterProcess,
ArchiverProcess,
CheckpointerProcess,
+ CustodianProcess,
WalWriterProcess,
WalReceiverProcess,
@@ -444,6 +446,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
+#define AmCustodianProcess() (MyAuxProcType == CustodianProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
new file mode 100644
index 0000000000..73d0bc5f02
--- /dev/null
+++ b/src/include/postmaster/custodian.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * custodian.h
+ * Exports from postmaster/custodian.c.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/custodian.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _CUSTODIAN_H
+#define _CUSTODIAN_H
+
+/*
+ * If you add a new task here, be sure to add its corresponding function
+ * pointers to cust_task_functions in custodian.c.
+ */
+typedef enum CustodianTask
+{
+ FAKE_TASK, /* placeholder until we have a real task */
+
+ NUM_CUSTODIAN_TASKS, /* new tasks go above */
+ INVALID_CUSTODIAN_TASK
+} CustodianTask;
+
+extern void CustodianMain(void) pg_attribute_noreturn();
+extern Size CustodianShmemSize(void);
+extern void CustodianShmemInit(void);
+extern void RequestCustodian(CustodianTask task, Datum arg);
+
+#endif /* _CUSTODIAN_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4258cd92c9..25e00a14ff 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -400,6 +400,8 @@ typedef struct PROC_HDR
Latch *walwriterLatch;
/* Checkpointer process's latch */
Latch *checkpointerLatch;
+ /* Custodian process's latch */
+ Latch *custodianLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
@@ -417,11 +419,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, custodian, WAL writer and archiver run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 9ab23e1c4a..a100dbca3b 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -40,6 +40,7 @@ typedef enum
WAIT_EVENT_BGWRITER_HIBERNATE,
WAIT_EVENT_BGWRITER_MAIN,
WAIT_EVENT_CHECKPOINTER_MAIN,
+ WAIT_EVENT_CUSTODIAN_MAIN,
WAIT_EVENT_LOGICAL_APPLY_MAIN,
WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
WAIT_EVENT_LOGICAL_PARALLEL_APPLY_MAIN,
--
2.25.1
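
For readers following the queue logic in custodian.c above, here is a minimal standalone sketch of the same idea: a fixed-size ring with one slot per task type, where enqueueing an already-queued task is a no-op. It is not the patch's code. The names Task, TaskQueue, queue_init, queue_enqueue and queue_next are hypothetical stand-ins, and the spinlock that protects the real shared-memory queue is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CustodianTask enum in the patch. */
typedef enum
{
    TASK_A,
    TASK_B,
    NUM_TASKS,          /* number of real tasks; also the queue capacity */
    INVALID_TASK        /* marks an empty slot */
} Task;

typedef struct
{
    Task elems[NUM_TASKS];
    int  head;          /* index of the next slot to dequeue */
} TaskQueue;

static void
queue_init(TaskQueue *q)
{
    for (int i = 0; i < NUM_TASKS; i++)
        q->elems[i] = INVALID_TASK;
    q->head = 0;
}

/*
 * Scan from the head. Stop at the first slot that is empty or that already
 * holds this task, so each task appears at most once in the queue. Returns
 * false only if no slot was usable, which cannot happen while the capacity
 * equals the number of distinct tasks.
 */
static bool
queue_enqueue(TaskQueue *q, Task task)
{
    assert((int) task >= 0 && task < NUM_TASKS);

    for (int i = 0; i < NUM_TASKS; i++)
    {
        int idx = (q->head + i) % NUM_TASKS;

        if (q->elems[idx] == INVALID_TASK || q->elems[idx] == task)
        {
            q->elems[idx] = task;
            return true;
        }
    }
    return false;
}

/* Pop the head slot; returns INVALID_TASK if that slot is empty. */
static Task
queue_next(TaskQueue *q)
{
    Task next = q->elems[q->head];

    q->elems[q->head] = INVALID_TASK;
    q->head = (q->head + 1) % NUM_TASKS;
    return next;
}
```

As in the patch, the head advances even when the slot was empty. That is harmless because the enqueue scan always starts from the current head and wraps around.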
Attachment: v20-0002-Move-removal-of-old-serialized-snapshots-to-cust.patch (text/x-diff; charset=us-ascii)
From 02258dd4551c0a2ee9e7357eb9899b9f290f0b06 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 5 Dec 2021 22:02:40 -0800
Subject: [PATCH v20 2/4] Move removal of old serialized snapshots to
custodian.
This was only done during checkpoints because it was a convenient
place to put it. However, if there are many snapshots to remove,
it can significantly extend checkpoint time. To avoid this, move
this work to the newly-introduced custodian process.
---
contrib/test_decoding/expected/rewrite.out | 21 +++++++++++++++++++++
contrib/test_decoding/sql/rewrite.sql | 17 +++++++++++++++++
src/backend/access/transam/xlog.c | 6 ++++--
src/backend/postmaster/custodian.c | 2 ++
src/backend/replication/logical/snapbuild.c | 9 ++++-----
src/include/postmaster/custodian.h | 2 +-
src/include/replication/snapbuild.h | 2 +-
7 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index b30999c436..8b97f15f6f 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -162,3 +162,24 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index 62dead3a9b..d268fa559a 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -105,3 +105,20 @@ DROP TABLE IF EXISTS replication_example;
DROP FUNCTION iamalongfunction();
DROP FUNCTION exec(text);
DROP ROLE regress_justforcomments;
+
+-- make sure custodian cleans up files
+CHECKPOINT;
+DO $$
+DECLARE
+ snaps_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ snaps_removed := count(*) = 0 FROM pg_ls_logicalsnapdir();
+ IF snaps_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..7da9461048 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -76,12 +76,12 @@
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/slot.h"
-#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -6994,10 +6994,12 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
- CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ /* tasks offloaded to custodian */
+ RequestCustodian(CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, (Datum) 0);
+
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 98bb9efcfd..4e0ce1f7b3 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -25,6 +25,7 @@
#include "pgstat.h"
#include "postmaster/custodian.h"
#include "postmaster/interrupt.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
@@ -70,6 +71,7 @@ struct cust_task_funcs_entry
* whether the task is already enqueued.
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
+ {CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 62542827e4..f940bb5930 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2036,14 +2036,13 @@ SnapBuildRestoreContents(int fd, char *dest, Size size, const char *path)
/*
* Remove all serialized snapshots that are not required anymore because no
- * slot can need them. This doesn't actually have to run during a checkpoint,
- * but it's a convenient point to schedule this.
+ * slot can need them.
*
- * NB: We run this during checkpoints even if logical decoding is disabled so
- * we cleanup old slots at some point after it got disabled.
+ * NB: We run this even if logical decoding is disabled so we cleanup old slots
+ * at some point after it got disabled.
*/
void
-CheckPointSnapBuild(void)
+RemoveOldSerializedSnapshots(void)
{
XLogRecPtr cutoff;
XLogRecPtr redo;
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index 73d0bc5f02..ab6d4283b9 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -18,7 +18,7 @@
*/
typedef enum CustodianTask
{
- FAKE_TASK, /* placeholder until we have a real task */
+ CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index f49b941b53..5f1ba3842c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -57,7 +57,7 @@ struct ReorderBuffer;
struct xl_heap_new_cid;
struct xl_running_xacts;
-extern void CheckPointSnapBuild(void);
+extern void RemoveOldSerializedSnapshots(void);
extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
TransactionId xmin_horizon, XLogRecPtr start_lsn,
--
2.25.1
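
The patch above registers its task by adding a row to cust_task_functions, a table terminated by an INVALID_CUSTODIAN_TASK sentinel. The sketch below models that pattern outside PostgreSQL, assuming a simplified request path that runs the task immediately (roughly what the patch does in a standalone backend). All names here (Task, task_entry, lookup_task, request_task and the counters) are hypothetical, and a plain long stands in for Datum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CustodianTask enum. */
typedef enum
{
    TASK_REMOVE_SNAPSHOTS,
    TASK_REMOVE_MAPPINGS,
    NUM_TASKS,
    INVALID_TASK
} Task;

typedef void (*TaskFunc) (void);
typedef void (*ArgFunc) (long arg);

struct task_entry
{
    Task     task;
    TaskFunc run;           /* performs the task */
    ArgFunc  handle_arg;    /* optional: stashes request data, may be NULL */
};

/* Illustrative state so the effect of each callback is observable. */
static long saved_cutoff = 0;
static int  snapshot_runs = 0;
static int  mapping_runs = 0;

static void remove_snapshots(void) { snapshot_runs++; }
static void remove_mappings(void)  { mapping_runs++; }
static void save_cutoff(long arg)  { saved_cutoff = arg; }

static const struct task_entry task_table[] = {
    {TASK_REMOVE_SNAPSHOTS, remove_snapshots, NULL},
    {TASK_REMOVE_MAPPINGS, remove_mappings, save_cutoff},
    {INVALID_TASK, NULL, NULL}      /* sentinel, must be last */
};

/* Walk the table until the sentinel; NULL means the task is unregistered. */
static const struct task_entry *
lookup_task(Task task)
{
    for (const struct task_entry *e = task_table; e->task != INVALID_TASK; e++)
    {
        if (e->task == task)
            return e;
    }
    return NULL;
}

/* Handle the argument first, then run the task in the calling process. */
static bool
request_task(Task task, long arg)
{
    const struct task_entry *e = lookup_task(task);

    if (e == NULL)
        return false;
    if (e->handle_arg)
        e->handle_arg(arg);
    e->run();
    return true;
}
```

The argument handler runs before the task itself, which matches the ordering described in the RequestCustodian comment. The real function enqueues the task and wakes the custodian instead of running it inline when a postmaster is present.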
Attachment: v20-0003-Move-removal-of-old-logical-rewrite-mapping-file.patch (text/x-diff; charset=us-ascii)
From e4800e6c85c562ac3856a65e2838fbc3e5a79b63 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Sun, 12 Dec 2021 22:07:11 -0800
Subject: [PATCH v20 3/4] Move removal of old logical rewrite mapping files to
custodian.
If there are many such files to remove, checkpoints can take much
longer. To avoid this, move this work to the newly-introduced
custodian process.
Since the mapping files include 32-bit transaction IDs, there is a
risk of wraparound if the files are not cleaned up fast enough.
Removing these files in checkpoints offered decent wraparound
protection simply due to the relatively high frequency of
checkpointing. With this change, servers should still clean up
mapping files with decently high frequency, but in theory the
wraparound risk might worsen for some (e.g., if the custodian is
spending a lot of time on a different task). Given this is an
existing problem, this change makes no effort to handle the
wraparound risk, and it is left as a future exercise.
---
contrib/test_decoding/expected/rewrite.out | 19 ++++++
contrib/test_decoding/sql/rewrite.sql | 14 ++++
src/backend/access/heap/rewriteheap.c | 78 +++++++++++++++++++---
src/backend/postmaster/custodian.c | 43 ++++++++++++
src/include/access/rewriteheap.h | 1 +
src/include/postmaster/custodian.h | 4 ++
6 files changed, 149 insertions(+), 10 deletions(-)
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 8b97f15f6f..214a514a0a 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -183,3 +183,22 @@ SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
t
(1 row)
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d268fa559a..d66f70f837 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -122,3 +122,17 @@ BEGIN
END
$$;
SELECT count(*) = 0 FROM pg_ls_logicalsnapdir();
+DO $$
+DECLARE
+ mappings_removed bool;
+ loops int := 0;
+BEGIN
+ LOOP
+ mappings_removed := count(*) = 0 FROM pg_ls_logicalmapdir();
+ IF mappings_removed OR loops > 120 * 100 THEN EXIT; END IF;
+ PERFORM pg_sleep(0.01);
+ loops := loops + 1;
+ END LOOP;
+END
+$$;
+SELECT count(*) = 0 FROM pg_ls_logicalmapdir();
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8993c1ed5a..9ea0f81ac3 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/custodian.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -123,6 +124,7 @@
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
#include "utils/rel.h"
/*
@@ -1179,7 +1181,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
- * 1) Remove all mappings not needed anymore based on the logical restart LSN
+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
+ * logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
@@ -1207,6 +1210,9 @@ CheckPointLogicalRewriteHeap(void)
if (cutoff != InvalidXLogRecPtr && redo < cutoff)
cutoff = redo;
+ /* let the custodian know what it can remove */
+ RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS, LSNGetDatum(cutoff));
+
mappings_dir = AllocateDir("pg_logical/mappings");
while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
{
@@ -1239,15 +1245,7 @@ CheckPointLogicalRewriteHeap(void)
lsn = ((uint64) hi) << 32 | lo;
- if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
- {
- elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
- if (unlink(path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- else
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
{
/* on some operating systems fsyncing a file requires O_RDWR */
int fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
@@ -1285,3 +1283,63 @@ CheckPointLogicalRewriteHeap(void)
/* persist directory entries to disk */
fsync_fname("pg_logical/mappings", true);
}
+
+/*
+ * Remove all mappings not needed anymore based on the logical restart LSN saved
+ * by the checkpointer. We use this saved value instead of calling
+ * ReplicationSlotsComputeLogicalRestartLSN() so that we don't try to remove
+ * files that a concurrent call to CheckPointLogicalRewriteHeap() is trying to
+ * flush to disk.
+ */
+void
+RemoveOldLogicalRewriteMappings(void)
+{
+ XLogRecPtr cutoff;
+ DIR *mappings_dir;
+ struct dirent *mapping_de;
+ char path[MAXPGPATH + 20];
+
+ cutoff = CustodianGetLogicalRewriteCutoff();
+
+ mappings_dir = AllocateDir("pg_logical/mappings");
+ while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
+ {
+ Oid dboid;
+ Oid relid;
+ XLogRecPtr lsn;
+ TransactionId rewrite_xid;
+ TransactionId create_xid;
+ uint32 hi,
+ lo;
+ PGFileType de_type;
+
+ if (strcmp(mapping_de->d_name, ".") == 0 ||
+ strcmp(mapping_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
+ de_type = get_dirent_type(path, mapping_de, false, DEBUG1);
+
+ if (de_type != PGFILETYPE_ERROR && de_type != PGFILETYPE_REG)
+ continue;
+
+ /* Skip over files that cannot be ours. */
+ if (strncmp(mapping_de->d_name, "map-", 4) != 0)
+ continue;
+
+ if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
+ &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
+ elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
+
+ lsn = ((uint64) hi) << 32 | lo;
+ if (lsn >= cutoff && cutoff != InvalidXLogRecPtr)
+ continue;
+
+ elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ FreeDir(mappings_dir);
+}
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 4e0ce1f7b3..4cbd89fae9 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include "access/rewriteheap.h"
#include "libpq/pqsignal.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
@@ -33,11 +34,13 @@
#include "storage/procsignal.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
static void DoCustodianTasks(void);
static CustodianTask CustodianGetNextTask(void);
static void CustodianEnqueueTask(CustodianTask task);
static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+static void CustodianSetLogicalRewriteCutoff(Datum arg);
typedef struct
{
@@ -45,6 +48,8 @@ typedef struct
CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
int task_queue_head;
+
+ XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */
} CustodianShmemStruct;
static CustodianShmemStruct *CustodianShmem;
@@ -72,6 +77,7 @@ struct cust_task_funcs_entry
*/
static const struct cust_task_funcs_entry cust_task_functions[] = {
{CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS, RemoveOldSerializedSnapshots, NULL},
+ {CUSTODIAN_REMOVE_REWRITE_MAPPINGS, RemoveOldLogicalRewriteMappings, CustodianSetLogicalRewriteCutoff},
{INVALID_CUSTODIAN_TASK, NULL, NULL} /* must be last */
};
@@ -377,3 +383,40 @@ LookupCustodianFunctions(CustodianTask task)
elog(ERROR, "could not lookup functions for custodian task %d", task);
pg_unreachable();
}
+
+/*
+ * Stores the provided cutoff LSN in the custodian's shared memory.
+ *
+ * It's okay if the cutoff LSN is updated before a previously set cutoff has
+ * been used for cleaning up files. If that happens, it just means that the
+ * next invocation of RemoveOldLogicalRewriteMappings() will use a more accurate
+ * cutoff.
+ */
+static void
+CustodianSetLogicalRewriteCutoff(Datum arg)
+{
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ CustodianShmem->logical_rewrite_mappings_cutoff = DatumGetLSN(arg);
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ /* if pass-by-ref, free Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(arg));
+#endif
+}
+
+/*
+ * Used by the custodian to determine which logical rewrite mapping files it can
+ * remove.
+ */
+XLogRecPtr
+CustodianGetLogicalRewriteCutoff(void)
+{
+ XLogRecPtr cutoff;
+
+ SpinLockAcquire(&CustodianShmem->cust_lck);
+ cutoff = CustodianShmem->logical_rewrite_mappings_cutoff;
+ SpinLockRelease(&CustodianShmem->cust_lck);
+
+ return cutoff;
+}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 1125457053..dc3eb3e308 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -53,5 +53,6 @@ typedef struct LogicalRewriteMappingData
*/
#define LOGICAL_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
extern void CheckPointLogicalRewriteHeap(void);
+extern void RemoveOldLogicalRewriteMappings(void);
#endif /* REWRITE_HEAP_H */
diff --git a/src/include/postmaster/custodian.h b/src/include/postmaster/custodian.h
index ab6d4283b9..00280c203b 100644
--- a/src/include/postmaster/custodian.h
+++ b/src/include/postmaster/custodian.h
@@ -12,6 +12,8 @@
#ifndef _CUSTODIAN_H
#define _CUSTODIAN_H
+#include "access/xlogdefs.h"
+
/*
* If you add a new task here, be sure to add its corresponding function
* pointers to cust_task_functions in custodian.c.
@@ -19,6 +21,7 @@
typedef enum CustodianTask
{
CUSTODIAN_REMOVE_SERIALIZED_SNAPSHOTS,
+ CUSTODIAN_REMOVE_REWRITE_MAPPINGS,
NUM_CUSTODIAN_TASKS, /* new tasks go above */
INVALID_CUSTODIAN_TASK
@@ -28,5 +31,6 @@ extern void CustodianMain(void) pg_attribute_noreturn();
extern Size CustodianShmemSize(void);
extern void CustodianShmemInit(void);
extern void RequestCustodian(CustodianTask task, Datum arg);
+extern XLogRecPtr CustodianGetLogicalRewriteCutoff(void);
#endif /* _CUSTODIAN_H */
--
2.25.1
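
The removal rule in RemoveOldLogicalRewriteMappings() above comes down to parsing the LSN out of each file name and comparing it with the saved cutoff. Below is a minimal sketch of just that decision, assuming the file name format shown in rewriteheap.h. The function name mapping_file_action, its return codes, and the REWRITE_FORMAT and INVALID_LSN macros are hypothetical. Plain unsigned ints stand in for Oid and TransactionId.

```c
#include <stdint.h>
#include <stdio.h>

/* Same layout as LOGICAL_REWRITE_FORMAT in the patch context. */
#define REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"
#define INVALID_LSN ((uint64_t) 0)

/*
 * Returns -1 if the name does not parse as a mapping file, 0 if the file
 * must be kept, and 1 if it may be removed.
 */
static int
mapping_file_action(const char *name, uint64_t cutoff)
{
    unsigned int dboid, relid, hi, lo, rewrite_xid, create_xid;
    uint64_t     lsn;

    if (sscanf(name, REWRITE_FORMAT,
               &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
        return -1;

    /* The LSN is stored as two 32-bit hex halves separated by '_'. */
    lsn = ((uint64_t) hi << 32) | lo;

    /* Keep files at or beyond a valid cutoff; everything else can go. */
    if (cutoff != INVALID_LSN && lsn >= cutoff)
        return 0;
    return 1;
}
```

Note that an invalid cutoff makes every parsable mapping file removable, which mirrors the condition in the patch. The patch raises an error on an unparsable name, whereas this sketch only reports it.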
Attachment: v20-0004-Do-not-delay-shutdown-due-to-long-running-custod.patch (text/x-diff; charset=us-ascii)
From 5dd4a58a87005d440b905c4e4b6a16fd7622f4b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 28 Nov 2022 15:15:37 -0800
Subject: [PATCH v20 4/4] Do not delay shutdown due to long-running custodian
tasks.
These tasks are not essential enough to delay shutdown and can be
retried the next time the server is running.
---
src/backend/access/heap/rewriteheap.c | 9 +++++++++
src/backend/postmaster/custodian.c | 8 ++++++++
src/backend/replication/logical/snapbuild.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9ea0f81ac3..3ee635fe77 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -117,6 +117,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "storage/bufmgr.h"
@@ -1313,6 +1314,14 @@ RemoveOldLogicalRewriteMappings(void)
lo;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(mapping_de->d_name, ".") == 0 ||
strcmp(mapping_de->d_name, "..") == 0)
continue;
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
index 4cbd89fae9..274b2d4a79 100644
--- a/src/backend/postmaster/custodian.c
+++ b/src/backend/postmaster/custodian.c
@@ -226,6 +226,14 @@ DoCustodianTasks(void)
{
CustodianTaskFunction func = (LookupCustodianFunctions(task))->task_func;
+ /*
+ * Custodian tasks are not essential enough to delay shutdown, so bail
+ * out if there's a pending shutdown request. Tasks should be
+ * requested again and retried the next time the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
PG_TRY();
{
(*func) ();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f940bb5930..ca2e2a3e5b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,6 +126,7 @@
#include "common/file_utils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "replication/snapbuild.h"
@@ -2072,6 +2073,14 @@ RemoveOldSerializedSnapshots(void)
XLogRecPtr lsn;
PGFileType de_type;
+ /*
+ * This task is not essential enough to delay shutdown, so bail out if
+ * there's a pending shutdown request. We'll try again the next time
+ * the server is running.
+ */
+ if (ShutdownRequestPending)
+ break;
+
if (strcmp(snap_de->d_name, ".") == 0 ||
strcmp(snap_de->d_name, "..") == 0)
continue;
--
2.25.1
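The bail-out pattern added in the three hunks above can be sketched outside the server as follows. This is a minimal standalone illustration, not PostgreSQL code: shutdown_request_pending, sketch_handle_shutdown, and sketch_remove_old_entries are hypothetical stand-ins (the real flag is ShutdownRequestPending, declared in postmaster/interrupt.h), and the directory scan is replaced by a plain array of names.

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for ShutdownRequestPending; set from a signal handler. */
static volatile sig_atomic_t shutdown_request_pending = 0;

/* Hypothetical handler: only records that a shutdown was requested. */
static void
sketch_handle_shutdown(int signo)
{
	(void) signo;
	shutdown_request_pending = 1;
}

/*
 * Walk a list of directory entry names and "remove" each one.  Returns the
 * number of entries processed.  The loop gives up as soon as a shutdown
 * request is pending; whatever is left is picked up on the next run.
 */
static int
sketch_remove_old_entries(const char *const *names, int nnames)
{
	int			removed = 0;

	assert(names != NULL || nnames == 0);

	for (int i = 0; i < nnames; i++)
	{
		/* Not essential enough to delay shutdown, so bail out. */
		if (shutdown_request_pending)
			break;

		if (strcmp(names[i], ".") == 0 ||
			strcmp(names[i], "..") == 0)
			continue;

		/* A real task would validate the name and unlink() the file here. */
		removed++;
	}

	return removed;
}
```

Because the flag is checked once per entry, the worst-case shutdown delay in this sketch is the time needed to process a single entry rather than the whole directory.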
Nathan Bossart <nathandbossart@gmail.com> writes:
another rebase for cfbot
I took a brief look through v20, and generally liked what I saw,
but there are a few things troubling me:
* The comments for CustodianEnqueueTask claim that it won't enqueue an
already-queued task, but I don't think I believe that, because it stops
scanning as soon as it finds an empty slot. That data structure seems
quite oddly designed in any case. Why isn't it simply an array of
need-to-run-this-one booleans indexed by the CustodianTask enum?
Fairness of dispatch could be ensured by the same state variable that
CustodianGetNextTask already uses to track which array element to
inspect next. While that wouldn't guarantee that tasks A and B are
dispatched in the same order they were requested in, I'm not sure why
we should care.
* I don't much like cust_lck, mainly because you didn't bother to
document what it protects (in general, CustodianShmemStruct deserves
more than zero commentary). Do we need it at all? If the task-needed
flags were sig_atomic_t not bool, we probably don't need it for the
basic job of tracking which tasks remain to be run. I see that some
of the tasks have possibly-non-atomically-assigned parameters to be
transmitted, but restricting cust_lck to protect those seems like a
better idea.
* Not quite convinced about handle_arg_func, mainly because the Datum
API would be pretty inconvenient for any task with more than one arg.
Why do we need that at all, rather than saying that callers should
set up any required parameters separately before invoking
RequestCustodian?
* Why does LookupCustodianFunctions think it needs to search the
constant array?
* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing. I'm sad to see that gone, what became of it?
regards, tom lane
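The flag-array design suggested in the first two points might look roughly like the sketch below. It is only an illustration under stated assumptions: SketchTask, task_requested, next_task_to_check, sketch_request_task, and sketch_get_next_task are hypothetical names, the third task is invented for the example, and the state lives in ordinary process memory here rather than in shared memory.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical task list, loosely modeled on the CustodianTask enum. */
typedef enum SketchTask
{
	SKETCH_REMOVE_SERIALIZED_SNAPSHOTS,
	SKETCH_REMOVE_REWRITE_MAPPINGS,
	SKETCH_REMOVE_TEMP_FILES,	/* invented for this example */

	NUM_SKETCH_TASKS,			/* new tasks go above */
	INVALID_SKETCH_TASK
} SketchTask;

/* One "needs to run" flag per task, indexed directly by the enum. */
static volatile sig_atomic_t task_requested[NUM_SKETCH_TASKS];

/* Where the next scan starts, so that no task can starve the others. */
static int	next_task_to_check = 0;

/* Requesting an already-requested task is a no-op by construction. */
static void
sketch_request_task(SketchTask task)
{
	assert((int) task >= 0 && task < NUM_SKETCH_TASKS);
	task_requested[task] = 1;
}

/*
 * Return the next requested task in round-robin order and clear its flag,
 * or INVALID_SKETCH_TASK if nothing is pending.
 */
static SketchTask
sketch_get_next_task(void)
{
	for (int i = 0; i < NUM_SKETCH_TASKS; i++)
	{
		int			idx = (next_task_to_check + i) % NUM_SKETCH_TASKS;

		if (task_requested[idx])
		{
			task_requested[idx] = 0;
			next_task_to_check = (idx + 1) % NUM_SKETCH_TASKS;
			return (SketchTask) idx;
		}
	}

	return INVALID_SKETCH_TASK;
}
```

Dispatch order follows the scan position rather than request order, which matches the trade-off described above. Any per-task parameters that cannot be assigned atomically would still need separate protection; the sketch does not cover that.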
Hi,
On 2023-04-02 13:40:05 -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
another rebase for cfbot
I took a brief look through v20, and generally liked what I saw,
but there are a few things troubling me:
Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving
serialized logical decoding snapshots and mapping files, to custodian, and
still do. Without further work it increases wraparound risks (the filenames
contain xids), and afaict nothing has been done to ameliorate that.
Without those, the current patch series does not have any tasks:
* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing. I'm sad to see that gone, what became of it?
Greetings,
Andres Freund
On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
I took a brief look through v20, and generally liked what I saw,
but there are a few things troubling me:
Thanks for taking a look.
* The comments for CustodianEnqueueTask claim that it won't enqueue an
already-queued task, but I don't think I believe that, because it stops
scanning as soon as it finds an empty slot. That data structure seems
quite oddly designed in any case. Why isn't it simply an array of
need-to-run-this-one booleans indexed by the CustodianTask enum?
Fairness of dispatch could be ensured by the same state variable that
CustodianGetNextTask already uses to track which array element to
inspect next. While that wouldn't guarantee that tasks A and B are
dispatched in the same order they were requested in, I'm not sure why
we should care.
That works. Will update.
* I don't much like cust_lck, mainly because you didn't bother to
document what it protects (in general, CustodianShmemStruct deserves
more than zero commentary). Do we need it at all? If the task-needed
flags were sig_atomic_t not bool, we probably don't need it for the
basic job of tracking which tasks remain to be run. I see that some
of the tasks have possibly-non-atomically-assigned parameters to be
transmitted, but restricting cust_lck to protect those seems like a
better idea.
Will do.
* Not quite convinced about handle_arg_func, mainly because the Datum
API would be pretty inconvenient for any task with more than one arg.
Why do we need that at all, rather than saying that callers should
set up any required parameters separately before invoking
RequestCustodian?
I had done it this way earlier, but added the Datum argument based on
feedback upthread [0]. It presently has only one proposed use, anyway, so
I think it would be fine to switch it back for now.
* Why does LookupCustodianFunctions think it needs to search the
constant array?
The order of the tasks in the array isn't guaranteed to match the order in
the CustodianTask enum.
* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing. I'm sad to see that gone, what became of it?
I postponed that based on advice from upthread [1]. I was hoping to start
a dedicated thread for that immediately after the custodian infrastructure
was committed. FWIW I agree that it's the most useful task of what's
proposed thus far.
[0]: /messages/by-id/20220703172732.wembjsb55xl63vuw@awork3.anarazel.de
[1]: /messages/by-id/CANbhV-EagKLoUH7tLEfg__VcLu37LY78F8gvLMzHrRZyZKm6sw@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sun, Apr 02, 2023 at 11:42:26AM -0700, Andres Freund wrote:
Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving
serialized logical decoding snapshots and mapping files, to custodian, and
still do. Without further work it increases wraparound risks (the filenames
contain xids), and afaict nothing has been done to ameliorate that.
From your feedback earlier [0], I was under the (perhaps false) impression
that adding a note about this existing issue in the commit message was
sufficient, at least initially. I did add such a note in 0003, but it's
missing from 0002 for some reason. I suspect I left it out because the
serialized snapshot file names do not contain XIDs. You cleared that up
earlier [1], so this is my bad.
It's been a little while since I dug into this, but I do see your point
that the wraparound risk could be higher in some cases. For example, if
you have a billion temp files to clean up, the custodian could be stuck on
that task for a long time. I will give this some further thought. I'm all
ears if anyone has ideas about how to reduce this risk.
[0]: /messages/by-id/20220702225456.zit5kjdtdfqmjujt@alap3.anarazel.de
[1]: /messages/by-id/20220217065938.x2esfdppzypegn5j@alap3.anarazel.de
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
* Why does LookupCustodianFunctions think it needs to search the
constant array?
The order of the tasks in the array isn't guaranteed to match the order in
the CustodianTask enum.
Why not? It's a constant array, we can surely manage to make its
order match the enum.
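One way to keep a constant array in enum order, so that lookup is a direct index rather than a search, is to use C99 designated initializers plus a compile-time size check. The sketch below is an assumption about how that could be done, not the patch's code: DemoTask, DemoTaskOps, demo_task_ops, demo_lookup, and the two stub functions are hypothetical names.

```c
#include <assert.h>

/* Hypothetical enum, loosely modeled on CustodianTask. */
typedef enum DemoTask
{
	DEMO_REMOVE_SERIALIZED_SNAPSHOTS,
	DEMO_REMOVE_REWRITE_MAPPINGS,

	NUM_DEMO_TASKS,				/* new tasks go above */
	INVALID_DEMO_TASK
} DemoTask;

typedef void (*DemoTaskFunction) (void);

typedef struct DemoTaskOps
{
	const char *name;
	DemoTaskFunction task_func;
} DemoTaskOps;

/* Stub task bodies that just count how often they were invoked. */
static int	snapshots_runs = 0;
static int	mappings_runs = 0;

static void demo_remove_snapshots(void) { snapshots_runs++; }
static void demo_remove_mappings(void) { mappings_runs++; }

/* Each entry is pinned to its enum value, whatever order it is written in. */
static const DemoTaskOps demo_task_ops[] = {
	[DEMO_REMOVE_REWRITE_MAPPINGS] = {"rewrite mappings", demo_remove_mappings},
	[DEMO_REMOVE_SERIALIZED_SNAPSHOTS] = {"serialized snapshots", demo_remove_snapshots},
};

/* Catches an entry added beyond the enum's range at compile time. */
_Static_assert(sizeof(demo_task_ops) / sizeof(demo_task_ops[0]) == NUM_DEMO_TASKS,
			   "demo_task_ops must have one entry per DemoTask");

/* Lookup is a bounds-checked array index; no search is needed. */
static const DemoTaskOps *
demo_lookup(DemoTask task)
{
	assert((int) task >= 0 && task < NUM_DEMO_TASKS);
	return &demo_task_ops[task];
}
```

The size check does not catch an omitted entry in the middle of the array (it would simply be zero-initialized), so a NULL check on task_func may still be worthwhile.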
* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing. I'm sad to see that gone, what became of it?
I postponed that based on advice from upthread [1]. I was hoping to start
a dedicated thread for that immediately after the custodian infrastructure
was committed. FWIW I agree that it's the most useful task of what's
proposed thus far.
Hmm, given Andres' objections there's little point in moving forward
without that task.
regards, tom lane
Nathan Bossart <nathandbossart@gmail.com> writes:
It's been a little while since I dug into this, but I do see your point
that the wraparound risk could be higher in some cases. For example, if
you have a billion temp files to clean up, the custodian could be stuck on
that task for a long time. I will give this some further thought. I'm all
ears if anyone has ideas about how to reduce this risk.
I wonder if a single long-lived custodian task is the right model at all.
At least for RemovePgTempFiles, it'd make more sense to write it as a
background worker that spawns, does its work, and then exits,
independently of anything else. Of course, then you need some mechanism
for ensuring that a bgworker slot is available when needed, but that
doesn't seem horridly difficult --- we could have a few "reserved
bgworker" slots, perhaps. An idle bgworker slot doesn't cost much.
regards, tom lane
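To make the "reserved bgworker slots" idea concrete, here is a toy slot allocator in which ordinary workers can never consume the last few slots. This is purely a sketch of the reservation concept and does not reflect the actual background worker slot management; SKETCH_TOTAL_SLOTS, SKETCH_RESERVED_SLOTS, sketch_acquire_slot, and sketch_release_slot are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_TOTAL_SLOTS		8
#define SKETCH_RESERVED_SLOTS	2	/* usable only by maintenance workers */

static bool sketch_slot_in_use[SKETCH_TOTAL_SLOTS];

/*
 * Claim a free slot and return its index, or -1 if none is available.
 * Ordinary workers may use only the unreserved range.  Maintenance workers
 * scan from the top, so they take reserved slots first and fall back to the
 * general range only when the reserved ones are busy.
 */
static int
sketch_acquire_slot(bool is_maintenance)
{
	if (is_maintenance)
	{
		for (int i = SKETCH_TOTAL_SLOTS - 1; i >= 0; i--)
		{
			if (!sketch_slot_in_use[i])
			{
				sketch_slot_in_use[i] = true;
				return i;
			}
		}
	}
	else
	{
		for (int i = 0; i < SKETCH_TOTAL_SLOTS - SKETCH_RESERVED_SLOTS; i++)
		{
			if (!sketch_slot_in_use[i])
			{
				sketch_slot_in_use[i] = true;
				return i;
			}
		}
	}

	return -1;
}

/* Give a slot back once the worker has done its work and exited. */
static void
sketch_release_slot(int slot)
{
	assert(slot >= 0 && slot < SKETCH_TOTAL_SLOTS);
	assert(sketch_slot_in_use[slot]);
	sketch_slot_in_use[slot] = false;
}
```

With this layout a spawn-work-exit worker for something like temporary file cleanup would still find a slot even when every general-purpose slot is taken.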
On Sun, Apr 02, 2023 at 04:23:05PM -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
* Why does LookupCustodianFunctions think it needs to search the
constant array?
The order of the tasks in the array isn't guaranteed to match the order in
the CustodianTask enum.
Why not? It's a constant array, we can surely manage to make its
order match the enum.
Alright. I'll change this.
* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing. I'm sad to see that gone, what became of it?
I postponed that based on advice from upthread [1]. I was hoping to start
a dedicated thread for that immediately after the custodian infrastructure
was committed. FWIW I agree that it's the most useful task of what's
proposed thus far.
Hmm, given Andres' objections there's little point in moving forward
without that task.
Yeah. I should probably tackle that one first and leave the logical tasks
for later, given there is some prerequisite work required.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sun, Apr 02, 2023 at 04:37:38PM -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
It's been a little while since I dug into this, but I do see your point
that the wraparound risk could be higher in some cases. For example, if
you have a billion temp files to clean up, the custodian could be stuck on
that task for a long time. I will give this some further thought. I'm all
ears if anyone has ideas about how to reduce this risk.
I wonder if a single long-lived custodian task is the right model at all.
At least for RemovePgTempFiles, it'd make more sense to write it as a
background worker that spawns, does its work, and then exits,
independently of anything else. Of course, then you need some mechanism
for ensuring that a bgworker slot is available when needed, but that
doesn't seem horridly difficult --- we could have a few "reserved
bgworker" slots, perhaps. An idle bgworker slot doesn't cost much.
This has crossed my mind. Even if we use the custodian for several
different tasks, perhaps it could shut down while not in use. For many
servers, the custodian process will be used sparingly, if at all. And if
we introduce something like custodian_max_workers, perhaps we could dodge
the wraparound issue a bit by setting the default to the number of
supported tasks. That being said, this approach adds some complexity.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
I sent this one to the next commitfest and marked it as waiting-on-author
and targeted for v17. I'm aiming to have something that addresses the
latest feedback ready for the July commitfest.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
I sent this one to the next commitfest and marked it as waiting-on-author
and targeted for v17. I'm aiming to have something that addresses the
latest feedback ready for the July commitfest.
Have you had a chance to look at this such that there is something ready?
--
Daniel Gustafsson
On Tue, Jul 04, 2023 at 09:30:43AM +0200, Daniel Gustafsson wrote:
On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
I sent this one to the next commitfest and marked it as waiting-on-author
and targeted for v17. I'm aiming to have something that addresses the
latest feedback ready for the July commitfest.
Have you had a chance to look at this such that there is something ready?
Not yet, sorry.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com