O(n) tasks cause lengthy startups and checkpoints

Started by Nathan Bossart · over 4 years ago · 89 messages · pgsql-hackers
#1 Nathan Bossart
nathandbossart@gmail.com

Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, provided there is community interest in such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).
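To make the move-aside pattern in (3) and (4) concrete, it could be sketched in plain C as below. This is a standalone sketch with hypothetical directory and function names, not code from any patch; the point is that startup pays only two cheap metadata operations, while the slow per-file removal happens later, off the critical path:

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Startup-time step: move the temp directory aside and recreate it.
 * This is two metadata operations, cheap no matter how many files
 * the old directory contains.
 */
static int
move_temp_dir_aside(const char *dir, const char *staging)
{
    if (rename(dir, staging) != 0 && errno != ENOENT)
        return -1;
    if (mkdir(dir, 0700) != 0 && errno != EEXIST)
        return -1;
    return 0;
}

/*
 * Later, off the critical path, a worker removes the staged files one
 * by one.  Subdirectories are ignored for brevity.
 */
static int
cleanup_staged_dir(const char *staging)
{
    DIR *d = opendir(staging);
    struct dirent *de;
    char path[1024];

    if (d == NULL)
        return (errno == ENOENT) ? 0 : -1;
    while ((de = readdir(d)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", staging, de->d_name);
        unlink(path);
    }
    closedir(d);
    return rmdir(staging);
}
```

Since rename() within a filesystem is atomic, a crash between the two stages just leaves a staging directory for the worker to find and clean up on the next pass.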

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

Nathan

[0]: /messages/by-id/32B59582-AA6C-4609-B08F-2256A271F7A5@amazon.com

#2 SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

+1 to the idea. I don't see a reason why the checkpointer has to do all
of that. Keeping the checkpoint to minimal essential work helps servers
recover faster in the event of a crash.

RemoveOldXlogFiles() is also an O(n) operation that can at least be
avoided during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY)
checkpoint. When a sufficient number of WAL files have accumulated and
the previous checkpoint did not get a chance to clean them up, this can
increase the unavailability of the server.

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

On Wed, Dec 1, 2021 at 12:24 PM Bossart, Nathan <bossartn@amazon.com> wrote:


Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

Nathan

[0] /messages/by-id/32B59582-AA6C-4609-B08F-2256A271F7A5@amazon.com

#3 Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

Hi,

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportionate here.
And there are simplicity benefits in having a predictable cleanup
interlock as well.

I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.

Greetings,

Andres Freund

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#3)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportional here. And
there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks. For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth of the cleanup backlog, so perhaps we would limit (or
even skip) such tasks
only for end-of-recovery and shutdown checkpoints. Perhaps the
startup tasks could be limited in a similar fashion.

I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.

I can look into this. Any pointers would be much appreciated.

Nathan

#5 Euler Taveira
euler@eulerto.com
In reply to: Nathan Bossart (#4)
Re: O(n) tasks cause lengthy startups and checkpoints

On Wed, Dec 1, 2021, at 9:19 PM, Bossart, Nathan wrote:

On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportional here. And
there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks. For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth, so perhaps we would limit (or even skip) such tasks
only for end-of-recovery and shutdown checkpoints. Perhaps the
startup tasks could be limited in a similar fashion.

Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Do you have a use case, or even better, numbers (% of checkpoint /
startup time) that make your proposal worthwhile?

I would try to optimize (1) and (2). However, delayed removal can be a
long-term issue if the new routine cannot keep up with the pace of file
creation (especially if the checkpoints are far apart).

For (3), there is already a GUC that would avoid the slowdown during startup.
Use it if you think the startup time is more important than the disk space
occupied by useless files.

For (4), you are forgetting that the on-disk state of replication slots is
stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happens if there is a
crash before copying the state file?

While we are talking about items (1), (2) and (4), we could probably have an
option to create some ephemeral logical decoding files on a ramdisk (similar
to the statistics directory). I wouldn't like to hijack this thread, but this
proposal could alleviate the possible issues that you pointed out. If people
are interested in this proposal, I can start a new thread about it.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#6 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

On Thu, Dec 2, 2021 at 1:54 AM Bossart, Nathan <bossartn@amazon.com> wrote:

Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that the checkpointer has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called the background cleaner or bg cleaner; of course, we can
have a GUC to enable or disable it). The checkpointer could just write
some marker files (for instance, snapshot_<cutofflsn> files whose names
encode the cutoff LSN so that the new bg cleaner knows which snapshot
files to remove; similar marker files could be written for other kinds
of file removal). Having said that, a new bg cleaner deleting files
asynchronously on behalf of the checkpointer can look like overkill
until we have some numbers showing what this approach could save. For
that purpose, I ran a small experiment to figure out how long file
deletion typically takes [1]: on an SSD, deleting 1 million files took
about 8 seconds, and I'm sure it would be much slower on an HDD.

The bg cleaner could also handle RemovePgTempFiles(): the postmaster
would simply rename pgsql_tmp to something like pgsql_tmp_delete, then
proceed with server startup, and the bg cleaner would then delete the
files.
Also, we could do something similar for removing/recycling old xlog
files and StartupReorderBuffer.
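The marker-file handoff sketched above could look something like the following. The snapshot_<cutofflsn> naming and the helper functions are purely illustrative (the real serialized snapshots live in pg_logical/snapshots and are named after their LSN):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Parse the cutoff LSN that the checkpointer encoded in a marker file
 * name such as "snapshot_cutoff_15E4D30"; returns 0 if the name does
 * not match.
 */
static unsigned long long
parse_cutoff(const char *marker_name)
{
    unsigned long long lsn = 0;

    if (sscanf(marker_name, "snapshot_cutoff_%llx", &lsn) != 1)
        return 0;
    return lsn;
}

/*
 * Remove every "<hexlsn>.snap" file in dir whose LSN is below cutoff;
 * other files are left alone.  Returns the number of files removed.
 */
static int
remove_old_snapshots(const char *dir, unsigned long long cutoff)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int removed = 0;

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL)
    {
        unsigned long long lsn;
        size_t len = strlen(de->d_name);
        char path[1024];

        if (len < 6 || strcmp(de->d_name + len - 5, ".snap") != 0)
            continue;           /* not a snapshot file */
        if (sscanf(de->d_name, "%llx", &lsn) != 1)
            continue;
        if (lsn >= cutoff)
            continue;           /* still needed */
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}
```

Encoding the cutoff in the marker's name means the checkpointer only creates one small file, and the cleaner needs no shared memory to know what is safe to delete.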

Another idea could be to parallelize the checkpoint, i.e., IIUC, the
tasks that the checkpointer does in CheckPointGuts() are independent,
and if we had some counters (e.g., how many snapshot/mapping files the
server has generated)

[1]: on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds

Regards,
Bharath Rupireddy.

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Euler Taveira (#5)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 6:06 PM, "Euler Taveira" <euler@eulerto.com> wrote:

Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Did you have a use case or even better numbers (% of checkpoint /
startup time) that makes your proposal worthwhile?

I don't have specific numbers on hand, but each of the four functions
I listed is something I routinely see impacting customers.

For (3), there is already a GUC that would avoid the slowdown during startup.
Use it if you think the startup time is more important that disk space occupied
by useless files.

Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during restart after a backend crash. The cleanup is always
performed during other startups.

For (4), you are forgetting that the on-disk state of replication slots is
stored in the pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happen if there is a
crash before copying the state file?

Good point. I think it's possible to deal with this, though. Perhaps
the files that should be deleted on startup should go in a separate
directory, or maybe we could devise a way to ensure the state file is
copied even if there is a crash at an inconvenient time.

Nathan

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)

Could you elaborate on this? Is your idea that the checkpointer would
create worker processes like autovacuum does?

Nathan

#9 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#8)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm open to other ideas, too.

Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)

Could you elaborate on this? Is your idea that the checkpointer would
create worker processes like autovacuum does?

Yes, I was thinking that the checkpointer creates one or more dynamic
background workers (we can assume one background worker for now) to
delete the files. If a threshold is crossed (say, the snapshot file
count exceeds it), the new worker gets spawned, enumerates the files,
and deletes the unneeded ones, while the checkpointer proceeds with its
other tasks and finishes the checkpoint. Having said this, I prefer the
background cleaner approach over the dynamic background worker. The
advantage of the background cleaner is that it can do other tasks (like
other kinds of file deletion).

Another idea could be to use the existing background writer to do the
file deletion while the checkpoint is happening. But again, this might
cause problems because the bgwriter's flushing of dirty buffers would
be delayed.

Regards,
Bharath Rupireddy.

#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm too open for other ideas.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Nathan

#11 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#10)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm too open for other ideas.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Thanks. As I said upthread, we've been discussing the approach of
offloading some of the checkpoint tasks (like deleting snapshot files)
internally for quite some time, and I would like to share a patch that
adds a new background cleaner process (currently able to delete logical
replication snapshot files; if required, it can be extended to do other
tasks as well). I don't mind if it gets rejected. Please have a look.

Regards,
Bharath Rupireddy.

Attachments:

v1-0001-background-cleaner-to-offload-checkpoint-tasks.patch (+579, -3)
#12 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#11)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Thanks. As I said upthread we've been discussing the approach of
offloading some of the checkpoint tasks like (deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files, if required can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.

Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.

Nathan

#13 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#12)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/6/21, 11:23 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:

On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

Thanks. As I said upthread we've been discussing the approach of
offloading some of the checkpoint tasks like (deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files, if required can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.

Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

This is still very much a work in progress, and I've done minimal
testing so far.

Nathan

Attachments:

v1-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (+37, -13)
v1-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (+162, -25)
v1-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (+17, -16)
v1-0001-Introduce-custodian.patch (+297, -6)
v1-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (+16, -9)
#14 Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#13)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

I don't know whether this kind of idea is good or not.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#14)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:

On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.

...

I don't know whether this kind of idea is good or not.

...

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.

--
Justin

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#14)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 5:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

I don't know whether this kind of idea is good or not.

Thanks for chiming in. I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.

Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup. For now,
all of these tasks are somewhat nonessential. There's no requirement
that any of these tasks complete in order to finish startup or
checkpointing. In fact, outside of preventing the server from running
out of disk space, I don't think there's any requirement that these
tasks run at all. IMO this would have to be a core tenet of a new
auxiliary process like this.

That being said, I totally understand your point. If there were a
dozen such tasks handled by a single auxiliary process, that could
cause a new set of problems. Your checkpointing and startup might be
fast, but you might run out of disk space because our cleanup process
can't handle it all. So a new worker could end up becoming an
availability risk as well.

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges,
I will readily abandon my current approach. Upthread, Andres
mentioned optimizing unnecessary snapshot files, and I mentioned
possibly limiting how much time startup and checkpoints spend on these
tasks. I don't have too many details for the former, and for the
latter, I'm worried about not being able to keep up. But if the
prospect of adding a new auxiliary process for this stuff is a
non-starter, perhaps I should explore that approach some more.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Nathan

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Justin Pryzby (#15)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 9:20 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:

On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.

I think this is an interesting idea. I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns
about resource usage, maybe it's worth it.

Nathan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#16)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#18)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.

Nathan

[0]: /messages/by-id/E7573D54-A8C9-40A8-89D7-0596A36ED124@amazon.com
[1]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com

#20Bruce Momjian
bruce@momjian.us
In reply to: Nathan Bossart (#19)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 11:05:46PM +0000, Bossart, Nathan wrote:

On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.

Have we changed temporary file handling in any recent major releases?
That is, is this a current problem, or one already improved in PG 14?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
