O(n) tasks cause lengthy startups and checkpoints

Started by Nathan Bossart · over 4 years ago · 89 messages · pgsql-hackers
#1 Nathan Bossart
nathandbossart@gmail.com

Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, provided there is community interest in such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).
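To make the move-aside pattern in (3) and (4) concrete, it could be sketched in plain C as below. This is a standalone sketch with hypothetical directory and function names, not code from any patch; the point is that startup pays only two cheap metadata operations, while the slow per-file removal happens later, off the critical path:

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Startup-time step: move the temp directory aside and recreate it.
 * This is two metadata operations, cheap no matter how many files
 * the old directory contains.
 */
static int
move_temp_dir_aside(const char *dir, const char *staging)
{
    if (rename(dir, staging) != 0 && errno != ENOENT)
        return -1;
    if (mkdir(dir, 0700) != 0 && errno != EEXIST)
        return -1;
    return 0;
}

/*
 * Later, off the critical path, a worker removes the staged files one
 * by one.  Subdirectories are ignored for brevity.
 */
static int
cleanup_staged_dir(const char *staging)
{
    DIR *d = opendir(staging);
    struct dirent *de;
    char path[1024];

    if (d == NULL)
        return (errno == ENOENT) ? 0 : -1;
    while ((de = readdir(d)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", staging, de->d_name);
        unlink(path);
    }
    closedir(d);
    return rmdir(staging);
}
```

Since rename() within a filesystem is atomic, a crash between the two stages just leaves a staging directory for the worker to find and clean up on the next pass.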

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

Nathan

[0]: /messages/by-id/32B59582-AA6C-4609-B08F-2256A271F7A5@amazon.com

#2 SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

+1 to the idea. I don't see a reason why the checkpointer has to do all
of that. Keeping the checkpoint to minimal essential work helps servers
recover faster in the event of a crash.

RemoveOldXlogFiles() is also an O(n) operation that can at least be
avoided during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY)
checkpoint. When a sufficient number of WAL files have accumulated and
the previous checkpoint did not get a chance to clean them up, this can
increase the unavailability of the server.

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

On Wed, Dec 1, 2021 at 12:24 PM Bossart, Nathan <bossartn@amazon.com> wrote:


Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

Nathan

[0] /messages/by-id/32B59582-AA6C-4609-B08F-2256A271F7A5@amazon.com

#3 Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

Hi,

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportionate here.
And there are simplicity benefits in having a predictable cleanup
interlock as well.

I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.

Greetings,

Andres Freund

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#3)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportional here. And
there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks. For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth of the cleanup backlog, so perhaps we would limit (or
even skip) such tasks
only for end-of-recovery and shutdown checkpoints. Perhaps the
startup tasks could be limited in a similar fashion.

I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.

I can look into this. Any pointers would be much appreciated.

Nathan

#5 Euler Taveira
euler@eulerto.com
In reply to: Nathan Bossart (#4)
Re: O(n) tasks cause lengthy startups and checkpoints

On Wed, Dec 1, 2021, at 9:19 PM, Bossart, Nathan wrote:

On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportional here. And
there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks. For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth, so perhaps we would limit (or even skip) such tasks
only for end-of-recovery and shutdown checkpoints. Perhaps the
startup tasks could be limited in a similar fashion.

Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Do you have a use case, or even better, numbers (% of checkpoint /
startup time) that make your proposal worthwhile?

I would try to optimize (1) and (2). However, delayed removal can be a
long-term issue if the new routine cannot keep up with the pace of file
creation (especially if the checkpoints are far apart).

For (3), there is already a GUC that would avoid the slowdown during startup.
Use it if you think the startup time is more important than the disk space
occupied by useless files.

For (4), you are forgetting that the on-disk state of replication slots is
stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happens if there is a
crash before copying the state file?

While we are talking about items (1), (2) and (4), we could probably have an
option to create some ephemeral logical decoding files on a ramdisk (similar
to the statistics directory). I wouldn't like to hijack this thread, but this
proposal could alleviate the possible issues that you pointed out. If people
are interested in this proposal, I can start a new thread about it.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#6 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#1)
Re: O(n) tasks cause lengthy startups and checkpoints

On Thu, Dec 2, 2021 at 1:54 AM Bossart, Nathan <bossartn@amazon.com> wrote:

Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash. However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters. I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
pg_logical/snapshots to remove all snapshots that are no longer
needed. If there are many entries in this directory, this can take
a long time. The note above this function indicates that this is
done during checkpoints simply because it is convenient. IIUC
there is no requirement that this function actually completes for a
given checkpoint. My current idea is to move this to a new
maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
pg_logical/mappings to remove old mappings and flush all remaining
ones. IIUC there is no requirement that the "remove old mappings"
part must complete for a given checkpoint, but the "flush all
remaining" portion allows replay after a checkpoint to only "deal
with the parts of a mapping that have been written out after the
checkpoint started." Therefore, I think we should move the "remove
old mappings" part to a new maintenance worker (probably the same
one as for 1), and we should consider using syncfs() for the "flush
all remaining" part. (I suspect the main argument against the
latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
temporary files to individually remove. This step is already
optionally done after a crash via the remove_temp_files_after_crash
GUC. I propose that we have startup move the temporary file
directories aside and create new ones, and then a separate worker
(probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
has been spilled to disk. This code appears to be written to avoid
deleting different types of files in these directories, but AFAICT
there shouldn't be any other files. Therefore, I think we could do
something similar to 3 (i.e., move the directories aside during
startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing. I imagine such a
process would come in handy down the road, too. WDYT?

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that the checkpointer has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called the background cleaner or bg cleaner; of course, we can
have a GUC to enable or disable it). The checkpointer could just write
some marker files (for instance, snapshot_<cutofflsn> files whose names
encode the cutoff LSN so that the new bg cleaner knows which snapshot
files to remove; similar marker files could be written for other kinds
of file removal). Having said that, a new bg cleaner deleting files
asynchronously on behalf of the checkpointer can look like overkill
until we have some numbers showing what this approach could save. For
that purpose, I ran a small experiment to figure out how long file
deletion typically takes [1]: on an SSD, deleting 1 million files took
about 8 seconds, and I'm sure it would be much slower on an HDD.

The bg cleaner could also handle RemovePgTempFiles(): the postmaster
would simply rename pgsql_tmp to something like pgsql_tmp_delete, then
proceed with server startup, and the bg cleaner would then delete the
files.
Also, we could do something similar for removing/recycling old xlog
files and StartupReorderBuffer.
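The marker-file handoff sketched above could look something like the following. The snapshot_<cutofflsn> naming and the helper functions are purely illustrative (the real serialized snapshots live in pg_logical/snapshots and are named after their LSN):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Parse the cutoff LSN that the checkpointer encoded in a marker file
 * name such as "snapshot_cutoff_15E4D30"; returns 0 if the name does
 * not match.
 */
static unsigned long long
parse_cutoff(const char *marker_name)
{
    unsigned long long lsn = 0;

    if (sscanf(marker_name, "snapshot_cutoff_%llx", &lsn) != 1)
        return 0;
    return lsn;
}

/*
 * Remove every "<hexlsn>.snap" file in dir whose LSN is below cutoff;
 * other files are left alone.  Returns the number of files removed.
 */
static int
remove_old_snapshots(const char *dir, unsigned long long cutoff)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int removed = 0;

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL)
    {
        unsigned long long lsn;
        size_t len = strlen(de->d_name);
        char path[1024];

        if (len < 6 || strcmp(de->d_name + len - 5, ".snap") != 0)
            continue;           /* not a snapshot file */
        if (sscanf(de->d_name, "%llx", &lsn) != 1)
            continue;
        if (lsn >= cutoff)
            continue;           /* still needed */
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}
```

Encoding the cutoff in the marker's name means the checkpointer only creates one small file, and the cleaner needs no shared memory to know what is safe to delete.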

Another idea could be to parallelize the checkpoint, i.e., IIUC, the
tasks that the checkpointer does in CheckPointGuts() are independent,
and if we had some counters (e.g., how many snapshot/mapping files the
server has generated)

[1]: on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds

Regards,
Bharath Rupireddy.

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Euler Taveira (#5)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 6:06 PM, "Euler Taveira" <euler@eulerto.com> wrote:

Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Did you have a use case or even better numbers (% of checkpoint /
startup time) that makes your proposal worthwhile?

I don't have specific numbers on hand, but each of the four functions
I listed is something I routinely see impacting customers.

For (3), there is already a GUC that would avoid the slowdown during startup.
Use it if you think the startup time is more important that disk space occupied
by useless files.

Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during restart after a backend crash. The cleanup is always
performed during other startups.

For (4), you are forgetting that the on-disk state of replication slots is
stored in the pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happen if there is a
crash before copying the state file?

Good point. I think it's possible to deal with this, though. Perhaps
the files that should be deleted on startup should go in a separate
directory, or maybe we could devise a way to ensure the state file is
copied even if there is a crash at an inconvenient time.

Nathan

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)

Could you elaborate on this? Is your idea that the checkpointer would
create worker processes like autovacuum does?

Nathan

#9 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#8)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm open to other ideas, too.

Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)

Could you elaborate on this? Is your idea that the checkpointer would
create worker processes like autovacuum does?

Yes, I was thinking that the checkpointer creates one or more dynamic
background workers (we can assume one background worker for now) to
delete the files. If a threshold is crossed (say, the snapshot file
count exceeds it), the new worker gets spawned, enumerates the files,
and deletes the unneeded ones, while the checkpointer proceeds with its
other tasks and finishes the checkpoint. Having said this, I prefer the
background cleaner approach over the dynamic background worker. The
advantage of the background cleaner is that it can do other tasks (like
other kinds of file deletion).

Another idea could be to use the existing background writer to do the
file deletion while the checkpoint is happening. But again, this might
cause problems because the bgwriter's flushing of dirty buffers would
be delayed.

Regards,
Bharath Rupireddy.

#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm too open for other ideas.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Nathan

#11 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#10)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

+1 for the overall idea of making the checkpoint faster. In fact, we
here at our team have been thinking about this problem for a while. If
there are a lot of files that checkpoint has to loop over and remove,
IMO, that task can be delegated to someone else (maybe a background
worker called background cleaner or bg cleaner, of course, we can have
a GUC to enable or disable it). The checkpoint can just write some

Right. IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks. I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem. That being said, I'm all for
exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
temp files etc.) or some other task on behalf of the checkpointer,
seems to be the easiest solution.

I'm too open for other ideas.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Thanks. As I said upthread, we've been discussing the approach of
offloading some of the checkpoint tasks (like deleting snapshot files)
internally for quite some time, and I would like to share a patch that
adds a new background cleaner process (currently able to delete logical
replication snapshot files; if required, it can be extended to do other
tasks as well). I don't mind if it gets rejected. Please have a look.

Regards,
Bharath Rupireddy.

Attachments:

v1-0001-background-cleaner-to-offload-checkpoint-tasks.patch (+579, -3)
#12 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#11)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work. If/when a better idea emerges, we can alter course.

Thanks. As I said upthread we've been discussing the approach of
offloading some of the checkpoint tasks like (deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files, if required can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.

Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.

Nathan

#13 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#12)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/6/21, 11:23 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:

On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:

Thanks. As I said upthread we've been discussing the approach of
offloading some of the checkpoint tasks like (deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files, if required can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.

Thanks for sharing! I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
take a look at your patch as well.

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

This is still very much a work in progress, and I've done minimal
testing so far.

Nathan

Attachments:

v1-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch (+37, -13)
v1-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch (+162, -25)
v1-0002-Also-remove-pgsql_tmp-directories-during-startup.patch (+17, -16)
v1-0001-Introduce-custodian.patch (+297, -6)
v1-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch (+16, -9)
#14 Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#13)
Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

I don't know whether this kind of idea is good or not.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#14)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:

On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway. I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.

...

I don't know whether this kind of idea is good or not.

...

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.

--
Justin

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#14)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 5:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

I don't know whether this kind of idea is good or not.

Thanks for chiming in. I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.

Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup. For now,
all of these tasks are somewhat nonessential. There's no requirement
that any of these tasks complete in order to finish startup or
checkpointing. In fact, outside of preventing the server from running
out of disk space, I don't think there's any requirement that these
tasks run at all. IMO this would have to be a core tenet of a new
auxiliary process like this.

That being said, I totally understand your point. If there were a
dozen such tasks handled by a single auxiliary process, that could
cause a new set of problems. Your checkpointing and startup might be
fast, but you might run out of disk space because our cleanup process
can't handle it all. So a new worker could end up becoming an
availability risk as well.

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges,
I will readily abandon my current approach. Upthread, Andres
mentioned optimizing unnecessary snapshot files, and I mentioned
possibly limiting how much time startup and checkpoints spend on these
tasks. I don't have too many details for the former, and for the
latter, I'm worried about not being able to keep up. But if the
prospect of adding a new auxiliary process for this stuff is a
non-starter, perhaps I should explore that approach some more.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Nathan

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Justin Pryzby (#15)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 9:20 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:

On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

Maybe that's an argument that this should be a dynamic background worker
instead of an auxiliary process. Then maybe it would be controlled by
max_parallel_maintenance_workers (or something similar). The checkpointer
would need to do these tasks itself if parallel workers were disabled or
couldn't be launched.

I think this is an interesting idea. I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns
about resource usage, maybe it's worth it.

Nathan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#16)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#18)
Re: O(n) tasks cause lengthy startups and checkpoints

On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.

Nathan

[0]: /messages/by-id/E7573D54-A8C9-40A8-89D7-0596A36ED124@amazon.com
[1]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com

#20Bruce Momjian
bruce@momjian.us
In reply to: Nathan Bossart (#19)
Re: O(n) tasks cause lengthy startups and checkpoints

On Mon, Dec 13, 2021 at 11:05:46PM +0000, Bossart, Nathan wrote:

On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Well sure. But I've never actually seen that happen.

I'll admit that surprises me. As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server. Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair). FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.

Have we changed temporary file handling in any recent major releases?
That is, is this a current problem, or one already improved in PG 14?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
