[PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

Started by Nitin Motiani · 4 months ago · 7 messages
#1 Nitin Motiani
nitinmotiani@google.com
1 attachment(s)

Hi Hackers,

I'd like to propose a patch to allow accepting connections post recovery
without waiting for the removal of old xlog files.

*Why* : We have seen instances where the *crash recovery takes very long
(tens of minutes to hours)* if a large number of accumulated WAL files need
to be cleaned up (e.g. cleaning up 2M old WAL files took close to 4 hours).

This WAL accumulation is usually caused by :

1. Inactive replication slot
2. PITR failing to keep up

In the above cases, when the resolution (deleting the inactive slot /
disabling PITR) is followed by a crash before a checkpoint could run, we see
the recovery take a very long time. Note that in these cases the actual WAL
replay finishes relatively quickly and most of the delay is due to
RemoveOldXlogFiles().

*How* : This patch solves this issue by running RemoveOldXlogFiles()
separately and asynchronously. This is achieved by doing two things :

1. *Skip RemoveOldXlogFiles() for an END_OF_RECOVERY checkpoint*. This will
ensure that the recovery finishes sooner and postgres can start accepting
connections.
2. *After recovery, we request another checkpoint without CHECKPOINT_WAIT*.
This is done in StartupXLOG(). This will lead to some extra work, but that
should be minuscule as it is run right after recovery. And the majority of
the work done by this checkpoint will be in RemoveOldXlogFiles(), which can
now run asynchronously. A condensed sketch of both changes follows below;
the full patch is attached.
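
For readers skimming the archive, here is a condensed, illustrative view of
the two changes. It is only a sketch (grounded in the attached patch, which
has the real diff and comments) and elides the surrounding code :

    /* In CreateCheckPoint() : skip WAL removal for an end-of-recovery checkpoint. */
    if (!(flags & CHECKPOINT_END_OF_RECOVERY))
        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                           checkPoint.ThisTimeLineID);

    /*
     * In StartupXLOG(), after recovery : request a follow-up checkpoint
     * without CHECKPOINT_WAIT so that the checkpointer removes the old
     * files in the background while connections are already accepted.
     */
    if (performedWalRecovery)
        RequestCheckpoint(CHECKPOINT_FORCE);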

I considered a couple of *alternative solutions* before attempting this.

1. One option could be to simply skip the removal of old xlog files during
recovery and let a later checkpoint take care of that. But in case of large
checkpoint_timeout, this could lead to bloat for longer.

2. Another approach might be to separate out RemoveOldXlogFiles() into a new
request. This might also be doable by creating a special checkpoint flag
like CHECKPOINT_ONLY_DELETE_OLD_FILES and using that in
RequestCheckpoint(). This way the second checkpoint would only take care of
file deletion (a rough sketch follows below). I ended up picking my approach
over this because it can be done with a smaller change, which might make it
safer and less error-prone.
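
To make alternative 2 concrete, here is a rough, untested sketch of the
flag-based idea. CHECKPOINT_ONLY_DELETE_OLD_FILES does not exist today; the
flag value, its placement, and the early return below are illustrative only :

    /* Hypothetical flag; an actually-unused bit would have to be picked. */
    #define CHECKPOINT_ONLY_DELETE_OLD_FILES	0x1000

    /* In StartupXLOG(), after recovery : */
    if (performedWalRecovery)
        RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_ONLY_DELETE_OLD_FILES);

    /* In CreateCheckPoint() : do only the WAL cleanup and skip the rest. */
    if (flags & CHECKPOINT_ONLY_DELETE_OLD_FILES)
    {
        /* ... compute _logSegNo and recptr as usual ... */
        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                           checkPoint.ThisTimeLineID);
        return;
    }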

I would like to know what folks think of these alternative approaches vs
the current one.

*Repro Steps* : To repro this, I inserted and deleted a few billion rows
while keeping an inactive replication slot at the publisher. I changed the
wal_segsize to 1MB to increase the number of files for a smaller amount of
data. With 1.5TB worth of WAL files, I could consistently reproduce a
40-minute delay. With 300GB it was around 10 minutes. With the proposed
patch, the connections started being accepted right after redo was done.

These are the steps to reproduce this.

1. I created the instance with following settings :

wal-segsize=1MB (pg_ctl -D $PUB_DATA init -o --wal-segsize=1)

checkpoint_timeout=86400 to stop periodic checkpoints from running (echo
"checkpoint_timeout = 86400" >> $PUB_DATA/postgresql.conf)

max_wal_size=102400 to avoid too many checkpoints during transactions (echo
"max_wal_size = 102400" >> $PUB_DATA/postgresql.conf)

2. I created a database test and then ran the following commands :

create table t_pub(id int);
alter table t_pub replica identity full;
create publication p;
alter publication p add table t_pub;

3. I created a subscriber instance (to create an inactive replication
slot). I didn't change any config settings for this instance. Here also I
created a db test and ran the following commands :

create table t_pub(id int);
create subscription s connection 'application_name=test host=localhost
user=nitinmotiani dbname=test port=5001' publication p;
alter subscription s refresh publication;

4. I stopped the subscriber instance and checked on the first instance
(publisher) that there was an inactive replication slot by running the
following command :

select slot_name, active from pg_replication_slots;

5. I inserted and deleted data by running the following pair of commands
multiple times :

insert into t_pub select generate_series(1, 2000000000);
delete from t_pub;

Depending on the number of times these are run, we can get different
amounts of data in the WAL directory.

6. I dropped the replication slot using the following :

select pg_drop_replication_slot('s');

7. I killed one of the postgres processes to trigger crash recovery.

By checking the logs, I confirmed that the patch reduces the crash recovery
time significantly. With the patch, while the removal of old xlog files was
going on asynchronously, I also ran a few queries, created another table,
inserted data etc., and it all worked.

I'm attaching the patch here. Currently I have not added any tests as I
would like to get feedback on this approach to solve the problem vs the
alternatives. Please let me know what you think.

Thanks & Regards,
Nitin Motiani
Google

Attachments:

v1-0001-Accept-connections-post-recovery-without-waiting-.patch (application/octet-stream)
From fa784e1216439ef01424c8b2576a2441616feb32 Mon Sep 17 00:00:00 2001
From: Nitin Motiani <nitinmotiani@google.com>
Date: Fri, 8 Aug 2025 11:11:52 +0000
Subject: [PATCH v1] Accept connections post recovery without waiting for
 RemoveOldXlogFiles

* This change runs RemoveOldXlogFiles asynchronously from the rest of the
  recovery process, so that we can start accepting connections sooner.

* The purpose of this change is to improve the crash recovery time in the cases
  where the WAL replay is done quickly enough but a lot more time is spent
  removing a large number of old xlog files. This can happen if there is
  WAL accumulation due to an inactive replication slot or PITR not being
  able to keep up. And if the resolution of the above is followed by a crash,
  the recovery can take a long time getting rid of the old accumulated files.

* The solution implemented here is to separate RemoveOldXlogFiles from the
  END_OF_RECOVERY checkpoint. To achieve this, we skip running the call
  to the above function in an END_OF_RECOVERY checkpoint. And after the
  recovery is done, we run a checkpoint without CHECKPOINT_WAIT. This will
  lead to extra bookkeeping work but that should be minuscule as this
  will be run immediately after the recovery. And this can remove the old
  files asynchronously. This extra checkpoint is created in StartupXLOG.

* The alternative solutions considered were
  1. Skip the removal of old xlog files altogether and let a later checkpoint
     take care of that. But if the checkpoint_timeout is large, this can lead
     to a bloat for a long time.
  2. Separate out RemoveOldXlogFiles into its own request, or create a special
     checkpoint flag for that. Since the current approach does something
     quite similar with a smaller code change, it seemed safer to go with it.
---
 src/backend/access/transam/xlog.c | 62 ++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..308fb74d8ff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6231,6 +6231,18 @@ StartupXLOG(void)
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
+	/*
+	 * This is done because we skip RemoveOldXlogFiles() in an
+	 * END_OF_RECOVERY checkpoint to lower the recovery time, and then
+	 * start another checkpoint here in the background which can do the
+	 * cleanup while regular database operations continue. We considered
+	 * adding this to PerformRecoveryXlogAction(), but we need to wait
+	 * for the recovery to be marked complete, which happens after the
+	 * END_OF_RECOVERY checkpoint. The details can be found in the
+	 * comment inside CreateCheckPoint().
+	 */
+	if (performedWalRecovery)
+		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
 /*
@@ -7359,8 +7371,56 @@ CreateCheckPoint(int flags)
 		KeepLogSeg(recptr, &_logSegNo);
 	}
 	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
+	/*
+	 * This is done to start accepting connections
+	 * post recovery without having to wait for the deletion of old xlog files.
+	 * The other part of this logic is in StartupXLOG(). Below is the
+	 * description of the problem and potential solutions.
+	 *
+	 * The problem arises when there are a huge number of xlog files which need
+	 * to be removed and it can delay recovery from a crash. We have seen this
+	 * happen in the following scenarios.
+	 * 1. Inactive replication slot leading to WAL accumulation. If the slot
+	 *    is deleted and a crash happens soon after.
+	 * 2. PITR not able to keep up. If it's disabled and a crash happens soon
+	 *    after.
+	 *
+	 * Possible solutions
+	 * 1. Skip removal of old xlog files during end of recovery. And rely on a later
+	 *    checkpoint to take care of those. The below code on its own can implement
+	 *    this.
+	 * 2. We separate the RemoveOldXlogFiles() into a separate request.
+	 *    Or find a way to refactor the code to finish recovery while the
+	 *    deletion goes in background. This would require making changes
+	 *    to different parts of the code and seems more error-prone. As
+	 *    we'll start accepting connections and will get new queries
+	 *    coming in, we'll need to be sure that the deletion is not
+	 *    interfering with that. This could be doable by keeping track of
+	 *    the lsn from the end of recovery checkpoint but seems riskier than
+	 *    option 3.
+	 * 3. We can combine the above 2 ideas. Instead of separating out the
+	 *    RemoveOldXlogFiles(), we can just run another checkpoint right
+	 *    after the recovery without setting CHECKPOINT_WAIT.
+	 *    This checkpoint will not be an END_OF_RECOVERY checkpoint,
+	 *    but a regular checkpoint which can do the same job in background.
+	 *    While this will lead to some extra bookkeeping work before we
+	 *    reach RemoveOldXlogFiles(), that should be minuscule as it would
+	 *    happen right after the recovery completion. Therefore here we are
+	 *    implementing this solution. The extra checkpoint is created at
+	 *    the end of StartupXLOG (see comment there) after recovery.
+	 *
+	 * Another alternative might be to create another checkpoint flag
+	 * CHECKPOINT_SKIP_OLD_FILES_DELETION and pass that from the caller
+	 * instead of always skipping this step for CHECKPOINT_END_OF_RECOVERY.
+	 * We can create a flag CHECKPOINT_ONLY_DELETE_OLD_FILES
+	 * to be used from the second checkpoint. Which ensures that no extra
+	 * work is done. This might end up very close to option 2.
+	 */
+	if (!(flags & CHECKPOINT_END_OF_RECOVERY))
+	{
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
 					   checkPoint.ThisTimeLineID);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
-- 
2.51.0.355.g5224444f11-goog

#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Nitin Motiani (#1)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:

> Hi Hackers,
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().

It makes sense to improve this.

> How : This patch solves this issue by running RemoveOldXlogFiles() separately and async. This is achieved by doing two things :
>
> 1. Skip RemoveOldXlogFiles() for an END_OF_RECOVERY checkpoint. This will ensure that the recovery finishes sooner and postgres can start accepting connections.
> 2. After the recovery we run another checkpoint without CHECKPOINT_WAIT. This is done in StartupXLOG(). This will lead to some extra work but that should be minuscule as it is run right after the recovery. And the majority of work done by this checkpoint will be in RemoveOldXlogFiles() which can now run asynchronously.
>
> I considered a couple of alternative solutions before attempting this.
>
> 1. One option could be to simply skip the removal of old xlog files during recovery and let a later checkpoint take care of that. But in case of large checkpoint_timeout, this could lead to bloat for longer.
>
> 2. Another approach might be to separate out RemoveOldXlogFiles() in a new request. This might also be doable by creating a special checkpoint flag like CHECKPOINT_ONLY_DELETE_OLD_FILES and using that in RequestCheckpoint(). This way we can have the second checkpoint only take care of file deletion. I ended up picking my approach over this because that can be done with a smaller change which might make it safer and less error-prone.

One of the advantages of this approach over forcing an extra
checkpoint is that you don't need to loop through the entire buffer
pool just to find that mostly nothing is dirty, but yes, this may
add some extra flags and extra checks in the checkpointer code.

--
Regards,
Dilip Kumar
Google

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Nitin Motiani (#1)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:

> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().

Isn't it better to fix the reasons for WAL accumulation? Because even
without recovery, this can fill up the disk. For example, one can use
idle_replication_slot_timeout for inactive slots. Similarly, we can
see what leads to slow PITR and try to avoid that.

--
With Regards,
Amit Kapila.

#4 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#3)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Tue, Sep 9, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
>> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>>
>> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>>
>> This WAL accumulation is usually caused by :
>>
>> 1. Inactive replication slot
>> 2. PITR failing to keep up
>>
>> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
>
> Isn't it better to fix the reasons for WAL accumulation? Because even
> without recovery, this can fill up the disk. For example, one can use
> idle_replication_slot_timeout for inactive slots. Similarly, we can
> see what leads to slow PITR and try to avoid that.

I agree that in an ideal world it's better if the user can set
'idle_replication_slot_timeout' correctly so that WAL accumulation doesn't
happen in the first place. But that's not always the case, and there are
situations where WAL does get accumulated. In this context, the
goal is to address the problem after it has already happened,
minimizing additional downtime for the user. I feel this is a
reasonable goal, although we can think more about whether it is worth
issuing the extra checkpoint to improve this situation.

--
Regards,
Dilip Kumar
Google

#5 Fujii Masao
masao.fujii@gmail.com
In reply to: Nitin Motiani (#1)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:

> Hi Hackers,
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.

As another idea, could crash recovery avoid waiting for the end-of-recovery
checkpoint itself to finish, similar to archive recovery? In other words,
crash recovery would write the end-of-recovery WAL record and request
a checkpoint, but not block until it completes. Thought?

One concern, though: in your case, the first checkpoint after crash recovery
could take a very long time, since it needs to remove a large number of
WAL files. This could delay subsequent checkpoints beyond checkpoint_timeout.
If so, perhaps we'd need to limit how many WAL files a single checkpoint
can remove.

Regards,

--
Fujii Masao

#6 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Fujii Masao (#5)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Tue, Sep 9, 2025 at 1:23 PM Fujii Masao <masao.fujii@gmail.com> wrote:

> On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
>> Hi Hackers,
>>
>> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> As another idea, could crash recovery avoid waiting for the end-of-recovery
> checkpoint itself to finish, similar to archive recovery? In other words,
> crash recovery would write the end-of-recovery WAL record and request
> a checkpoint, but not block until it completes. Thought?

Thanks for your input Fujii. The end-of-recovery checkpoint needs to
set checkpoint.redo to serve as a new recovery starting point. This
prevents a full recovery from the previous checkpoint in the event of
a crash. However, setting checkpoint.redo requires that no other
backend is generating concurrent WAL. For this reason, the
end-of-recovery checkpoint cannot run concurrently.

--
Regards,
Dilip Kumar
Google

#7 Nitin Motiani
nitinmotiani@google.com
In reply to: Fujii Masao (#5)
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles

On Tue, Sep 9, 2025 at 1:23 PM Fujii Masao <masao.fujii@gmail.com> wrote:

> On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
>> Hi Hackers,
>>
>> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> As another idea, could crash recovery avoid waiting for the end-of-recovery
> checkpoint itself to finish, similar to archive recovery? In other words,
> crash recovery would write the end-of-recovery WAL record and request
> a checkpoint, but not block until it completes. Thought?

Thanks for the feedback, Fujii. I'll look into this, although based on
Dilip's reply it is probably not feasible.

> One concern, though: in your case, the first checkpoint after crash recovery
> could take a very long time, since it needs to remove a large number of
> WAL files. This could delay subsequent checkpoints beyond checkpoint_timeout.
> If so, perhaps we'd need to limit how many WAL files a single checkpoint
> can remove.

Is the limiting of WAL files something we want to do only for this
checkpoint, or in general for all checkpoints? A couple of thoughts on
these options :

1. If we only do it for the post end-of-recovery checkpoint, we will
have to add special handling for that case and perhaps that reduces
the simplicity of this approach. Also if we just do it for the first
checkpoint after recovery, a future checkpoint might again spend a lot
of time removing these files and delay subsequent checkpoints.
2. We can do it for all checkpoints but that can cause the bloat to
last for a far longer period.

One alternative might be to provide a GUC to set the number/size of WAL
files or a timeout for this step, though that would require some tuning
from users (a rough, untested sketch of the capping idea is below). Also,
what do you think of the simple method of skipping removal of files at
recovery time and letting future checkpoints take care of it?
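
Just to illustrate the shape of the capping idea, something along these
lines inside the loop that removes old segments might work. This is untested
and max_wal_removal_per_checkpoint is a made-up GUC name :

    int	max_wal_removal_per_checkpoint = 100000;	/* hypothetical GUC */

    /* inside the pg_wal scan in RemoveOldXlogFiles() : */
    int	nremoved = 0;
    ...
    if (nremoved >= max_wal_removal_per_checkpoint)
        break;		/* leave the remaining segments for a later checkpoint */
    /* remove or recycle this segment as we do today */
    nremoved++;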

One reason I went with this solution over the others is that in the
current state, the system is down for the entire duration of the file
removal. With this approach, the only thing which might be delayed is the
checkpoint, and that seems like an improvement. But it would be great
to get your thoughts on this and the other alternatives.

Thanks & Regards,
Nitin Motiani
Google