Force the old transaction logs cleanup even if checkpoint is skipped
Hi, hackers!
I've stumbled into an interesting problem. Currently, if Postgres has nothing to write, it skips the checkpoint scheduled by the checkpoint_timeout setting. However, we might face a temporary archiving problem (for example, some network issue) that leads to a pile of WAL files stuck in pg_wal. After this temporary issue is gone, we would still be unable to archive them, since we effectively skip the checkpoint because we have nothing to write.
That might lead to a problem: suppose you've run out of disk space because of a temporary archiver failure. After this temporary failure is gone, Postgres would be unable to recover automatically and will require human attention to initiate a CHECKPOINT call.
I suggest changing this behavior by trying to clean up the old WAL even if we skip the main checkpoint routine. I've attached the patch that does exactly that.
What do you think?
To reproduce the issue, you can follow these steps:
1. Init Postgres:
pg_ctl initdb -D /Users/usernamedt/test_archiver
2. Add the archiver script to simulate failure:
➜ ~ cat /Users/usernamedt/command.sh
#!/bin/bash
false
3. Then alter the PostgreSQL conf:
archive_mode = on
checkpoint_timeout = 30s
archive_command = '/Users/usernamedt/command.sh'
log_min_messages = debug1
4. Then start Postgres:
/usr/local/pgsql/bin/pg_ctl -D /Users/usernamedt/test_archiver -l logfile start
5. Insert some data:
pgbench -i -s 30 -d postgres
6. Trigger checkpoint to flush all data:
psql -c "checkpoint;"
7. Alter the archiver script to simulate the end of archiver issues:
➜ ~ cat /Users/usernamedt/command.sh
#!/bin/bash
true
8. Check that the WAL files are actually archived but not removed:
➜ ~ ls -lha /Users/usernamedt/test_archiver/pg_wal/archive_status | head
total 0
drwx------@ 48 usernamedt LD\Domain Users 1.5K Oct 17 17:44 .
drwx------@ 50 usernamedt LD\Domain Users 1.6K Oct 17 17:43 ..
-rw-------@ 1 usernamedt LD\Domain Users 0B Oct 17 17:42 000000010000000000000040.done
...
-rw-------@ 1 usernamedt LD\Domain Users 0B Oct 17 17:43 00000001000000000000006D.done
2023-10-17 18:03:44.621 +04 [71737] DEBUG: checkpoint skipped because system is idle
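For context, the "checkpoint skipped" message above comes from the early-return
path in CreateCheckPoint() (src/backend/access/transam/xlog.c). Roughly
paraphrased (simplified here for illustration; the exact code varies between
versions), the skip looks like this:

    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        /* No "important" WAL written since the last checkpoint record? */
        if (last_important_lsn == ControlFile->checkPoint)
        {
            WALInsertLockRelease();
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg_internal("checkpoint skipped because system is idle")));
            return;
        }
    }

Because the function returns at this point, it never reaches
RemoveOldXlogFiles(), so segments that have since been archived successfully
are never recycled, no matter how many checkpoint_timeout periods pass.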
Thanks,
Daniil Zakhlystov
Attachments:
0001-Cleanup-old-files-if-checkpoint-is-skipped.patch (application/octet-stream)
From a268fab163d867decf24a93def07d2ce732c182e Mon Sep 17 00:00:00 2001
From: usernamedt <usernamedt@yandex-team.com>
Date: Tue, 17 Oct 2023 17:45:46 +0400
Subject: [PATCH] Cleanup old files if checkpoint is skipped
---
src/backend/access/transam/xlog.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..8d379cdd0d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6588,6 +6588,21 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
ereport(DEBUG1,
(errmsg_internal("checkpoint skipped because system is idle")));
+
+ /*
+ * We still might want to remove some old xlog files. For example,
+ * if there was a temporary archiving problem so we could not remove
+ * them at the previous checkpoint but now we can actually do that.
+ */
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, last_important_lsn, ControlFile->checkPointCopy.ThisTimeLineID);
+
+ /*
+ * Make more log segments if needed. (Do this after recycling old log
+ * segments, since that may supply some of the needed files.)
+ */
+ PreallocXlogFiles(last_important_lsn, ControlFile->checkPointCopy.ThisTimeLineID);
return;
}
}
--
2.39.1
Hi,
I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):
test: postgresql:recovery / recovery/019_replslot_limit
# test failed
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.
I tried to test it locally and this test is timing out on my local
machine as well.
Thanks
Shlok Kumar Kyal
Hi!
Thanks for your review. I had actually missed the logic to retain the WAL required by replication slots or the wal_keep_size setting. I've attached an updated version of the patch with an additional call to KeepLogSeg(). Now it passes all the tests, at least in my fork (https://cirrus-ci.com/build/4770244019879936).
Daniil Zakhlystov
________________________________________
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Sent: Thursday, November 2, 2023 1:25 PM
To: Zakhlystov, Daniil (Nebius)
Cc: amborodin@acm.org; pgsql-hackers@lists.postgresql.org; Mokrushin, Mikhail (Nebius)
Subject: Re: Force the old transaction logs cleanup even if checkpoint is skipped
Attachments:
0001-Cleanup-old-files-if-checkpoint-is-skipped.patch (application/octet-stream)
From 5fd34ffd7e2afd3a3a8d4e655a9df46f3057ea3b Mon Sep 17 00:00:00 2001
From: usernamedt <usernamedt@yandex-team.com>
Date: Wed, 25 Oct 2023 17:09:52 +0200
Subject: [PATCH] Cleanup old files if checkpoint is skipped
Currently, if Postgres has nothing to write, it skips the checkpoint scheduled by the checkpoint_timeout setting. However, we might face a temporary archiving problem (for example, some network issue) that leads to a pile of WAL files stuck in pg_wal. After this temporary issue is gone, we would still be unable to archive them, since we effectively skip the checkpoint because we have nothing to write.
This commit adds logic to clean up the old WAL segments even if the checkpoint is skipped.
---
src/backend/access/transam/xlog.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec2..36a12dbb2ad 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6625,6 +6625,22 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
ereport(DEBUG1,
(errmsg_internal("checkpoint skipped because system is idle")));
+
+ /*
+ * We still might want to remove some old xlog files. For example,
+ * if there was a temporary archiving problem so we could not remove
+ * them at the previous checkpoint but now we can actually do that.
+ */
+ XLByteToSeg(last_important_lsn, _logSegNo, wal_segment_size);
+ KeepLogSeg(last_important_lsn, &_logSegNo);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, last_important_lsn, ControlFile->checkPointCopy.ThisTimeLineID);
+
+ /*
+ * Make more log segments if needed. (Do this after recycling old log
+ * segments, since that may supply some of the needed files.)
+ */
+ PreallocXlogFiles(last_important_lsn, ControlFile->checkPointCopy.ThisTimeLineID);
return;
}
}
--
2.39.1
On Tue, Oct 17, 2023 at 02:09:21PM +0000, Zakhlystov, Daniil (Nebius) wrote:
> I've stumbled into an interesting problem. Currently, if Postgres
> has nothing to write, it skips the checkpoint scheduled by the
> checkpoint_timeout setting. However, we might face a temporary
> archiving problem (for example, some network issue) that leads to a
> pile of WAL files stuck in pg_wal. After this temporary issue is
> gone, we would still be unable to archive them, since we effectively
> skip the checkpoint because we have nothing to write.
I am not sure to understand your last sentence here. Once the
archiver is back up, do you mean that the WAL segments that were not
previously archived are still not archived? Or do you mean that,
because of a succession of skipped checkpoints, we are just unable to
remove them from pg_wal?
> That might lead to a problem: suppose you've run out of disk space
> because of a temporary archiver failure. After this temporary
> failure is gone, Postgres would be unable to recover automatically
> and will require human attention to initiate a CHECKPOINT call.
>
> I suggest changing this behavior by trying to clean up the old WAL
> even if we skip the main checkpoint routine. I've attached the patch
> that does exactly that.
>
> What do you think?
I am not convinced that this is worth the addition in the skipped
path. If your system is idle and a set of checkpoints is skipped, the
data folder is not going to be under extra space pressure because of
database activity (okay, unlogged tables even if these would generate
some WAL for init pages), because there is nothing happening in it
with no "important" WAL generated. Note that the backend is very
unlikely going to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
More to the point: what's the origin of the disk space issues? System
logs, unlogged tables or something else? It is usually a good
practice to move logs to a different partition. At the end, it sounds
to me that removing segments more aggressively is just kicking the can
elsewhere, without taking care of the origin of the disk issues.
--
Michael
Hi!
Thanks for your review.
> I am not sure to understand your last sentence here. Once the
> archiver is back up, do you mean that the WAL segments that were not
> previously archived are still not archived? Or do you mean that,
> because of a succession of skipped checkpoints, we are just unable to
> remove them from pg_wal?
Yes, the latter is correct - we are unable to clean up the already archived WALs
due to the checkpoint being skipped.
> I am not convinced that this is worth the addition in the skipped
> path. If your system is idle and a set of checkpoints is skipped, the
> data folder is not going to be under extra space pressure because of
> database activity (okay, unlogged tables even if these would generate
> some WAL for init pages), because there is nothing happening in it
> with no "important" WAL generated. Note that the backend is very
> unlikely going to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>
> More to the point: what's the origin of the disk space issues? System
> logs, unlogged tables or something else? It is usually a good
> practice to move logs to a different partition. At the end, it sounds
> to me that removing segments more aggressively is just kicking the can
> elsewhere, without taking care of the origin of the disk issues.
This problem arises when disk space issues are caused by a temporary archiving failure.
As a result, pg_wal becomes filled with WAL segments. This situation
leads to Postgres being unable to perform any write operations, since there is no
free disk space left. Usually, cloud providers switch the cluster to a read-only mode
if there is less than 3-4% of disk space available, but this does not resolve
the problem either.
The actual problem is that after archiving starts working normally again, Postgres is
unable to free the accumulated WAL and switch back to read-write mode because the
checkpoint is being skipped, leading to a vicious circle. However, nothing prevents
Postgres from exiting such situations on its own. This patch addresses this specific
behavior, enabling Postgres to resolve such situations autonomously.
Thank you,
Daniil Zakhlystov
On Wed, Nov 08, 2023 at 12:44:09PM +0000, Zakhlystov, Daniil (Nebius) wrote:
>> I am not sure to understand your last sentence here. Once the
>> archiver is back up, do you mean that the WAL segments that were not
>> previously archived are still not archived? Or do you mean that,
>> because of a succession of skipped checkpoints, we are just unable to
>> remove them from pg_wal?
>
> Yes, the latter is correct - we are unable to clean up the already
> archived WALs due to the checkpoint being skipped.
Yes, theoretically you could face this situation if you have irregular
WAL activity, with cycles where nothing happens, and an archive command
that keeps failing while WAL is generated but works while WAL is not
generated.
>> I am not convinced that this is worth the addition in the skipped
>> path. If your system is idle and a set of checkpoints is skipped, the
>> data folder is not going to be under extra space pressure because of
>> database activity (okay, unlogged tables even if these would generate
>> some WAL for init pages), because there is nothing happening in it
>> with no "important" WAL generated. Note that the backend is very
>> unlikely going to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>>
>> More to the point: what's the origin of the disk space issues? System
>> logs, unlogged tables or something else? It is usually a good
>> practice to move logs to a different partition. At the end, it sounds
>> to me that removing segments more aggressively is just kicking the can
>> elsewhere, without taking care of the origin of the disk issues.
>
> This problem arises when disk space issues are caused by a temporary archiving failure.
> As a result, pg_wal becomes filled with WAL segments. This situation
> leads to Postgres being unable to perform any write operations, since there is no
> free disk space left. Usually, cloud providers switch the cluster to a read-only mode
> if there is less than 3-4% of disk space available, but this does not resolve
> the problem either.
>
> The actual problem is that after archiving starts working normally again, Postgres is
> unable to free the accumulated WAL and switch back to read-write mode because the
> checkpoint is being skipped, leading to a vicious circle. However, nothing prevents
> Postgres from exiting such situations on its own. This patch addresses this specific
> behavior, enabling Postgres to resolve such situations autonomously.
Yep, but it does not really solve your disk space issues in a reliable
way.
I am not really convinced that this is worth complicating the skipped
path for this goal. In my experience, I've seen complaints where WAL
archiving bloat was coming from the archive command not being able to keep
up with the amount of WAL generated by the backend, particularly because the
command invocation was taking longer than it takes to generate a new
segment. Even if there is a lull in activity on the server, if too
much WAL has been generated it may not be enough to catch up, depending
on the number of segments that need to be processed. Others are free
to chime in with extra opinions, of course.
While on it, I think that your patch would cause incorrect and early
removal of segments. It computes the name of the last segment to
remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
that it ignores any WAL retention required by replication slots or
wal_keep_size. And this causes the calculation of an incorrect segno
horizon.
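To make that concrete, here is a tiny standalone sketch (plain C, not server
code; the LSN values, the slot horizon and the clamping rule are invented for
illustration) of how a cutoff derived from last_important_lsn alone can
overshoot what a replication slot still needs, and how a KeepLogSeg()-style
clamp prevents that:

    /*
     * Standalone illustration only - not PostgreSQL code.  In the server
     * the clamping is done by KeepLogSeg() before RemoveOldXlogFiles().
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAL_SEGMENT_SIZE (UINT64_C(16) * 1024 * 1024)   /* default 16 MB */

    /* Roughly what the XLByteToSeg() macro does: byte position -> segment no. */
    static uint64_t
    lsn_to_segno(uint64_t lsn)
    {
        return lsn / WAL_SEGMENT_SIZE;
    }

    int
    main(void)
    {
        uint64_t last_important_lsn = UINT64_C(0x70000000); /* hypothetical */
        uint64_t slot_restart_lsn = UINT64_C(0x45000000);   /* hypothetical slot horizon */

        /* Cutoff derived from last_important_lsn alone. */
        uint64_t cutoff = lsn_to_segno(last_important_lsn) - 1;
        printf("unclamped: remove segments up to and including %" PRIu64 "\n", cutoff);

        /* KeepLogSeg()-style clamp: keep everything the slot still needs. */
        uint64_t keep = lsn_to_segno(slot_restart_lsn);
        if (cutoff >= keep)
            cutoff = keep - 1;
        printf("clamped:   remove segments up to and including %" PRIu64 "\n", cutoff);

        return 0;
    }

With the unclamped cutoff, everything up to segment 111 would be removed,
including segment 69 that the hypothetical slot still needs; the clamp stops
the removal at segment 68.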
--
Michael
Hello,
On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
> I am not really convinced that this is worth complicating the skipped
> path for this goal. In my experience, I've seen complaints where WAL
> archiving bloat was coming from the archive command not being able to keep
> up with the amount of WAL generated by the backend, particularly because the
> command invocation was taking longer than it takes to generate a new
> segment. Even if there is a lull in activity on the server, if too
> much WAL has been generated it may not be enough to catch up, depending
> on the number of segments that need to be processed. Others are free
> to chime in with extra opinions, of course.
I agree that there might be multiple reasons for pg_wal bloat. Please note that
I am not addressing the WAL archiving issue at all. My proposal is to add a
small improvement to the WAL cleanup routine so that WAL segments that have
already been archived successfully can be removed to free disk space.
Yes, it might not be a common case, but it is a fairly realistic one. It occurred
multiple times in our production when we had temporary issues with archiving.
This small complication of the skipped path will help Postgres return to a normal
operational state without any human operator / external control routine intervention.
On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
> While on it, I think that your patch would cause incorrect and early
> removal of segments. It computes the name of the last segment to
> remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
> that it ignores any WAL retention required by replication slots or
> wal_keep_size. And this causes the calculation of an incorrect segno
> horizon.
Please check the latest patch version; I believe this has already been fixed there.
Thanks,
Daniil Zakhlystov
Hi,
On 2023-11-09 11:50:10 +0000, Zakhlystov, Daniil (Nebius) wrote:
> On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
>> I am not really convinced that this is worth complicating the skipped
>> path for this goal. In my experience, I've seen complaints where WAL
>> archiving bloat was coming from the archive command not being able to keep
>> up with the amount of WAL generated by the backend, particularly because the
>> command invocation was taking longer than it takes to generate a new
>> segment. Even if there is a lull in activity on the server, if too
>> much WAL has been generated it may not be enough to catch up, depending
>> on the number of segments that need to be processed. Others are free
>> to chime in with extra opinions, of course.
>
> I agree that there might be multiple reasons for pg_wal bloat. Please note that
> I am not addressing the WAL archiving issue at all. My proposal is to add a
> small improvement to the WAL cleanup routine so that WAL segments that have
> already been archived successfully can be removed to free disk space.
>
> Yes, it might not be a common case, but it is a fairly realistic one. It occurred
> multiple times in our production when we had temporary issues with archiving.
> This small complication of the skipped path will help Postgres return to a normal
> operational state without any human operator / external control routine intervention.
I agree that the scenario is worth addressing - it's quite a nasty situation.
But I'm not sure this is the way to address it. If a checkpoint does have to
happen, we might not get to the point of removing the old segments, because we
might fail to emit the WAL record due to running out of space. And if that
doesn't happen - do we really want to wait till a checkpoint finishes to free
up space?
What if we instead made the archiver delete WAL files after archiving, if they're
old enough? Some care would be needed to avoid the checkpointer and archiver
trampling on each other, but that doesn't seem too hard.
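A very rough, standalone sketch of that idea (hypothetical throughout: none of
the helpers below exist in the tree, they merely stand in for running
archive_command, for the KeepLogSeg()-clamped cutoff the checkpointer already
computes, and for unlinking a segment together with its archive_status files):

    /* Hypothetical, standalone sketch - NOT PostgreSQL code. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogSegNo;

    static bool
    archive_one_segment(XLogSegNo segno)    /* stand-in for archive_command */
    {
        printf("archived segment %llu\n", (unsigned long long) segno);
        return true;
    }

    static XLogSegNo
    removal_horizon(void)       /* stand-in for the KeepLogSeg()-clamped cutoff */
    {
        return 0x60;
    }

    static void
    remove_segment(XLogSegNo segno)     /* stand-in for unlink + status cleanup */
    {
        printf("removed segment %llu\n", (unsigned long long) segno);
    }

    /*
     * Archiver-side pass: archive each pending segment and drop it right
     * away if it is already below the removal horizon, instead of waiting
     * for the next non-skipped checkpoint.  A real implementation would
     * need an interlock so the checkpointer and the archiver never try to
     * remove the same file twice.
     */
    int
    main(void)
    {
        for (XLogSegNo segno = 0x40; segno <= 0x6D; segno++)
        {
            if (!archive_one_segment(segno))
                continue;               /* keep .ready, retry later */
            if (segno < removal_horizon())
                remove_segment(segno);
        }
        return 0;
    }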
Greetings,
Andres Freund