Fix calculations on WAL recycling.
I'll register this to the next commitfest.
/messages/by-id/20180719.125117.155470938.horiguchi.kyotaro@lab.ntt.co.jp
While considering this, I found a bug in 4b0d28de06, which
removed prior checkpoint from control file. It actually trims the
segments before the last checkpoint's redo segment but recycling
is still considered based on the *prevous* checkpoint. As the
result min_wal_size doesn't work as told. Specifically, setting
min/max_wal_size to 48MB and advance four or more segments then
two checkpoints leaves just one segment, which is less than
min_wal_size.The attached patch fixes that. One arguable point on this would
be the removal of the behavior when RemoveXLogFile(name,
InvalidXLogRecPtr, ..).The only place calling the function with the parameter is
timeline switching. Previously unconditionally 10 segments are
recycled after switchpoint but the reason for the behavior is we
didn't have the information on previous checkpoint at hand at the
time. But now we can use the timeline switch point as the
approximate of the last checkpoint's redo point and this allows
us to use min/max_wal_size properly at the time.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0001-Fix-calculation-base-of-WAL-recycling.patchtext/x-patch; charset=us-asciiDownload+17-21
On Mon, Jul 23, 2018 at 01:57:48PM +0900, Kyotaro HORIGUCHI wrote:
I'll register this to the next commitfest.
/messages/by-id/20180719.125117.155470938.horiguchi.kyotaro@lab.ntt.co.jp
This is an open item for v11.
While considering this, I found a bug in 4b0d28de06, which
removed prior checkpoint from control file. It actually trims the
segments before the last checkpoint's redo segment but recycling
is still considered based on the *prevous* checkpoint. As the
result min_wal_size doesn't work as told. Specifically, setting
min/max_wal_size to 48MB and advance four or more segments then
two checkpoints leaves just one segment, which is less than
min_wal_size.The attached patch fixes that. One arguable point on this would
be the removal of the behavior when RemoveXLogFile(name,
InvalidXLogRecPtr, ..).The only place calling the function with the parameter is
timeline switching. Previously unconditionally 10 segments are
recycled after switchpoint but the reason for the behavior is we
didn't have the information on previous checkpoint at hand at the
time. But now we can use the timeline switch point as the
approximate of the last checkpoint's redo point and this allows
us to use min/max_wal_size properly at the time.
I have been looking at that, and tested with this simple scenario:
create table aa (a int);
Then just repeat the following SQLs:
insert into aa values (1);
select pg_switch_wal();
checkpoint;
select pg_walfile_name(pg_current_wal_lsn());
By doing so, you would notice that the oldest WAL segment does not get
recycled after the checkpoint, while it should as the redo pointer is
always checkpoint generated. What happens is that this oldest segment
gets recycled every two checkpoints.
Then I had a look at the proposed patch, with a couple of comments.
- if (PriorRedoPtr == InvalidXLogRecPtr)
- recycleSegNo = endlogSegNo + 10;
- else
- recycleSegNo = XLOGfileslop(PriorRedoPtr);
+ recycleSegNo = XLOGfileslop(RedoRecPtr);
I think that this is a new behavior, and should not be changed, see
point 3 below.
In CreateCheckPoint(), the removal of past WAL segments is always going
to happen as RedoRecPtr is never InvalidXLogRecPtr, so the recycling
should happen also when PriorRedoPtr is InvalidXLogRecPtr, no?
/* Trim from the last checkpoint, not the last - 1 */
This comment could be adjusted, let's remove "not the last - 1" .
The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
this still uses PriorRedoPtr so the bug is not fixed for standbys. The
tweaks for ThisTimeLineID also need to be out of the loop where
PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
should be kept.
Finally, in summary, this patch is doing actually three things:
1) Rename a couple of variables which refer incorrectly to the prior
checkpoint so as they refer to the last checkpoint generated.
2) Make sure that WAL recycling/removal happens based on the last
checkpoint generated, which is the one just generated when past WAL
segments are cleaned up as post-operation actions.
3) Enforce the recycling point to be the switch point instead of
arbitrarily recycle 10 segments, which is what b2a5545b has introduced.
Recycling at exactly the switch point is wrong, as the redo LSN of the
previous checkpoint may not be at the same segment when a timeline has
switched, so you would finish with recycling segments which are still
needed if an instance needs be crash-recovered post-promotion without
a completed post-recovery checkpoint. In short this is dangerous.
I'll let Heikki comment on that, but I think that you should fetch the
last redo LSN instead, still you need to be worried about checkpoints
running in parallel of the startup process, so I would stick with the
current logic.
1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is
not.
--
Michael
At Mon, 23 Jul 2018 15:59:16 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180723065916.GI2854@paquier.xyz>
On Mon, Jul 23, 2018 at 01:57:48PM +0900, Kyotaro HORIGUCHI wrote:
I'll register this to the next commitfest.
/messages/by-id/20180719.125117.155470938.horiguchi.kyotaro@lab.ntt.co.jp
This is an open item for v11.
Mmm. Thanks. I wrongly thought this was v10 item. Removed this
from the next CF.
Thank you for taking a look.
While considering this, I found a bug in 4b0d28de06, which
removed prior checkpoint from control file. It actually trims the
segments before the last checkpoint's redo segment but recycling
is still considered based on the *prevous* checkpoint. As the
result min_wal_size doesn't work as told. Specifically, setting
min/max_wal_size to 48MB and advance four or more segments then
two checkpoints leaves just one segment, which is less than
min_wal_size.The attached patch fixes that. One arguable point on this would
be the removal of the behavior when RemoveXLogFile(name,
InvalidXLogRecPtr, ..).The only place calling the function with the parameter is
timeline switching. Previously unconditionally 10 segments are
recycled after switchpoint but the reason for the behavior is we
didn't have the information on previous checkpoint at hand at the
time. But now we can use the timeline switch point as the
approximate of the last checkpoint's redo point and this allows
us to use min/max_wal_size properly at the time.I have been looking at that, and tested with this simple scenario:
create table aa (a int);Then just repeat the following SQLs:
insert into aa values (1);
select pg_switch_wal();
checkpoint;
select pg_walfile_name(pg_current_wal_lsn());By doing so, you would notice that the oldest WAL segment does not get
recycled after the checkpoint, while it should as the redo pointer is
always checkpoint generated. What happens is that this oldest segment
gets recycled every two checkpoints.
(I'm not sure I'm getting it correctly..) I see the old segments
recycled. When I see pg_switch_wal() returns 0/12..,
pg_walfile_name is ...13 and segment files for 13 and 14 are
found in pg_wal directory. That is, seg 12 was recycled as seg
14. log_checkpoint=on shows every checkpoint recycles 1 segment
in the case.
Then I had a look at the proposed patch, with a couple of comments.
- if (PriorRedoPtr == InvalidXLogRecPtr) - recycleSegNo = endlogSegNo + 10; - else - recycleSegNo = XLOGfileslop(PriorRedoPtr); + recycleSegNo = XLOGfileslop(RedoRecPtr); I think that this is a new behavior, and should not be changed, see point 3 below.In CreateCheckPoint(), the removal of past WAL segments is always going
to happen as RedoRecPtr is never InvalidXLogRecPtr, so the recycling
should happen also when PriorRedoPtr is InvalidXLogRecPtr, no?
Yes. The reason for the change was the change of
RemoveNonParentXlogFiles that I made and it is irrelevant to this
bug fix (and rather breaking as you pointed..). So, reverted it.
/* Trim from the last checkpoint, not the last - 1 */
This comment could be adjusted, let's remove "not the last - 1" .
Oops! Thanks. The comment has finally vanished and melded into
another comment just above.
| * Delete old log files not required by the last checkpoint and recycle
| * them
The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
this still uses PriorRedoPtr so the bug is not fixed for standbys. The
tweaks for ThisTimeLineID also need to be out of the loop where
PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
should be kept.
Agreed. While I reconsidered this, I noticed that the estimated
checkpoint distance is 0 when PriorRedoPtr is invalid. This means
that the first checkpoint/restartpoint tries to reduce WAL
segments to min_wal_size. However, it happens only at initdb time
and makes no substantial behavior change so I made the change
ignoring the difference.
Finally, in summary, this patch is doing actually three things:
1) Rename a couple of variables which refer incorrectly to the prior
checkpoint so as they refer to the last checkpoint generated.
2) Make sure that WAL recycling/removal happens based on the last
checkpoint generated, which is the one just generated when past WAL
segments are cleaned up as post-operation actions.
3) Enforce the recycling point to be the switch point instead of
arbitrarily recycle 10 segments, which is what b2a5545b has introduced.
Recycling at exactly the switch point is wrong, as the redo LSN of the
previous checkpoint may not be at the same segment when a timeline has
switched, so you would finish with recycling segments which are still
needed if an instance needs be crash-recovered post-promotion without
a completed post-recovery checkpoint. In short this is dangerous.
I'll let Heikki comment on that, but I think that you should fetch the
last redo LSN instead, still you need to be worried about checkpoints
running in parallel of the startup process, so I would stick with the
current logic.
Thank you for the detail. I was coufused a bit there. I agree to
preserve the logic, too.
1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is
not.
Thank you for the comments and suggestions. After all, I did the
following things in the attached patch.
- Reverted the change on timeline switching. (Removed the (3))
- Fixed CreateRestartPoint to do the same thing with CreateCheckPoint.
- Both CreateRestart/CheckPoint now tries trimming of WAL
segments even for the first time.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0001-Fix-calculation-base-of-WAL-recycling.patchtext/x-patch; charset=us-asciiDownload+75-85
On Mon, Jul 23, 2018 at 07:55:57PM +0900, Kyotaro HORIGUCHI wrote:
At Mon, 23 Jul 2018 15:59:16 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180723065916.GI2854@paquier.xyz>
This is an open item for v11.
Mmm. Thanks. I wrongly thought this was v10 item. Removed this
from the next CF.
Thanks for updating the CF app.
By doing so, you would notice that the oldest WAL segment does not get
recycled after the checkpoint, while it should as the redo pointer is
always checkpoint generated. What happens is that this oldest segment
gets recycled every two checkpoints.(I'm not sure I'm getting it correctly..) I see the old segments
recycled. When I see pg_switch_wal() returns 0/12..,
pg_walfile_name is ...13 and segment files for 13 and 14 are
found in pg_wal directory. That is, seg 12 was recycled as seg
14. log_checkpoint=on shows every checkpoint recycles 1 segment
in the case.
With your patch applied I see one segment recycled after each
checkpoint, which is correct. On HEAD, you would see no segments,
followed by 2 segments recycled. But I also see sometimes one segment
recycled.
The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
this still uses PriorRedoPtr so the bug is not fixed for standbys. The
tweaks for ThisTimeLineID also need to be out of the loop where
PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
should be kept.Agreed. While I reconsidered this, I noticed that the estimated
checkpoint distance is 0 when PriorRedoPtr is invalid. This means
that the first checkpoint/restartpoint tries to reduce WAL
segments to min_wal_size. However, it happens only at initdb time
and makes no substantial behavior change so I made the change
ignoring the difference.
Yes, same analysis here.
1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is
not.Thank you for the comments and suggestions. After all, I did the
following things in the attached patch.- Reverted the change on timeline switching. (Removed the (3))
- Fixed CreateRestartPoint to do the same thing with CreateCheckPoint.
- Both CreateRestart/CheckPoint now tries trimming of WAL
segments even for the first time.
Thanks, pushed after some comment tweaks and fixing the variable names
at the top of xlog.c for the static declarations. Perhaps we can do
more refactoring here by moving all the segment calculation logic
directly in RemoveOldXlogFiles, but that makes the end LSN calculation a
bit blurry so I kept things as you proposed in version 3 of the patch.
--
Michael
At Tue, 24 Jul 2018 10:41:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180724014118.GA4035@paquier.xyz>
On Mon, Jul 23, 2018 at 07:55:57PM +0900, Kyotaro HORIGUCHI wrote:
With your patch applied I see one segment recycled after each
checkpoint, which is correct. On HEAD, you would see no segments,
followed by 2 segments recycled. But I also see sometimes one segment
recycled.
Anyway good to hear that.
The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
this still uses PriorRedoPtr so the bug is not fixed for standbys. The
tweaks for ThisTimeLineID also need to be out of the loop where
PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
should be kept.Agreed. While I reconsidered this, I noticed that the estimated
checkpoint distance is 0 when PriorRedoPtr is invalid. This means
that the first checkpoint/restartpoint tries to reduce WAL
segments to min_wal_size. However, it happens only at initdb time
and makes no substantial behavior change so I made the change
ignoring the difference.Yes, same analysis here.
1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is
not.Thank you for the comments and suggestions. After all, I did the
following things in the attached patch.- Reverted the change on timeline switching. (Removed the (3))
- Fixed CreateRestartPoint to do the same thing with CreateCheckPoint.
- Both CreateRestart/CheckPoint now tries trimming of WAL
segments even for the first time.Thanks, pushed after some comment tweaks and fixing the variable names
at the top of xlog.c for the static declarations. Perhaps we can do
more refactoring here by moving all the segment calculation logic
directly in RemoveOldXlogFiles, but that makes the end LSN calculation a
bit blurry so I kept things as you proposed in version 3 of the patch.
Thank you for committing this.
I feel that XLOGfileslop() can be a function but I regret that I
didn't move it closer to RemoveOldXLogFiles a bit..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center