BUG #17903: There is a bug in the KeepLogSeg()

Started by PG Bug reporting formalmost 3 years ago13 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17903
Logged by: xu xingwang
Email address: xu.xw2008@163.com
PostgreSQL version: 13.3
Operating system: openEuler
Description:

Hi,

I found that KeepLogSeg() has a piece of code that is not correctly.

segno may be larger than currSegNo, since the slot_keep_segs variable is of
type "uint64", in this case the code "if (currSegNo - segno >
slot_keep_segs)" is incorrect.

"if (currSegNo - segno < keep_segs)" is also the same.

Checkpoint calls the KeepLogSeg function, and there are many operations
between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
pg_control file, so segno may be larger than currSegNo.

KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
XLogSegNo currSegNo;
XLogSegNo segno;
XLogRecPtr keep;

XLByteToSeg(recptr, currSegNo, wal_segment_size);
segno = currSegNo;

/*
* Calculate how many segments are kept by slots first, adjusting for
* max_slot_wal_keep_size.
*/
keep = XLogGetReplicationSlotMinimumLSN();
if (keep != InvalidXLogRecPtr)
{
XLByteToSeg(keep, segno, wal_segment_size);

/* Cap by max_slot_wal_keep_size ... */
if (max_slot_wal_keep_size_mb >= 0)
{
uint64 slot_keep_segs;

slot_keep_segs =
ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

if (currSegNo - segno > slot_keep_segs)
segno = currSegNo - slot_keep_segs;
}
}

/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
uint64 keep_segs;

keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
if (currSegNo - segno < keep_segs)
{
/* avoid underflow, don't go below 1 */
if (currSegNo <= keep_segs)
segno = 1;
else
segno = currSegNo - keep_segs;
}
}

regards.

--
xu xingwang

#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17903: There is a bug in the KeepLogSeg()

At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in

I found that KeepLogSeg() has a piece of code that is not correctly.

segno may be larger than currSegNo, since the slot_keep_segs variable is of
type "uint64", in this case the code "if (currSegNo - segno >
slot_keep_segs)" is incorrect.

"if (currSegNo - segno < keep_segs)" is also the same.

Checkpoint calls the KeepLogSeg function, and there are many operations
between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
pg_control file, so segno may be larger than currSegNo.

Correct. Thanks for the report.

If checkpointer somehow takes a long time between inserting a
checkpoint record and removing WAL files, while replication advances a
certain distnace, it can actually happen. Although that behavior
doesn't directly affect max_slot_wal_keep_size, it does disrupt the
effect of wal_keep_size.

The thinko was that we incorrectly assumed the slot minimum LSN can't
be larger than the checkpoint record LSN. We don't need to consider
max_slot_wal_keep_size if the slot minimum LSN is already larger than
currSegNo.

The attached fix works. However, I can't come up with a reasonable
testing script.

This dates back to 13, where max_slot_wal_keep_size was introduced.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Fix-incorrect-calculation-regarding-max_slot_wal_kee.patchtext/x-patch; charset=us-asciiDownload+5-3
#3Junwang Zhao
zhjwpku@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: BUG #17903: There is a bug in the KeepLogSeg()

If `segno` can be larger than `currSegNo`, your patch seems to miss
the following branch,
are you missing this for some particular reason?

```
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
uint64 keep_segs;

keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
if (currSegNo - segno < keep_segs) <<<< see here
{
/* avoid underflow, don't go below 1 */
if (currSegNo <= keep_segs)
segno = 1;
else
segno = currSegNo - keep_segs;
}
}
```

On Thu, Apr 20, 2023 at 11:04 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in

I found that KeepLogSeg() has a piece of code that is not correctly.

segno may be larger than currSegNo, since the slot_keep_segs variable is of
type "uint64", in this case the code "if (currSegNo - segno >
slot_keep_segs)" is incorrect.

"if (currSegNo - segno < keep_segs)" is also the same.

Checkpoint calls the KeepLogSeg function, and there are many operations
between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
pg_control file, so segno may be larger than currSegNo.

Correct. Thanks for the report.

If checkpointer somehow takes a long time between inserting a
checkpoint record and removing WAL files, while replication advances a
certain distnace, it can actually happen. Although that behavior
doesn't directly affect max_slot_wal_keep_size, it does disrupt the
effect of wal_keep_size.

The thinko was that we incorrectly assumed the slot minimum LSN can't
be larger than the checkpoint record LSN. We don't need to consider
max_slot_wal_keep_size if the slot minimum LSN is already larger than
currSegNo.

The attached fix works. However, I can't come up with a reasonable
testing script.

This dates back to 13, where max_slot_wal_keep_size was introduced.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Regards
Junwang Zhao

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Junwang Zhao (#3)
Re: BUG #17903: There is a bug in the KeepLogSeg()

At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku@gmail.com> wrote in

If `segno` can be larger than `currSegNo`, your patch seems to miss
the following branch,
are you missing this for some particular reason?

Oops! Sorry for the mistake and thanks for pointing it out.

I should have kept segno within the reasonable range.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-Fix-incorrect-calculation-regarding-max_slot_wal_.patchtext/x-patch; charset=us-asciiDownload+8-1
#5Junwang Zhao
zhjwpku@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: BUG #17903: There is a bug in the KeepLogSeg()

This patch looks reasonable.

+1

On Thu, Apr 20, 2023 at 4:58 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku@gmail.com> wrote in

If `segno` can be larger than `currSegNo`, your patch seems to miss
the following branch,
are you missing this for some particular reason?

Oops! Sorry for the mistake and thanks for pointing it out.

I should have kept segno within the reasonable range.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Regards
Junwang Zhao

#6Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: BUG #17903: There is a bug in the KeepLogSeg()

On Thu, Apr 20, 2023 at 05:58:14PM +0900, Kyotaro Horiguchi wrote:

+		/*
+		 * Slot minimum LSN could be greater than recptr. In such cases, no
+		 * segments are protected by slots but we still need to keep segno in a
+		 * reasonable range for subsequent calculations to avoid underflow.
+		 */
+		if (segno > currSegNo)
+			segno = currSegNo;
+

I wonder if it'd be better to instead change

if (currSegNo - segno > slot_keep_segs)
to
if (currSegNo > segno + slot_keep_segs)

and

if (currSegNo - segno < keep_segs)
to
if (currSegNo < segNo + keep_segs)

If segno > currSegNo, the first conditional would be false as expected, and
the second would be true as expected. We could also use
pg_sub_u64_overflow() to detect underflow, but that might be excessive in
this case.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#6)
Re: BUG #17903: There is a bug in the KeepLogSeg()

At Thu, 20 Apr 2023 15:02:42 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

I wonder if it'd be better to instead change

if (currSegNo - segno > slot_keep_segs)
to
if (currSegNo > segno + slot_keep_segs)

and

if (currSegNo - segno < keep_segs)
to
if (currSegNo < segNo + keep_segs)

If segno > currSegNo, the first conditional would be false as expected, and
the second would be true as expected. We could also use
pg_sub_u64_overflow() to detect underflow, but that might be excessive in
this case.

From what I understand, the XLogSegNo calculations are designed
without considering the actual value range. Basically, it assumes that
(XLogSegNo + <any positive int>) can overflow. If we take the actual
value range into account, we can make that change.

The choice lies on whether we assume the actual value range or not.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: BUG #17903: There is a bug in the KeepLogSeg()

On Fri, Apr 21, 2023 at 10:42:31AM +0900, Kyotaro Horiguchi wrote:

From what I understand, the XLogSegNo calculations are designed
without considering the actual value range. Basically, it assumes that
(XLogSegNo + <any positive int>) can overflow. If we take the actual
value range into account, we can make that change.

The choice lies on whether we assume the actual value range or not.

Yeah, after staring at this some more, I think your proposed fix is the way
to go. Alternatively, we could adjust the conditional for the
max_slot_wal_keep_size block to

if (keep != InvalidXLogRecPtr && keep < recptr)

but AFAICT that yields the exact same behavior.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#9Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#8)
Re: BUG #17903: There is a bug in the KeepLogSeg()

Here is a new version of the patch. It is fundamentally the same as v2,
but I've adjusted the comment and commit message a bit. Barring
objections, I am planning to commit this (and back-patch to v13) in the
near future.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Prevent-underflow-in-KeepLogSeg.patchtext/x-diff; charset=us-asciiDownload+9-1
#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#9)
Re: BUG #17903: There is a bug in the KeepLogSeg()

At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

Here is a new version of the patch. It is fundamentally the same as v2,
but I've adjusted the comment and commit message a bit. Barring
objections, I am planning to commit this (and back-patch to v13) in the
near future.

When determining the oldest segment that must be kept for
replication slots, KeepLogSeg() might calculate a segment number
ahead of the one for the LSN given to the function. This causes

Maybe the KeepLogSeg() is a mistake of
XLogGetReplicationSlotMinimumLSN()?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#10)
Re: BUG #17903: There is a bug in the KeepLogSeg()

On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:

At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

When determining the oldest segment that must be kept for
replication slots, KeepLogSeg() might calculate a segment number
ahead of the one for the LSN given to the function. This causes

Maybe the KeepLogSeg() is a mistake of
XLogGetReplicationSlotMinimumLSN()?

I adjusted the commit message to call out that function explicitly. Also,
I decided to go with the "keep < recptr" approach since there's no reason
to do anything in that block otherwise.

Thanks!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#12Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#11)
Re: BUG #17903: There is a bug in the KeepLogSeg()

On Thu, Apr 27, 2023 at 02:47:46PM -0700, Nathan Bossart wrote:

I adjusted the commit message to call out that function explicitly. Also,
I decided to go with the "keep < recptr" approach since there's no reason
to do anything in that block otherwise.

b726236 exists in the tree, but it seems like the message of
pgsql-committers is held on moderation?
--
Michael

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#11)
Re: BUG #17903: There is a bug in the KeepLogSeg()

At Thu, 27 Apr 2023 14:47:46 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:

At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

When determining the oldest segment that must be kept for
replication slots, KeepLogSeg() might calculate a segment number
ahead of the one for the LSN given to the function. This causes

Maybe the KeepLogSeg() is a mistake of
XLogGetReplicationSlotMinimumLSN()?

I adjusted the commit message to call out that function explicitly. Also,
I decided to go with the "keep < recptr" approach since there's no reason
to do anything in that block otherwise.

That works, too. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center