Do away with zero-padding assumption before WALRead()

Started by Bharath Rupireddyalmost 2 years ago9 messages

bharath.rupireddyforpostgres@gmail.com

almost 2 years ago

1 attachment(s)

Hi,

I noticed an assumption [1]/* * Even though we just determined how much of the page can be validly read * as 'count', read the whole page anyway. It's guaranteed to be * zero-padded up to the page boundary if it's incomplete. */ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &errinfo)) at WALRead() call sites expecting the
flushed WAL page to be zero-padded after the flush LSN. I think this
can't always be true as the WAL can get flushed after determining the
flush LSN before reading it from the WAL file using WALRead(). I've
hacked the code up a bit to check if that's true -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call sites isn't quite right.

I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
despite knowing exactly how much to read. Is it to tell the OS to
explicitly fetch the whole page from the disk? If yes, the OS will do
that anyway because the page transfers from disk to OS page cache are
always in terms of disk block sizes, no?

Although, there's no immediate problem with it right now, the
assumption is going to be incorrect when reading WAL from WAL buffers
using WALReadFromBuffers -
/messages/by-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com.

If we have no reason, can the WALRead() callers just read how much
they want like walsender for physical replication? Attached a patch
for the change.

Thoughts?

[1]: /* * Even though we just determined how much of the page can be validly read * as 'count', read the whole page anyway. It's guaranteed to be * zero-padded up to the page boundary if it's incomplete. */ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &errinfo))
/*
* Even though we just determined how much of the page can be validly read
* as 'count', read the whole page anyway. It's guaranteed to be
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
&errinfo))

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Do-away-with-zero-padding-assumption-before-WALRe.patchapplication/octet-stream; name=v1-0001-Do-away-with-zero-padding-assumption-before-WALRe.patchDownload

From 35777aa91429c5858a740afc75b75f290c5d304e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 13 Feb 2024 06:12:22 +0000
Subject: [PATCH v1] Do away with zero-padding assumption before WALRead

---
 src/backend/access/transam/xlogutils.c | 7 +------
 src/backend/postmaster/walsummarizer.c | 7 +------
 src/backend/replication/walsender.c    | 2 +-
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 945f1f790d..ad93035d50 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1007,12 +1007,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
-	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
-	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
+	if (!WALRead(state, cur_page, targetPagePtr, count, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9b883c21ca..afed15a257 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1318,12 +1318,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 		}
 	}
 
-	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
-	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+	if (!WALRead(state, cur_page, targetPagePtr, count,
 				 private_data->tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 146826d5db..945084eb01 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1099,7 +1099,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 count,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Bharath Rupireddy (#1)

Re: Do away with zero-padding assumption before WALRead()

Hi,

On Tue, 13 Feb 2024 at 09:17, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Hi,

I noticed an assumption [1] at WALRead() call sites expecting the
flushed WAL page to be zero-padded after the flush LSN. I think this
can't always be true as the WAL can get flushed after determining the
flush LSN before reading it from the WAL file using WALRead(). I've
hacked the code up a bit to check if that's true -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call sites isn't quite right.

I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
despite knowing exactly how much to read. Is it to tell the OS to
explicitly fetch the whole page from the disk? If yes, the OS will do
that anyway because the page transfers from disk to OS page cache are
always in terms of disk block sizes, no?

I am curious about the same. The page size and disk block size could
be different, so the reason could be explicitly fetching the whole
page from the disk as you said. Is this the reason or are there any
other benefits of always reading XLOG_BLCKSZ instead of reading the
sufficient part? I tried to search in older threads and code comments
but I could not find an explanation.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Bharath Rupireddy (#1)

Re: Do away with zero-padding assumption before WALRead()

At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

Hi,

I noticed an assumption [1] at WALRead() call sites expecting the
flushed WAL page to be zero-padded after the flush LSN. I think this
can't always be true as the WAL can get flushed after determining the
flush LSN before reading it from the WAL file using WALRead(). I've
hacked the code up a bit to check if that's true -

Good catch! The comment seems wrong also to me. The subsequent bytes
can be written simultaneously, and it's very normal that there are
unflushed bytes are in OS's page buffer. The objective of the comment
seems to be to declare that there's no need to clear out the remaining
bytes, here. I agree that it's not a problem for now. However, I think
we need two fixes here.

1. It's useless to copy the whole page regardless of the 'count'. It's
enough to copy only up to the 'count'. The patch looks good in this
regard.

2. Maybe we need a comment that states the page_read callback
functions leave garbage bytes beyond the returned count, due to
partial copying without clearing the unused portion.

I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
despite knowing exactly how much to read. Is it to tell the OS to
explicitly fetch the whole page from the disk? If yes, the OS will do
that anyway because the page transfers from disk to OS page cache are
always in terms of disk block sizes, no?

If I understand your question correctly, I guess that the whole-page
copy was expected to clear out the remaining bytes, as I mentioned
earlier.

Although, there's no immediate problem with it right now, the
assumption is going to be incorrect when reading WAL from WAL buffers
using WALReadFromBuffers -
/messages/by-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com.

If we have no reason, can the WALRead() callers just read how much
they want like walsender for physical replication? Attached a patch
for the change.

Thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#2)

Re: Do away with zero-padding assumption before WALRead()

On Thu, Feb 15, 2024 at 3:49 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
despite knowing exactly how much to read. Is it to tell the OS to
explicitly fetch the whole page from the disk? If yes, the OS will do
that anyway because the page transfers from disk to OS page cache are
always in terms of disk block sizes, no?

I am curious about the same. The page size and disk block size could
be different,

Yes, they can be different, but.... (see below)

so the reason could be explicitly fetching the whole
page from the disk as you said.

Upon OS page cache miss, the whole page (of disk block size) gets
fetched from disk even if we just read 'count' bytes (< disk block
size), no? This is my understanding about page transfers between disk
and OS page cache.

Is this the reason or are there any
other benefits of always reading XLOG_BLCKSZ instead of reading the
sufficient part? I tried to search in older threads and code comments
but I could not find an explanation.

FWIW, walsender for physical replication will just read as much as it
wants to read which can range from WAL of size < XLOG_BLCKSZ to
MAX_SEND_SIZE (XLOG_BLCKSZ * 16). I mean, it does not read the whole
page of bytes XLOG_BLCKSZ when it wants to read < XLOG_BLCKSZ.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

almost 2 years ago

In reply to: Kyotaro Horiguchi (#3)

Re: Do away with zero-padding assumption before WALRead()

On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Good catch! The comment seems wrong also to me. The subsequent bytes
can be written simultaneously, and it's very normal that there are
unflushed bytes are in OS's page buffer. The objective of the comment
seems to be to declare that there's no need to clear out the remaining
bytes, here. I agree that it's not a problem for now. However, I think
we need two fixes here.

1. It's useless to copy the whole page regardless of the 'count'. It's
enough to copy only up to the 'count'. The patch looks good in this
regard.

Yes, it's not needed to copy the whole page. Per my understanding
about page transfers between disk and OS page cache - upon OS page
cache miss, the whole page (of disk block size) gets fetched from disk
even if we just read 'count' bytes (< disk block size).

2. Maybe we need a comment that states the page_read callback
functions leave garbage bytes beyond the returned count, due to
partial copying without clearing the unused portion.

Isn't the comment around page_read callback at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
enough?

"The callback shall return the number of bytes read (never more than
XLOG_BLCKSZ), or -1 on failure."

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Bharath Rupireddy (#5)

Re: Do away with zero-padding assumption before WALRead()

At Fri, 16 Feb 2024 19:50:00 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

1. It's useless to copy the whole page regardless of the 'count'. It's
enough to copy only up to the 'count'. The patch looks good in this
regard.

Yes, it's not needed to copy the whole page. Per my understanding
about page transfers between disk and OS page cache - upon OS page
cache miss, the whole page (of disk block size) gets fetched from disk
even if we just read 'count' bytes (< disk block size).

Right, but with a possibly-different block size. Anyway that behavior
doesn't affect the result of this change. (Could affect performance
hereafter if it were not the case, though..)

2. Maybe we need a comment that states the page_read callback
functions leave garbage bytes beyond the returned count, due to
partial copying without clearing the unused portion.

Isn't the comment around page_read callback at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
enough?

"The callback shall return the number of bytes read (never more than
XLOG_BLCKSZ), or -1 on failure."

Yeah, perhaps I was overly concerned. The removed comment made me
think that someone could add code relying on the incorrect assumption
that the remaining bytes beyond the returned count are cleared out. On
the flip side, SimpleXLogPageRead always reads a whole page and
returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
contain random garbage bytes. Therefore, it's safe as long as the
caller doesn't access beyond the returned count. As a result, the
description you pointed out seems to be enough.

After all, the patch looks good to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Kyotaro Horiguchi (#6)

Re: Do away with zero-padding assumption before WALRead()

At Mon, 19 Feb 2024 11:56:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Yeah, perhaps I was overly concerned. The removed comment made me
think that someone could add code relying on the incorrect assumption
that the remaining bytes beyond the returned count are cleared out. On
the flip side, SimpleXLogPageRead always reads a whole page and
returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
contain random garbage bytes. Therefore, it's safe as long as the

Forgot to mention that there is a case involving non-initialized
pages, but it doesn't affect the correctness of the description you
pointed out.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

almost 2 years ago

In reply to: Kyotaro Horiguchi (#6)

Re: Do away with zero-padding assumption before WALRead()

On Mon, Feb 19, 2024 at 8:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

On
the flip side, SimpleXLogPageRead always reads a whole page and
returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
contain random garbage bytes.

Is this assumption true when wal_init_zero is off? I think when
wal_init_zero is off, the last few bytes of the last page from the WAL
file may contain garbage bytes i.e. not zero bytes, no?

Therefore, it's safe as long as the
caller doesn't access beyond the returned count. As a result, the
description you pointed out seems to be enough.

Right.

After all, the patch looks good to me.

Thanks. It was committed -
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Bharath Rupireddy (#8)

Re: Do away with zero-padding assumption before WALRead()

At Mon, 19 Feb 2024 11:02:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

After all, the patch looks good to me.

Thanks. It was committed -
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0.

Yeah. I realied that after I had already sent the mail.. No harm done:p

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center