Fix possible 'unexpected data beyond EOF' on replica restart

Started by Anthonin Bonnefoy, 27 days ago, 5 messages

anthonin.bonnefoy@datadoghq.com

27 days ago

1 attachment(s)

Hi,

On restart, a replica can fail with an 'unexpected data beyond EOF in block
x of relation T/D/R' error. This happened on PG 17.7, and I've been able to
reproduce it on PG 18. It can happen under the following circumstances:

- A relation has a size of 400 blocks.
- Blocks 201 to 400 are empty.
- Block 200 has two rows.
- Blocks 101 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied the
truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails in
mdtruncate since 'nblocks > curnblk', but the caller isn't aware of that
and will still update the cached size to 200 blocks.
- The first FPW on block 200 is applied. XLogReadBufferForRedo will rely on
the cached size and incorrectly assume the page exists in the file, and thus
won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will
discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size is
100 blocks, so an extend is attempted. However, block 200 is already
present in the buffer table due to the first FPW. This triggers the
'unexpected data beyond EOF' since the page isn't new.
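
The replay sequence above can be sketched as a toy model. This is only an illustration, not PostgreSQL code: SimFork and the sim_* functions are hypothetical names, block numbers are 0-based (so the last block kept by a truncate to 200 blocks is block 199), and the model assumes that size lookups trust the cached value whenever it is valid.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_NBLOCKS UINT32_MAX  /* stand-in for InvalidBlockNumber */

/* Toy model of one relation fork: real file size plus the cached size. */
typedef struct
{
    uint32_t disk_nblocks;      /* what the kernel would report */
    uint32_t cached_nblocks;    /* stand-in for smgr_cached_nblocks */
} SimFork;

/* Size lookup: trust the cache when it is valid, else ask the "kernel". */
static uint32_t
sim_nblocks(SimFork *f)
{
    if (f->cached_nblocks == INVALID_NBLOCKS)
        f->cached_nblocks = f->disk_nblocks;
    return f->cached_nblocks;
}

/*
 * Truncate replay with the behaviour described above: a request larger
 * than the file is ignored on disk, but the cache is overwritten anyway.
 */
static void
sim_truncate_replay(SimFork *f, uint32_t nblocks)
{
    if (nblocks < f->disk_nblocks)
        f->disk_nblocks = nblocks;
    f->cached_nblocks = nblocks;
}

/* Would redo believe that this block already exists in the file? */
static bool
sim_block_assumed_on_disk(SimFork *f, uint32_t blkno)
{
    return blkno < sim_nblocks(f);
}

/* Checkpoint replay discards the cached size. */
static void
sim_drop_cache(SimFork *f)
{
    f->cached_nblocks = INVALID_NBLOCKS;
}
```

Starting from a fork with 100 blocks on disk, replaying the truncate to 200 blocks leaves the file untouched but makes block 199 look present. After the cache is dropped, the same lookup reports only 100 blocks, which is the inconsistency that the second FPW runs into.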

The issue can be reproduced with the following script:

"""
pgbench -i
# Prepare the relation
psql -c "DELETE FROM pgbench_accounts WHERE aid > 80000 AND aid != ALL('{90000, 90001}');"
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE OFF) pgbench_accounts;"

# Restartpoint here
psql -c "CHECKPOINT;"
psql -p 5433 -c "CHECKPOINT;"

# First truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# First FPW deletion
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90001;"

# Second FPW deletion
psql -c "CHECKPOINT;"
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90000;"

# Second truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# Give the replica some time to replay the truncate
psql -c "SELECT pg_sleep(1);"

# Stop without advancing the restartpoint
kill -9 $(pgrep -f "pg_data_replica")

# Restart should fail with the EOF error
pg_ctl -D pg_data_replica restart
"""

This assumes the replica is running on port 5433 with
hot_standby_feedback off (otherwise, tuples will be seen as 'not yet
removable'). I've used kill -9 to avoid advancing the restartpoint, but
I've also seen the issue happen with a clean shutdown.

The patch fixes the issue by moving smgr_cached_nblocks updates in
mdtruncate and only updating the cached value if truncate was successful.

Regards,
Anthonin Bonnefoy

Attachments:

v1-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch (application/octet-stream)

From b01ee0c2408669ced7154be9f0de71e8771a6a8c Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Tue, 16 Dec 2025 10:48:12 +0100
Subject: Fix 'unexpected data beyond EOF' on replica restart

On restart, a replica can fail with an 'unexpected data beyond EOF in
block 200 of relation T/D/R' error. This can happen under the following
circumstances:

- A relation has a size of 400 blocks.
  - Blocks 201 to 400 are empty.
  - Block 200 has two rows.
  - Blocks 100 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied
  the truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails, but
  it will still update the cache size to 200 blocks
- The first FPW on block 200 is applied, XLogReadBufferForRead will rely
  on the cached size and incorrectly assume the page exists in file,
  and thus won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will
  discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size
  is 100 blocks, an extend is attempted. However, the block 200 is
  already present in the buffer table due to the first FPW. This
  triggers the 'unexpected data beyond EOF' since the page isn't new.

This patch fixes the issue by moving smgr_cached_nblocks updates in
mdtruncate. If truncate size > old size, we set the cache to the old
size. Otherwise, on successful truncate, the cached size is set to
truncate size.
---
 src/backend/storage/smgr/md.c   | 26 +++++++++++++++++++++++++-
 src/backend/storage/smgr/smgr.c | 12 ------------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..d0d116f42ef 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1280,18 +1280,33 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 	BlockNumber priorblocks;
 	int			curopensegs;
 
+	/* Make the cached size is invalid if we encounter an error. */
+	reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+
 	if (nblocks > curnblk)
 	{
-		/* Bogus request ... but no complaint if InRecovery */
+		/*
+		 * This can happen when a relation was truncated multiple times and
+		 * the restartpoint is located before the truncates. On restart, the
+		 * relation on disk will have the size of the second truncate. As the
+		 * first truncate has a higher nblocks, mdtruncate will be called with
+		 * nblocks > curnblk during startup.
+		 */
 		if (InRecovery)
+		{
+			reln->smgr_cached_nblocks[forknum] = curnblk;
 			return;
+		}
 		ereport(ERROR,
 				(errmsg("could not truncate file \"%s\" to %u blocks: it's only %u blocks now",
 						relpath(reln->smgr_rlocator, forknum).str,
 						nblocks, curnblk)));
 	}
 	if (nblocks == curnblk)
+	{
+		reln->smgr_cached_nblocks[forknum] = curnblk;
 		return;					/* no work */
+	}
 
 	/*
 	 * Truncate segments, starting at the last one. Starting at the end makes
@@ -1357,6 +1372,15 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		}
 		curopensegs--;
 	}
+
+	/*
+	 * We might as well update the local smgr_cached_nblocks values. The smgr
+	 * cache inval message that this function sent will cause other backends
+	 * to invalidate their copies of smgr_cached_nblocks, and these ones too
+	 * at the next command boundary. But ensure they aren't outright wrong
+	 * until then.
+	 */
+	reln->smgr_cached_nblocks[forknum] = nblocks;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index bce37a36d51..b017266316e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -898,20 +898,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 	/* Do the truncation */
 	for (i = 0; i < nforks; i++)
 	{
-		/* Make the cached size is invalid if we encounter an error. */
-		reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
 		smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
 											   old_nblocks[i], nblocks[i]);
-
-		/*
-		 * We might as well update the local smgr_cached_nblocks values. The
-		 * smgr cache inval message that this function sent will cause other
-		 * backends to invalidate their copies of smgr_cached_nblocks, and
-		 * these ones too at the next command boundary. But ensure they aren't
-		 * outright wrong until then.
-		 */
-		reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
 	}
 }
 
-- 
2.51.0

Amul Sul

sulamul@gmail.com

26 days ago

In reply to: Anthonin Bonnefoy (#1)

Re: Fix possible 'unexpected data beyond EOF' on replica restart

On Tue, Dec 16, 2025 at 6:38 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

> [...]
> The patch fixes the issue by moving smgr_cached_nblocks updates in
> mdtruncate and only updating the cached value if truncate was successful.

Thanks for the detailed reproduction steps. I can see the reported issue,
and the proposed patch fixes it. The patch looks good to me except for the
following changes in smgrtruncate():

- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
old_nblocks[i], nblocks[i]);

The deleted code you moved to mdtruncate() should be kept where it
was. We cannot ensure that every extension using this interface will
adhere to the requirement that the comment describes. Furthermore,
an extension's routine might miss updating smgr_cached_nblocks.
To ensure that smgr_cached_nblocks is updated properly, we should also
add an assertion after the call, like this:

/*
* Ensure that the local smgr_cached_nblocks value is updated.
*/
Assert(reln->smgr_cached_nblocks[forknum[i]] != InvalidBlockNumber);
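
As a rough sketch of that contract (invalidate the cached size, call the implementation, then assert that the implementation set it again), the pattern could look like the following. This is a simplified model rather than the real smgr code: SimFork, sim_truncate_fn, sim_md_truncate and sim_smgr_truncate are all hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_NBLOCKS UINT32_MAX  /* stand-in for InvalidBlockNumber */

typedef struct
{
    uint32_t disk_nblocks;      /* real size of the file */
    uint32_t cached_nblocks;    /* stand-in for smgr_cached_nblocks */
} SimFork;

/* Hypothetical per-implementation truncate callback. */
typedef void (*sim_truncate_fn) (SimFork *f, uint32_t old_nblocks,
                                 uint32_t nblocks);

/* A well-behaved implementation: it caches the size that is really on disk. */
static void
sim_md_truncate(SimFork *f, uint32_t old_nblocks, uint32_t nblocks)
{
    if (nblocks < old_nblocks)
        f->disk_nblocks = nblocks;  /* only ever shrink the file */
    f->cached_nblocks = f->disk_nblocks;
}

/* Generic layer: invalidate, delegate, then check the contract. */
static void
sim_smgr_truncate(SimFork *f, sim_truncate_fn impl, uint32_t nblocks)
{
    uint32_t    old_nblocks = f->disk_nblocks;

    /* Keep the cached size invalid in case the implementation bails out. */
    f->cached_nblocks = INVALID_NBLOCKS;

    impl(f, old_nblocks, nblocks);

    /* The implementation is expected to have set a valid size. */
    assert(f->cached_nblocks != INVALID_NBLOCKS);
}
```

With this shape, an implementation that forgets to set the cached size trips the assertion in assert-enabled builds instead of leaving a stale value behind.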

Regards,
Amul

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

26 days ago

In reply to: Amul Sul (#2)

1 attachment(s)

Re: Fix possible 'unexpected data beyond EOF' on replica restart

On Wed, Dec 17, 2025 at 8:26 AM Amul Sul <sulamul@gmail.com> wrote:

> Thanks for the detailed reproduction steps. I can see the reported issue,
> and the proposed patch fixes it.

Thanks for the review!

> The deleted code you moved to mdtruncate() should be kept where it
> was. We cannot ensure that every extension using this interface will
> adhere to the requirement that the comment describes.

Yeah, I've overlooked the case of extensions. I've moved back the
InvalidBlockNumber assignment.

> Furthermore, an extension's routine might miss updating
> smgr_cached_nblocks. To ensure that smgr_cached_nblocks is updated
> properly, we should also add an assertion after the call, like this:
>
> /*
> * Ensure that the local smgr_cached_nblocks value is updated.
> */
> Assert(reln->smgr_cached_nblocks[forknum[i]] != InvalidBlockNumber);

Good point. I've added the assertion.

I wonder how critical it is to have an up to date value of
smgr_cached_nblocks after smgr_truncate. Leaving InvalidBlockNumber was
also an option as the next user will ask the kernel for the real size.
There are some functions like DropRelationBuffers which rely only on the
cached value, so it's probably safer to keep the same behaviour.

Regards,
Anthonin Bonnefoy

Attachments:

v2-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch (application/octet-stream)

From cfde2a18b3806fac0a2e8721ac5916755d68a6a6 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Tue, 16 Dec 2025 10:48:12 +0100
Subject: Fix 'unexpected data beyond EOF' on replica restart

On restart, a replica can fail with an 'unexpected data beyond EOF in
block 200 of relation T/D/R' error. This can happen under the following
circumstances:

- A relation has a size of 400 blocks.
  - Blocks 201 to 400 are empty.
  - Block 200 has two rows.
  - Blocks 100 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied
  the truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails, but
  it will still update the cache size to 200 blocks
- The first FPW on block 200 is applied, XLogReadBufferForRead will rely
  on the cached size and incorrectly assume the page exists in file,
  and thus won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will
  discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size
  is 100 blocks, an extend is attempted. However, the block 200 is
  already present in the buffer table due to the first FPW. This
  triggers the 'unexpected data beyond EOF' since the page isn't new.

This patch fixes the issue by moving smgr_cached_nblocks update in
mdtruncate. If truncate size > old size, we set the cache to the old
size. Otherwise, on successful truncate, the cached size is set to
truncate size.
---
 src/backend/storage/smgr/md.c   | 23 ++++++++++++++++++++++-
 src/backend/storage/smgr/smgr.c | 12 ++++++------
 2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..78cf8980f6b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1282,16 +1282,28 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 
 	if (nblocks > curnblk)
 	{
-		/* Bogus request ... but no complaint if InRecovery */
+		/*
+		 * This can happen when a relation was truncated multiple times and
+		 * the restartpoint is located before the truncates. On restart, the
+		 * relation on disk will have the size of the second truncate. As the
+		 * first truncate has a higher nblocks, mdtruncate will be called with
+		 * nblocks > curnblk during startup.
+		 */
 		if (InRecovery)
+		{
+			reln->smgr_cached_nblocks[forknum] = curnblk;
 			return;
+		}
 		ereport(ERROR,
 				(errmsg("could not truncate file \"%s\" to %u blocks: it's only %u blocks now",
 						relpath(reln->smgr_rlocator, forknum).str,
 						nblocks, curnblk)));
 	}
 	if (nblocks == curnblk)
+	{
+		reln->smgr_cached_nblocks[forknum] = curnblk;
 		return;					/* no work */
+	}
 
 	/*
 	 * Truncate segments, starting at the last one. Starting at the end makes
@@ -1357,6 +1369,15 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		}
 		curopensegs--;
 	}
+
+	/*
+	 * We might as well update the local smgr_cached_nblocks values. The smgr
+	 * cache inval message that this function sent will cause other backends
+	 * to invalidate their copies of smgr_cached_nblocks, and these ones too
+	 * at the next command boundary. But ensure they aren't outright wrong
+	 * until then.
+	 */
+	reln->smgr_cached_nblocks[forknum] = nblocks;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index bce37a36d51..90d46b1ae10 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -905,13 +905,13 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 											   old_nblocks[i], nblocks[i]);
 
 		/*
-		 * We might as well update the local smgr_cached_nblocks values. The
-		 * smgr cache inval message that this function sent will cause other
-		 * backends to invalidate their copies of smgr_cached_nblocks, and
-		 * these ones too at the next command boundary. But ensure they aren't
-		 * outright wrong until then.
+		 * smgr_truncate may do nothing, leaving the relation with a size of
+		 * old_nblocks. However, this isn't reported to smgr_truncate's
+		 * caller. Thus, it's expected that smgr_truncate's implementation
+		 * update smgr_cached_nblocks' to nblocks on successful truncate, or
+		 * old_nblocks if nothing was done.
 		 */
-		reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+		Assert(reln->smgr_cached_nblocks[forknum[i]] != InvalidBlockNumber);
 	}
 }
 
-- 
2.51.0

Heikki Linnakangas

hlinnaka@iki.fi

25 days ago

In reply to: Anthonin Bonnefoy (#3)

1 attachment(s)

Re: Fix possible 'unexpected data beyond EOF' on replica restart

On 17/12/2025 10:40, Anthonin Bonnefoy wrote:

> On Wed, Dec 17, 2025 at 8:26 AM Amul Sul <sulamul@gmail.com> wrote:
>
>> The deleted code you moved to mdtruncate() should be kept where it
>> was. We cannot ensure that every extension using this interface will
>> adhere to the requirement that the comment describes.
>
> Yeah, I've overlooked the case of extensions. I've moved back the
> InvalidBlockNumber assignment.

The smgr interface isn't really an extension point, so we don't need to
worry about that. (I wish it was, but that's a different story [1])

I don't think mdtruncate() should modify smgr_cached_nblocks, we should
keep that in smgrtruncate(). smgr.c is responsible for all other updates
of smgr_cached_nblocks, so doing it in mdtruncate() would be a layering
violation.

I'm thinking that we should do the attached. Untested, and we should
also add a comment to smgrtruncate() and mdtruncate() to explain how
they behave if nblocks > curnblk.
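
A minimal standalone sketch of the rule used for the cached size: the helper name sim_cached_size_after_truncate is hypothetical, and plain uint32_t stands in for BlockNumber.

```c
#include <stdint.h>

/*
 * Size to remember after a truncate request. A request larger than the
 * current size leaves the file alone, so the old size is kept; otherwise
 * the file now has the requested size.
 */
static uint32_t
sim_cached_size_after_truncate(uint32_t old_nblocks, uint32_t nblocks)
{
    return nblocks > old_nblocks ? old_nblocks : nblocks;
}
```

For the scenario in this thread, a file of 100 blocks asked to truncate to 200 keeps a cached size of 100, while a normal truncate from 400 to 200 caches 200.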

I wonder if we should move the whole "if (nblocks > curnblk)" check and
ereport() from mdtruncate() to smgrtruncate(). That logic doesn't really
depend on anything specific to md.c. If you'd imagine a different smgr
implementation, it'd need to just copy-paste that check. It's the
caller's mistake if it passes nblocks > curnblk, when not in recovery.
Then again, we do have other places in md.c too that behave differently
when InRecovery.
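
To make that idea concrete, here is one way such an implementation-independent check could look. It is only a sketch under assumptions: the enum, the function name and the in_recovery parameter (standing in for the InRecovery global) are hypothetical, and a real version would raise the error with ereport() rather than return a code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    SIM_TRUNCATE_PROCEED,       /* file must be shortened */
    SIM_TRUNCATE_SKIP,          /* nothing to do */
    SIM_TRUNCATE_ERROR          /* bogus request from the caller */
} SimTruncateAction;

/* Decide what to do with a truncate request, independent of storage details. */
static SimTruncateAction
sim_check_truncate(uint32_t curnblk, uint32_t nblocks, bool in_recovery)
{
    if (nblocks > curnblk)
    {
        /* Tolerated during recovery, a caller mistake otherwise. */
        return in_recovery ? SIM_TRUNCATE_SKIP : SIM_TRUNCATE_ERROR;
    }
    if (nblocks == curnblk)
        return SIM_TRUNCATE_SKIP;   /* no work */
    return SIM_TRUNCATE_PROCEED;
}
```

Nothing in this decision depends on segment files or other md.c internals, which is the argument for hoisting it; the counter-argument is that md.c already has other InRecovery special cases.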

What do you think?

> I wonder how critical it is to have an up to date value of
> smgr_cached_nblocks after smgr_truncate. Leaving InvalidBlockNumber was
> also an option as the next user will ask the kernel for the real size.
> There are some functions like DropRelationBuffers which rely only on the
> cached value, so it's probably safer to keep the same behaviour.

Leaving it invalid should work. But as the comment says, we might as
well update the cached value since we have the value at hand. It's just
that we were doing it wrong.

[1]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com

- Heikki

Attachments:

v3-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch (text/x-patch; charset=UTF-8)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f9066ab8c49..abb51f0a0bb 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -911,7 +911,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 		 * these ones too at the next command boundary. But ensure they aren't
 		 * outright wrong until then.
 		 */
-		reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+		reln->smgr_cached_nblocks[forknum[i]] =
+			nblocks[i] > old_nblocks[i] ? old_nblocks[i] : nblocks[i];
 	}
 }

Anthonin Bonnefoy

anthonin.bonnefoy@datadoghq.com

25 days ago

In reply to: Heikki Linnakangas (#4)

1 attachment(s)

Re: Fix possible 'unexpected data beyond EOF' on replica restart

On Thu, Dec 18, 2025 at 3:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> The smgr interface isn't really an extension point, so we don't need to
> worry about that. (I wish it was, but that's a different story [1])

Thanks for the additional context.

> I don't think mdtruncate() should modify smgr_cached_nblocks, we should
> keep that in smgrtruncate(). smgr.c is responsible for all other updates
> of smgr_cached_nblocks, so doing it in mdtruncate() would be a layering
> violation.
>
> I'm thinking that we should do the attached. Untested, and we should
> also add a comment to smgrtruncate() and mdtruncate() to explain how
> they behave if nblocks > curnblk.

This appears to work. The cached size is correctly updated and my
script doesn't trigger the bug anymore. I've updated the patch with
this approach.

> I wonder if we should move the whole "if (nblocks > curnblk)" check and
> ereport() from mdtruncate() to smgrtruncate(). That logic doesn't really
> depend on anything specific to md.c. If you'd imagine a different smgr
> implementation, it'd need to just copy-paste that check. It's the
> caller's mistake if it passes nblocks > curnblk, when not in recovery.
> Then again, we do have other places in md.c too that behave differently
> when InRecovery.
>
> What do you think?

I imagine we will still want some sanity checks in mdtruncate(), or at
least an Assert to make sure the provided block values are correct.
This also assumes that mdtruncate() will only ever have one caller;
any other caller would have to duplicate the check.

And in the context of a backpatch, the current approach has the
benefit of minimising the amount of change, so I am slightly partial
to keeping the check in mdtruncate().

Regards,
Anthonin Bonnefoy

Attachments:

v4-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch (application/octet-stream)

From b3ad993bd6f8758f0a91354b8448e01647207bf3 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Tue, 16 Dec 2025 10:48:12 +0100
Subject: Fix 'unexpected data beyond EOF' on replica restart

On restart, a replica can fail with an 'unexpected data beyond EOF in
block 200 of relation T/D/R' error. This can happen under the following
circumstances:

- A relation has a size of 400 blocks.
  - Blocks 201 to 400 are empty.
  - Block 200 has two rows.
  - Blocks 100 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied
  the truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails, but
  it will still update the cache size to 200 blocks
- The first FPW on block 200 is applied, XLogReadBufferForRead will rely
  on the cached size and incorrectly assume the page exists in file,
  and thus won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will
  discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size
  is 100 blocks, an extend is attempted. However, the block 200 is
  already present in the buffer table due to the first FPW. This
  triggers the 'unexpected data beyond EOF' since the page isn't new.

This patch fixes the issue by only updating smgr_cached_nblocks when
the truncated size is smaller. When the truncated size is higher, the
file isn't modified and we restore the old cached value.
---
 src/backend/storage/smgr/md.c   |  3 +++
 src/backend/storage/smgr/smgr.c | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..c2c7c66d42b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1272,6 +1272,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  * functions for this relation or handled interrupts in between.  This makes
  * sure we have opened all active segments, so that truncate loop will get
  * them all!
+ *
+ * If nblocks > curnblk, the request is ignored when we are in InRecovery,
+ * otherwise, an error is raised.
  */
 void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index bce37a36d51..ee2e25a35c8 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -870,6 +870,9 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
  * be called in a critical section, but the current size must be checked
  * outside the critical section, and no interrupts or smgr functions relating
  * to this relation should be called in between.
+ *
+ * If the specified number of blocks is higher than the current size, the
+ * request is ignored when we are InRecovery, otherwise, an error is raised.
  */
 void
 smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
@@ -910,8 +913,15 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 		 * backends to invalidate their copies of smgr_cached_nblocks, and
 		 * these ones too at the next command boundary. But ensure they aren't
 		 * outright wrong until then.
+		 *
+		 * nblocks > oldblocks can happen when a relation is truncated
+		 * multiple times and the restartpoint is located before the
+		 * truncates. The relation on disk will have the size of the second
+		 * truncate and when replaying the first truncate, we will have
+		 * nblocks > curnblk. We must restore old_nblocks when this happens.
 		 */
-		reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+		reln->smgr_cached_nblocks[forknum[i]] =
+			nblocks[i] > old_nblocks[i] ? old_nblocks[i] : nblocks[i];
 	}
 }
 
-- 
2.51.0