Changed behavior in rewriteheap

Started by Erik Nordström about 1 year ago, 8 messages
#1Erik Nordström
erik@timescale.com

Hello,

I've noticed a change in behavior of the heap rewrite functionality in
PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the
functionality to implement a way to merge partitions in TimescaleDB. I am
using table_relation_copy_for_cluster() to write the data of several tables
to a single merged table, and then I do a heap swap on one of the original
tables while dropping the others. So, if you have 3 partitions and want to
merge them into one, I write all three partitions to a temporary heap,
swap the new heap into partition 1, and then drop partitions 2 and 3.

Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced
some changes that altered the behavior so that I only see data from one of
the partitions after merge (the last one written).

The commit that I think is responsible is the following:
https://github.com/postgres/postgres/commit/8af256524893987a3e534c6578dd60edfb782a77

Now, I realize that this is not a problem for PostgreSQL itself, since the
rewrite functionality isn't used there for the purpose I am using it for. In
my defense, the rewrite code seems to imply that it should be possible to
write more data to an existing heap, according to this comment in
begin_heap_rewrite: /* new_heap needn't be empty, just locked */.

I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and
then it works again. I haven't yet been able to figure out exactly what is
different but I will continue to try to narrow it down. In the meantime,
maybe someone on the mailing list has some insight on what could be the
issue and whether my approach is viable?

Regards,
Erik

--
Database Architect, Timescale

#2Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Erik Nordström (#1)
Re: Changed behavior in rewriteheap

On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik@timescale.com> wrote:

Hello,

I've noticed a change in behavior of the heap rewrite functionality in
PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the
functionality to implement a way to merge partitions in TimescaleDB. I am
using table_relation_copy_for_cluster() to write the data of several tables
to a single merged table, and then I do a heap swap on one of the original
tables while dropping the others. So, if you have 3 partitions and want to
merge them to one, then I write all three partitions to a temporary heap,
swap the new heap on partition 1 and then drop partitions 2 and 3.

Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced
some changes that altered the behavior so that I only see data from one of
the partitions after merge (the last one written).

The commit that I think is responsible is the following:
https://github.com/postgres/postgres/commit/8af256524893987a3e534c6578dd60edfb782a77

Now, I realize that this is not a problem for PostgreSQL itself since the
rewrite functionality isn't used for the purpose I am using it. To my
defense, the rewrite code seems to imply that it should be possible to
write more data to an existing heap according to this comment in
begin_heap_rewrite: /* new_heap needn't be empty, just locked */.

I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and
then it works again. I haven't yet been able to figure out exactly what is
different but I will continue to try to narrow it down. In the meantime,
maybe someone on the mailing list has some insight on what could be the
issue and whether my approach is viable?

I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush
(specifically after line 282), which assumes the bulk write system is used
exclusively on empty relations.

If you use a separate pair of begin/end_heap_rewrite for every relation you
merge into that heap, it'll re-initialize the bulk writer, which will thus
overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c doesn't
use that API, and thus isn't affected.

I've CC-ed Heikki as author of that patch; maybe a new API to indicate bulk
writes into an existing fork would make sense, to solve Timescale's issue
and fix rewriteheap's behavior?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3Erik Nordström
erik@timescale.com
In reply to: Matthias van de Meent (#2)
Re: Changed behavior in rewriteheap

On Fri, Nov 22, 2024 at 12:30 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik@timescale.com> wrote:

[...]

I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush
(specifically after line 282), which assumes the bulk write system is used
exclusively on empty relations.

If you use a separate pair of begin/end_heap_rewrite for every relation
you merge into that heap, it'll re-initialize the bulk writer, which will
thus overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c
doesn't use that API, and thus isn't affected.

I've CC-ed Heikki as author of that patch; maybe a new API to indicate
bulk writes into an existing fork would make sense, to solve Timescale's
issue and fix rewriteheap's behavior?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Yes, it looks like it is that code that zeros out the initial pages if the
write doesn't start at block 0. And it looks like I don't have access to
the bulk write state to set pages_written to trick it into believing those
pages have already been written. Maybe it is possible to add an option to
the bulk writer to skip zero-initialization? The comment for that code seems
to indicate it isn't strictly necessary anyway.

Regards,

Erik

#4Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Erik Nordström (#3)
1 attachment(s)
Re: Changed behavior in rewriteheap

On Fri, 22 Nov 2024 at 09:11, Erik Nordström <erik@timescale.com> wrote:

On Fri, Nov 22, 2024 at 12:30 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik@timescale.com> wrote:

Hello,

I've noticed a change in behavior of the heap rewrite functionality in PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the functionality to implement a way to merge partitions in TimescaleDB. I am using table_relation_copy_for_cluster() to write the data of several tables to a single merged table, and then I do a heap swap on one of the original tables while dropping the others. So, if you have 3 partitions and want to merge them to one, then I write all three partitions to a temporary heap, swap the new heap on partition 1 and then drop partitions 2 and 3.

Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced some changes that altered the behavior so that I only see data from one of the partitions after merge (the last one written).

[...]

I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and then it works again. I haven't yet been able to figure out exactly what is different but I will continue to try to narrow it down. In the meantime, maybe someone on the mailing list has some insight on what could be the issue and whether my approach is viable?

I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush (specifically after line 282), which assumes the bulk write system is used exclusively on empty relations.

If you use a separate pair of begin/end_heap_rewrite for every relation you merge into that heap, it'll re-initialize the bulk writer, which will thus overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c doesn't use that API, and thus isn't affected.

Yes, looks like it is that code that zeros out the initial pages if the write doesn't start at block 0. And it looks like I don't have access to the bulk write state to set pages_written to trick it to believe those pages have already been written. Maybe it is possible to add an option to the bulkwriter to skip zero-initialization? The comment for that code seems to indicate it isn't strictly necessary anyway.

I think I'd go with a patch like attached, where the bulk writer
registers that it started with .relsize pages in the relfork, and use
that for smgrextend() decisions. It now also records pages_written as
a separate but accurate value.

Side note: Interestingly, on a replica you would see all relations'
data in that table, or at least until the primary writes new FPIs for
those pages, since the smgrextend() which clears the pages isn't
WAL-logged and thus doesn't get transferred to a replica.

Kind regards,

Matthias van de Meent

Attachments:

v0-0001-Fix-data-loss-when-restarting-the-bulk_write-faci.patch (application/octet-stream)
From 3e2a9a13818f2d1b420bd45a9f2caae90205bb97 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 22 Nov 2024 13:48:11 +0100
Subject: [PATCH v0] Fix data loss when restarting the bulk_write facility
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If a user started a bulk write operation on a fork with existing data to
append data in bulk, the bulk_write machinery would zero out all previously
written pages up to the last page written by the new bulk_write operation.
Though PostgreSQL doesn't use the bulk_write facility twice on the same fork,
there are use cases where bulk writing to a fork with data makes sense, so
this commit fixes that issue.

Extensions could hit this issue if they used rewriteheap on multiple tables
to merge their data into a single relation, or manually created multiple
bulk_write operations for the same fork.

Reported-By: Erik Nordström <erik@timescale.com>
---
 src/backend/storage/smgr/bulk_write.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 1a5f3ce96e1..94b9d071c22 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -4,8 +4,10 @@
  *	  Efficiently and reliably populate a new relation
  *
  * The assumption is that no other backends access the relation while we are
- * loading it, so we can take some shortcuts.  Do not mix operations through
- * the regular buffer manager and the bulk loading interface!
+ * loading it, so we can take some shortcuts.  Pages already present in the
+ * indicated fork when the bulk write operation is started are not touched
+ * unless explicitly written to.  Do not mix operations through the regular
+ * buffer manager and the bulk loading interface!
  *
  * We bypass the buffer manager to avoid the locking overhead, and call
  * smgrextend() directly.  A downside is that the pages will need to be
@@ -68,8 +70,10 @@ struct BulkWriteState
 	int			npending;
 	PendingWrite pending_writes[MAX_PENDING_WRITES];
 
+	/* Pages written to disk */
+	BlockNumber	pages_written;
 	/* Current size of the relation */
-	BlockNumber pages_written;
+	BlockNumber relsize;
 
 	/* The RedoRecPtr at the time that the bulk operation started */
 	XLogRecPtr	start_RedoRecPtr;
@@ -107,6 +111,7 @@ smgr_bulk_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
 
 	state->npending = 0;
 	state->pages_written = 0;
+	state->relsize = smgrnblocks(smgr, forknum);
 
 	state->start_RedoRecPtr = GetRedoRecPtr();
 
@@ -280,7 +285,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 
 		PageSetChecksumInplace(page, blkno);
 
-		if (blkno >= bulkstate->pages_written)
+		if (blkno >= bulkstate->relsize)
 		{
 			/*
 			 * If we have to write pages nonsequentially, fill in the space
@@ -289,21 +294,22 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 			 * space will read as zeroes anyway), but it should help to avoid
 			 * fragmentation.  The dummy pages aren't WAL-logged though.
 			 */
-			while (blkno > bulkstate->pages_written)
+			while (blkno > bulkstate->relsize)
 			{
 				/* don't set checksum for all-zero page */
 				smgrextend(bulkstate->smgr, bulkstate->forknum,
-						   bulkstate->pages_written++,
+						   bulkstate->relsize++,
 						   &zero_buffer,
 						   true);
 			}
 
 			smgrextend(bulkstate->smgr, bulkstate->forknum, blkno, page, true);
-			bulkstate->pages_written = pending_writes[i].blkno + 1;
+			bulkstate->relsize = pending_writes[i].blkno + 1;
 		}
 		else
 			smgrwrite(bulkstate->smgr, bulkstate->forknum, blkno, page, true);
 		pfree(page);
+		bulkstate->pages_written++;
 	}
 
 	bulkstate->npending = 0;
-- 
2.45.2

#5Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Matthias van de Meent (#4)
Re: Changed behavior in rewriteheap

On 22/11/2024 15:02, Matthias van de Meent wrote:

I think I'd go with a patch like attached, where the bulk writer
registers that it started with .relsize pages in the relfork, and use
that for smgrextend() decisions. It now also records pages_written as
a separate but accurate value.

Looks good to me. Erik, can you confirm that Matthias's patch fixes the
problem for you?

--
Heikki Linnakangas
Neon (https://neon.tech)

#6Erik Nordström
erik@timescale.com
In reply to: Heikki Linnakangas (#5)
Re: Changed behavior in rewriteheap

On Fri, Nov 22, 2024 at 2:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 22/11/2024 15:02, Matthias van de Meent wrote:

I think I'd go with a patch like attached, where the bulk writer
registers that it started with .relsize pages in the relfork, and use
that for smgrextend() decisions. It now also records pages_written as
a separate but accurate value.

Looks good to me. Erik, can you confirm that Matthias's patch fixes the
problem for you?

Yes, it solves the issue so it looks good.

Just a minor nit: the code both uses blkno as a local variable for
pending_writes[i].blkno and accesses pending_writes[i].blkno directly.
Maybe it is better to just use the local variable. For example, change

++ b/src/backend/storage/smgr/bulk_write.c
@@ -304,7 +304,8 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
                        }
                        smgrextend(bulkstate->smgr, bulkstate->forknum,
blkno, page, true);
-                       bulkstate->relsize = pending_writes[i].blkno + 1;
+                       bulkstate->relsize++;
+                       Assert(bulkstate->relsize == blkno + 1);

Just a suggestion.

Thanks for the quick action!

-Erik

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Erik Nordström (#6)
Re: Changed behavior in rewriteheap

On 22/11/2024 15:56, Erik Nordström wrote:

Yes, it solves the issue so it looks good.

Just a minor nit: the code both uses blkno as a local variable for
pending_writes[i].blkno and accesses pending_writes[i].blkno directly.
Maybe it is better to just use the local variable. For example, change

++ b/src/backend/storage/smgr/bulk_write.c
@@ -304,7 +304,8 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
                        }
                        smgrextend(bulkstate->smgr, bulkstate->forknum, 
blkno, page, true);
-                       bulkstate->relsize = pending_writes[i].blkno + 1;
+                       bulkstate->relsize++;
+                       Assert(bulkstate->relsize == blkno + 1);

Just a suggestion.

Made that change and committed to master and REL_17_STABLE. I didn't
bother with the assertion though. I also removed the 'pages_written'
field, as it was no longer used for anything.

Thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

#8Erik Nordström
erik@timescale.com
In reply to: Heikki Linnakangas (#7)
Re: Changed behavior in rewriteheap

On Fri, Nov 22, 2024 at 3:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 22/11/2024 15:56, Erik Nordström wrote:

[...]

Made that change and committed to master and REL_17_STABLE. I didn't
bother with the assertion though. I also removed the 'pages_written'
field, as it was no longer used for anything.

Sounds good. Thank you again!

-Erik