logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

Started by Tomas Vondraabout 7 years ago9 messages
#1Tomas Vondra
tomas.vondra@2ndquadrant.com
1 attachment(s)

Hi,

It seems we have pretty annoying problem with logical decoding when
performing VACUUM FULL / CLUSTER on a table with toast-ed data.

The trouble is that the rewritten heap is WAL-logged using XLOG/FPI
records, the TOAST data is logged as regular INSERT records. XLOG/FPI is
ignored in logical decoding, and so reorderbuffer never gets those
records. But we do decode the TOAST data, and reorderbuffer stashes them
in toast_hash hash table. Which gets reset only when handling a row from
the main heap, and that never arrives. So we end up stashing all the
TOAST data in memory :-(

So do VACUUM FULL (or CLUSTER) on a sufficiently large table, and you're
likely to break any logical replication connection. And it does not
matter if you replicate this particular table.

Luckily enough, this can leverage some of the pieces introduced by
e9edc1ba which was meant to deal with rewrites of system tables, and in
raw_heap_insert it added this:

/*
* The new relfilenode's relcache entrye doesn't have the necessary
* information to determine whether a relation should emit data for
* logical decoding. Force it to off if necessary.
*/
if (!RelationIsLogicallyLogged(state->rs_old_rel))
options |= HEAP_INSERT_NO_LOGICAL;

As raw_heap_insert is used only for heap rewrites, we can simply remove
the if condition and use the HEAP_INSERT_NO_LOGICAL flag for all TOAST
data logged from here.

This does fix the issue, because we still decode the TOAST changes but
there are no data and so

if (change->data.tp.newtuple != NULL)
{
dlist_delete(&change->node);
ReorderBufferToastAppendChunk(rb, txn, relation,
change);
}

ends up not stashing the change in the hash table. It's imperfect,
because we still decode the changes (and stash them to disk), but ISTM
that can be fixed by tweaking DecodeInsert a bit to just ignore those
changes entirely.

Attached is a PoC patch with these two fixes.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

rewrite-toast-fix.patchtext/x-patch; name=rewrite-toast-fix.patchDownload
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..ce6f9ed117 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -658,13 +658,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 		if (!state->rs_use_wal)
 			options |= HEAP_INSERT_SKIP_WAL;
 
-		/*
-		 * The new relfilenode's relcache entrye doesn't have the necessary
-		 * information to determine whether a relation should emit data for
-		 * logical decoding.  Force it to off if necessary.
-		 */
-		if (!RelationIsLogicallyLogged(state->rs_old_rel))
-			options |= HEAP_INSERT_NO_LOGICAL;
+		/* do not decode TOAST data for heap rewrites */
+		options |= HEAP_INSERT_NO_LOGICAL;
 
 		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 										 options);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afb497227e..f23cb120e8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -665,6 +665,9 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 static void
 DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
+	Size		datalen;
+	char	   *tupledata;
+	Size		tuplelen;
 	XLogReaderState *r = buf->record;
 	xl_heap_insert *xlrec;
 	ReorderBufferChange *change;
@@ -672,6 +675,10 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_insert *) XLogRecGetData(r);
 
+	/* ignore insert records without new tuples */
+	if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
+		return;
+
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL);
 	if (target_node.dbNode != ctx->slot->data.database)
@@ -690,17 +697,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
-	if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
-	{
-		Size		datalen;
-		char	   *tupledata = XLogRecGetBlockData(r, 0, &datalen);
-		Size		tuplelen = datalen - SizeOfHeapHeader;
+	tupledata = XLogRecGetBlockData(r, 0, &datalen);
+	tuplelen = datalen - SizeOfHeapHeader;
 
-		change->data.tp.newtuple =
-			ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+	change->data.tp.newtuple =
+		ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
 
-		DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
-	}
+	DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
 
 	change->data.tp.clear_toast_afterwards = true;
 
#2Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#1)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

On Mon, Nov 19, 2018 at 6:52 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

It seems we have pretty annoying problem with logical decoding when
performing VACUUM FULL / CLUSTER on a table with toast-ed data.

The trouble is that the rewritten heap is WAL-logged using XLOG/FPI
records, the TOAST data is logged as regular INSERT records. XLOG/FPI is
ignored in logical decoding, and so reorderbuffer never gets those
records. But we do decode the TOAST data, and reorderbuffer stashes them
in toast_hash hash table. Which gets reset only when handling a row from
the main heap, and that never arrives. So we end up stashing all the
TOAST data in memory :-(

So do VACUUM FULL (or CLUSTER) on a sufficiently large table, and you're
likely to break any logical replication connection. And it does not
matter if you replicate this particular table.

Luckily enough, this can leverage some of the pieces introduced by
e9edc1ba which was meant to deal with rewrites of system tables, and in
raw_heap_insert it added this:

/*
* The new relfilenode's relcache entrye doesn't have the necessary
* information to determine whether a relation should emit data for
* logical decoding. Force it to off if necessary.
*/
if (!RelationIsLogicallyLogged(state->rs_old_rel))
options |= HEAP_INSERT_NO_LOGICAL;

As raw_heap_insert is used only for heap rewrites, we can simply remove
the if condition and use the HEAP_INSERT_NO_LOGICAL flag for all TOAST
data logged from here.

This fix seems fine to me.

This does fix the issue, because we still decode the TOAST changes but
there are no data and so

if (change->data.tp.newtuple != NULL)
{
dlist_delete(&change->node);
ReorderBufferToastAppendChunk(rb, txn, relation,
change);
}

ends up not stashing the change in the hash table.

With the below change you proposed can we remove the above condition
because toast-insertion changes being processed by the reorder buffer
always have a new tuple? If a toast-insertion record doesn't have a
new tuple it has already ignored when decoding.

It's imperfect,
because we still decode the changes (and stash them to disk), but ISTM
that can be fixed by tweaking DecodeInsert a bit to just ignore those
changes entirely.

Attached is a PoC patch with these two fixes.

I think this change is also fine.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Masahiko Sawada (#2)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

On 11/19/18 10:28 AM, Masahiko Sawada wrote:

On Mon, Nov 19, 2018 at 6:52 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

It seems we have pretty annoying problem with logical decoding when
performing VACUUM FULL / CLUSTER on a table with toast-ed data.

The trouble is that the rewritten heap is WAL-logged using XLOG/FPI
records, the TOAST data is logged as regular INSERT records. XLOG/FPI is
ignored in logical decoding, and so reorderbuffer never gets those
records. But we do decode the TOAST data, and reorderbuffer stashes them
in toast_hash hash table. Which gets reset only when handling a row from
the main heap, and that never arrives. So we end up stashing all the
TOAST data in memory :-(

So do VACUUM FULL (or CLUSTER) on a sufficiently large table, and you're
likely to break any logical replication connection. And it does not
matter if you replicate this particular table.

Luckily enough, this can leverage some of the pieces introduced by
e9edc1ba which was meant to deal with rewrites of system tables, and in
raw_heap_insert it added this:

/*
* The new relfilenode's relcache entrye doesn't have the necessary
* information to determine whether a relation should emit data for
* logical decoding. Force it to off if necessary.
*/
if (!RelationIsLogicallyLogged(state->rs_old_rel))
options |= HEAP_INSERT_NO_LOGICAL;

As raw_heap_insert is used only for heap rewrites, we can simply remove
the if condition and use the HEAP_INSERT_NO_LOGICAL flag for all TOAST
data logged from here.

This fix seems fine to me.

This does fix the issue, because we still decode the TOAST changes but
there are no data and so

if (change->data.tp.newtuple != NULL)
{
dlist_delete(&change->node);
ReorderBufferToastAppendChunk(rb, txn, relation,
change);
}

ends up not stashing the change in the hash table.

With the below change you proposed can we remove the above condition
because toast-insertion changes being processed by the reorder buffer
always have a new tuple? If a toast-insertion record doesn't have a
new tuple it has already ignored when decoding.

Good point. I think you're right the reorderbuffer part may be
simplified as you propose.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
1 attachment(s)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

On 11/19/18 11:44 AM, Tomas Vondra wrote:

On 11/19/18 10:28 AM, Masahiko Sawada wrote:

On Mon, Nov 19, 2018 at 6:52 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

This does fix the issue, because we still decode the TOAST changes but
there are no data and so

     if (change->data.tp.newtuple != NULL)
     {
         dlist_delete(&change->node);
         ReorderBufferToastAppendChunk(rb, txn, relation,
                                       change);
     }

ends up not stashing the change in the hash table.

With the below change you proposed can we remove the above condition
because toast-insertion changes being processed by the reorder buffer
always have a new tuple? If a toast-insertion record doesn't have a
new tuple it has already ignored when decoding.

Good point. I think you're right the reorderbuffer part may be
simplified as you propose.

OK, here's an updated patch, tweaking the reorderbuffer part. I plan to
push this sometime mid next week.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

rewrite-toast-fix-v2.patchtext/x-patch; name=rewrite-toast-fix-v2.patchDownload
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d5bd282f8c..44caeca336 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -659,12 +659,11 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			options |= HEAP_INSERT_SKIP_WAL;
 
 		/*
-		 * The new relfilenode's relcache entrye doesn't have the necessary
-		 * information to determine whether a relation should emit data for
-		 * logical decoding.  Force it to off if necessary.
+		 * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
+		 * for the TOAST table are not logically decoded.  The main heap is
+		 * WAL-logged as XLOG FPI records, which are not logically decoded.
 		 */
-		if (!RelationIsLogicallyLogged(state->rs_old_rel))
-			options |= HEAP_INSERT_NO_LOGICAL;
+		options |= HEAP_INSERT_NO_LOGICAL;
 
 		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 										 options);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afb497227e..e3b05657f8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -665,6 +665,9 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 static void
 DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
+	Size		datalen;
+	char	   *tupledata;
+	Size		tuplelen;
 	XLogReaderState *r = buf->record;
 	xl_heap_insert *xlrec;
 	ReorderBufferChange *change;
@@ -672,6 +675,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_insert *) XLogRecGetData(r);
 
+	/*
+	 * Ignore insert records without new tuples (this does happen when
+	 * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
+	 */
+	if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
+		return;
+
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL);
 	if (target_node.dbNode != ctx->slot->data.database)
@@ -690,17 +700,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
-	if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
-	{
-		Size		datalen;
-		char	   *tupledata = XLogRecGetBlockData(r, 0, &datalen);
-		Size		tuplelen = datalen - SizeOfHeapHeader;
+	tupledata = XLogRecGetBlockData(r, 0, &datalen);
+	tuplelen = datalen - SizeOfHeapHeader;
 
-		change->data.tp.newtuple =
-			ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+	change->data.tp.newtuple =
+		ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
 
-		DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
-	}
+	DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
 
 	change->data.tp.clear_toast_afterwards = true;
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..23466bade2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1598,17 +1598,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * transaction's changes. Otherwise it will get
 						 * freed/reused while restoring spooled data from
 						 * disk.
-						 *
-						 * But skip doing so if there's no tuple-data. That
-						 * happens if a non-mapped system catalog with a toast
-						 * table is rewritten.
 						 */
-						if (change->data.tp.newtuple != NULL)
-						{
-							dlist_delete(&change->node);
-							ReorderBufferToastAppendChunk(rb, txn, relation,
-														  change);
-						}
+						Assert(change->data.tp.newtuple != NULL);
+
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
 					}
 
 			change_done:
#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#4)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

On 11/24/18 12:20 AM, Tomas Vondra wrote:

...

OK, here's an updated patch, tweaking the reorderbuffer part. I plan
to push this sometime mid next week.

Pushed and backpatched to 9.4- (same as e9edc1ba).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#5)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

Hi,

On 2018-11-28 02:04:18 +0100, Tomas Vondra wrote:

On 11/24/18 12:20 AM, Tomas Vondra wrote:

...

OK, here's an updated patch, tweaking the reorderbuffer part. I plan
to push this sometime mid next week.

Pushed and backpatched to 9.4- (same as e9edc1ba).

Backpatching seems on the more aggressive end of things for an
optimization. Could you at least announce that beforehand next time?

Greetings,

Andres Freund

#7Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#6)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

Hi,

On 28/11/2018 02:14, Andres Freund wrote:

Hi,

On 2018-11-28 02:04:18 +0100, Tomas Vondra wrote:

On 11/24/18 12:20 AM, Tomas Vondra wrote:

...

OK, here's an updated patch, tweaking the reorderbuffer part. I plan
to push this sometime mid next week.

Pushed and backpatched to 9.4- (same as e9edc1ba).

Backpatching seems on the more aggressive end of things for an
optimization. Could you at least announce that beforehand next time?

Well, it may be optimization, but from what I've seen the problems
arising from this can easily prevent logical replication from working
altogether as reorder buffer hits OOM on bigger tables. So ISTM that it
does warrant backpatch.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#7)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

Hi,

On 2018-11-28 03:06:58 +0100, Petr Jelinek wrote:

On 28/11/2018 02:14, Andres Freund wrote:

On 2018-11-28 02:04:18 +0100, Tomas Vondra wrote:

Pushed and backpatched to 9.4- (same as e9edc1ba).

Backpatching seems on the more aggressive end of things for an
optimization. Could you at least announce that beforehand next time?

Well, it may be optimization, but from what I've seen the problems
arising from this can easily prevent logical replication from working
altogether as reorder buffer hits OOM on bigger tables. So ISTM that it
does warrant backpatch.

I think that's a fair argument to be made. But it should be made both
before the commit and in the commit message.

Greetings,

Andres Freund

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

On 11/28/18 3:31 AM, Andres Freund wrote:

Hi,

On 2018-11-28 03:06:58 +0100, Petr Jelinek wrote:

On 28/11/2018 02:14, Andres Freund wrote:

On 2018-11-28 02:04:18 +0100, Tomas Vondra wrote:

Pushed and backpatched to 9.4- (same as e9edc1ba).

Backpatching seems on the more aggressive end of things for an
optimization. Could you at least announce that beforehand next time?

Well, it may be optimization, but from what I've seen the problems
arising from this can easily prevent logical replication from working
altogether as reorder buffer hits OOM on bigger tables. So ISTM that it
does warrant backpatch.

I think that's a fair argument to be made. But it should be made
both before the commit and in the commit message.

Understood. I thought I stated the intent to backpatch when announcing
I'll push it this week, but clearly that did not happen. Oops :-(

That being said, I see this more like a bugfix than an optimization,
because (as Petr already stated) rewrite of any sufficiently large table
can irreparably break the replication. So it's not just slower, it dies.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services