logical replication of truncate command with trigger causes Assert

Started by Mark Dilgerover 4 years ago15 messages
#1Mark Dilger
mark.dilger@enterprisedb.com
1 attachment(s)

Hackers,

On master, when a statement level trigger is fired for a replicated truncate command, the following stack trace is generated:

TRAP: FailedAssertion("portal != NULL", File: "pquery.c", Line: 1760, PID: 93854)
0 postgres 0x0000000108e269f2 ExceptionalCondition + 130
1 postgres 0x0000000108bef2f4 EnsurePortalSnapshotExists + 100
2 postgres 0x0000000108a93231 _SPI_execute_plan + 529
3 postgres 0x0000000108a93c0f SPI_execute_plan_with_paramlist + 127
4 plpgsql.so 0x00000001098bf9e5 exec_stmt_execsql + 277
5 plpgsql.so 0x00000001098bbaf6 exec_stmts + 294
6 plpgsql.so 0x00000001098bb367 exec_stmt_block + 1127
7 plpgsql.so 0x00000001098ba57a plpgsql_exec_trigger + 442
8 plpgsql.so 0x00000001098cb5b1 plpgsql_call_handler + 305
9 postgres 0x0000000108a3137c ExecCallTriggerFunc + 348
10 postgres 0x0000000108a3447d afterTriggerInvokeEvents + 1517
11 postgres 0x0000000108a33bb0 AfterTriggerEndQuery + 128
12 postgres 0x0000000108a1a9e2 ExecuteTruncateGuts + 2210
13 postgres 0x0000000108b83369 apply_dispatch + 3913
14 postgres 0x0000000108b82185 LogicalRepApplyLoop + 485
15 postgres 0x0000000108b81f87 ApplyWorkerMain + 1047
16 postgres 0x0000000108b474a2 StartBackgroundWorker + 386
17 postgres 0x0000000108b55cf6 maybe_start_bgworkers + 1254
18 postgres 0x0000000108b54510 sigusr1_handler + 464
19 libsystem_platform.dylib 0x00007fff69f3d5fd _sigtramp + 29
20 ??? 0x0000000000000000 0x0 + 0
21 postgres 0x0000000108b537ae PostmasterMain + 3726
22 postgres 0x0000000108aaa140 help + 0
23 libdyld.dylib 0x00007fff69d44cc9 start + 1
24 ??? 0x0000000000000004 0x0 + 4

I believe the issue was introduced in commit 84f5c2908da which added EnsurePortalSnapshotExists. That's not going to work in the case of logical replication, because there isn't an ActivePortal nor a snapshot.

Attached patch v1-0001 reliably reproduces the problem, though you have to Ctrl-C out of it, because the logical replication gets stuck in a loop after the Assert is triggered. You can see the stack trace by opening tmp_check/log/021_truncate_subscriber.log

Attachments:

v1-0001-Adding-test-to-trigger-logical-replication-assert.patchapplication/octet-stream; name=v1-0001-Adding-test-to-trigger-logical-replication-assert.patch; x-unix-mode=0644Download
From 668024d17016a42637a54a1e4df6f55407e443b5 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 8 Jun 2021 14:30:41 -0700
Subject: [PATCH v1] Adding test to trigger logical replication assertion

---
 src/test/subscription/t/021_truncate.pl | 76 +++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
 create mode 100644 src/test/subscription/t/021_truncate.pl

diff --git a/src/test/subscription/t/021_truncate.pl b/src/test/subscription/t/021_truncate.pl
new file mode 100644
index 0000000000..d89290b8c0
--- /dev/null
+++ b/src/test/subscription/t/021_truncate.pl
@@ -0,0 +1,76 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# This test checks whether an after statement trigger fired as a consequence of
+# the logical replication of a truncate command will cause an Assertion.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $cmd;
+
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create schema on both publisher and subscriber
+$cmd = q(
+	CREATE TABLE public.t (i INTEGER);
+	ALTER TABLE public.t REPLICA IDENTITY FULL;
+);
+$node_publisher->safe_psql('postgres', $cmd);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+# Create after statement trigger on the subscriber only
+$cmd = q(
+	CREATE FUNCTION public.truncate_trigger_func() RETURNS TRIGGER AS $$
+	BEGIN
+		INSERT INTO public.t (i) VALUES (1);
+		RETURN NULL;
+	END;
+	$$ LANGUAGE plpgsql;
+	CREATE TRIGGER truncate_trigger
+		AFTER TRUNCATE ON public.t
+		FOR EACH STATEMENT EXECUTE FUNCTION public.truncate_trigger_func();
+	ALTER TABLE public.t
+		ENABLE ALWAYS TRIGGER truncate_trigger;
+);
+$node_subscriber->safe_psql('postgres', $cmd);
+
+# Create publication and subscription
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', qq(
+CREATE PUBLICATION pub FOR TABLE public.t;
+));
+$node_subscriber->safe_psql('postgres', qq(
+CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub;
+));
+$node_publisher->wait_for_catchup('sub');
+
+# Wait for initial sync to finish as well
+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+#
+# Check replication of TRUNCATE command causing the firing of a statement
+# level trigger
+#
+$node_publisher->safe_psql('postgres', 'TRUNCATE public.t');
+$node_publisher->wait_for_catchup('sub');
+
+#
+# Not reached -- If the subscriber backend Asserts as expected, we'll be stuck
+# waiting forever, above
+#
+is( $node_subscriber->safe_psql(
+	'postgres', "SELECT 1"), 1, "still alive");
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.21.1 (Apple Git-122.3)

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#1)
Re: logical replication of truncate command with trigger causes Assert

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On master, when a statement level trigger is fired for a replicated truncate command, the following stack trace is generated:

Hmm.

I believe the issue was introduced in commit 84f5c2908da which added EnsurePortalSnapshotExists. That's not going to work in the case of logical replication, because there isn't an ActivePortal nor a snapshot.

The right way to say that is "commit 84f5c2908da exposed the pre-existing
unsafe behavior of this code". It's not okay to run arbitrary user code
without holding a snapshot to protect TOAST dereference operations.

I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
grow some more snapshot management logic, but I've not looked at that
code much, so I don't have an opinion on just where to add it.

There's a reasonable case to be made that running user code outside
a Portal is a just-plain-bad idea, so we should fix the logical
apply worker to make it create a suitable Portal. I speculated about
that in the commit message for 84f5c2908da, but I don't feel like
taking point on such work.

regards, tom lane

#3Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#2)
Re: logical replication of truncate command with trigger causes Assert

On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The right way to say that is "commit 84f5c2908da exposed the pre-existing
unsafe behavior of this code". It's not okay to run arbitrary user code
without holding a snapshot to protect TOAST dereference operations.

Sure, I didn't expect that you'd broken things so much as we now have an Assert where, at least for simple commands, things were working back in April. Those things may not have been working correctly -- I'll have to do some more test development to see if I can get the pre-84f5c2908da to misbehave -- but this may appear to be a regression in version 14 if we don't do something.

Calling ExecuteTruncateGuts from within the logical replication worker was introduced in commit 039eb6e92f2, "Logical replication support for TRUNCATE", back in April 2018. So whatever we do will likely need to be backpatched.

I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
grow some more snapshot management logic, but I've not looked at that
code much, so I don't have an opinion on just where to add it.

I was looking at those for other reasons prior to hitting this bug. My purpose was to figure out how to get the code to respect privileges. Perhaps the solution to these two issues is related. I don't know yet.

As it stands, a subscription can only be created by a superuser, and the replication happens under that user's current_user and session_user. I naively thought that adding a built-in role pg_logical_replication which could create subscriptions would be of some use. I implemented that but, but now if I create a user named "replication_manager" with membership in pg_logical_replication but not superuser, it turns out that even though the apply worker runs as replication_manager, the insert/update/delete commands work without checking privileges. (They can insert/update/delete tables and execute functions owned by a database superuser for which "replication_manager" has no privileges.) So I need to go a bit further to get acl checks called from this code path.

There's a reasonable case to be made that running user code outside
a Portal is a just-plain-bad idea, so we should fix the logical
apply worker to make it create a suitable Portal. I speculated about
that in the commit message for 84f5c2908da, but I don't feel like
taking point on such work.

I'll dig into this a bit more.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#3)
Re: logical replication of truncate command with trigger causes Assert

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
grow some more snapshot management logic, but I've not looked at that
code much, so I don't have an opinion on just where to add it.

I was looking at those for other reasons prior to hitting this bug.

After looking at it a bit, I see a couple of options:

1. Just wrap the call of ExecuteTruncateGuts with
PushActiveSnapshot(GetTransactionSnapshot()) and PopActiveSnapshot().

2. Decide that we ought to ensure that a snapshot exists throughout
most of this code. It's not entirely obvious to me that there is no
code path reachable from, say, apply_handle_truncate's collection of
relation OIDs that needs a snapshot. If we went for that, I'd think
the right solution is to do PushActiveSnapshot right after each
ensure_transaction call, and then PopActiveSnapshot on the way out of
the respective subroutine. We could then drop the snapshot management
calls that are currently associated with the executor state.

My purpose was to figure out how to get the code to respect
privileges. Perhaps the solution to these two issues is related.
I don't know yet.

Doesn't seem tremendously related. But yeah, there is Stuff That
Is Missing in these code paths.

regards, tom lane

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#4)
Re: logical replication of truncate command with trigger causes Assert

On Wed, Jun 9, 2021 at 5:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Jun 8, 2021, at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I suppose that either apply_dispatch or LogicalRepApplyLoop needs to
grow some more snapshot management logic, but I've not looked at that
code much, so I don't have an opinion on just where to add it.

I was looking at those for other reasons prior to hitting this bug.

After looking at it a bit, I see a couple of options:

1. Just wrap the call of ExecuteTruncateGuts with
PushActiveSnapshot(GetTransactionSnapshot()) and PopActiveSnapshot().

2. Decide that we ought to ensure that a snapshot exists throughout
most of this code. It's not entirely obvious to me that there is no
code path reachable from, say, apply_handle_truncate's collection of
relation OIDs that needs a snapshot. If we went for that, I'd think
the right solution is to do PushActiveSnapshot right after each
ensure_transaction call, and then PopActiveSnapshot on the way out of
the respective subroutine. We could then drop the snapshot management
calls that are currently associated with the executor state.

+1 for the second option as with that, apart from what you said it
will take off some load from future developers to decide which part of
changes should be after acquiring snapshot.

--
With Regards,
Amit Kapila.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#5)
1 attachment(s)
Re: logical replication of truncate command with trigger causes Assert

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, Jun 9, 2021 at 5:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. Decide that we ought to ensure that a snapshot exists throughout
most of this code. It's not entirely obvious to me that there is no
code path reachable from, say, apply_handle_truncate's collection of
relation OIDs that needs a snapshot. If we went for that, I'd think
the right solution is to do PushActiveSnapshot right after each
ensure_transaction call, and then PopActiveSnapshot on the way out of
the respective subroutine. We could then drop the snapshot management
calls that are currently associated with the executor state.

+1 for the second option as with that, apart from what you said it
will take off some load from future developers to decide which part of
changes should be after acquiring snapshot.

Here's a draft patch for that. I decided the most sensible way to
organize this is to pair the existing ensure_transaction() subroutine
with a cleanup subroutine. Rather unimaginatively, perhaps, I renamed
it to begin_transaction_step and named the cleanup end_transaction_step.
(Better ideas welcome.)

As written, this'll result in creating and deleting a snapshot for some
stream-control messages that maybe don't need one; but the point here is
not to have to think too hard about whether they do, so that's OK with
me. There are more CommandCounterIncrement calls than before, too,
but (a) those are cheap if there's nothing to do and (b) it's not real
clear to me that the extra calls are not necessary.

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order? That seems many degrees short of acceptable.

regards, tom lane

Attachments:

reorganize-replication-worker-snapshots.patchtext/x-diff; charset=us-ascii; name=reorganize-replication-worker-snapshots.patchDownload
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447ea97..9e9b47ce4f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -282,30 +282,41 @@ should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 }
 
 /*
- * Make sure that we started local transaction.
+ * Begin one step (one INSERT, UPDATE, etc) of a replication transaction.
  *
- * Also switches to ApplyMessageContext as necessary.
+ * Start a transaction, if this is the first step (else we keep using the
+ * existing transaction).
+ * Also provide a global snapshot and ensure we run in ApplyMessageContext.
  */
-static bool
-ensure_transaction(void)
+static void
+begin_transaction_step(void)
 {
-	if (IsTransactionState())
-	{
-		SetCurrentStatementStartTimestamp();
-
-		if (CurrentMemoryContext != ApplyMessageContext)
-			MemoryContextSwitchTo(ApplyMessageContext);
+	SetCurrentStatementStartTimestamp();
 
-		return false;
+	if (!IsTransactionState())
+	{
+		StartTransactionCommand();
+		maybe_reread_subscription();
 	}
 
-	SetCurrentStatementStartTimestamp();
-	StartTransactionCommand();
-
-	maybe_reread_subscription();
+	PushActiveSnapshot(GetTransactionSnapshot());
 
 	MemoryContextSwitchTo(ApplyMessageContext);
-	return true;
+}
+
+/*
+ * Finish up one step of a replication transaction.
+ * Callers of begin_transaction_step() must also call this.
+ *
+ * We don't close out the transaction here, but we should increment
+ * the command counter to make the effects of this step visible.
+ */
+static void
+end_transaction_step(void)
+{
+	PopActiveSnapshot();
+
+	CommandCounterIncrement();
 }
 
 /*
@@ -359,13 +370,6 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
 	RangeTblEntry *rte;
 	ResultRelInfo *resultRelInfo;
 
-	/*
-	 * Input functions may need an active snapshot, as may AFTER triggers
-	 * invoked during finish_edata.  For safety, ensure an active snapshot
-	 * exists throughout all our usage of the executor.
-	 */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	edata = (ApplyExecutionData *) palloc0(sizeof(ApplyExecutionData));
 	edata->targetRel = rel;
 
@@ -433,8 +437,6 @@ finish_edata(ApplyExecutionData *edata)
 	ExecResetTupleTable(estate->es_tupleTable, false);
 	FreeExecutorState(estate);
 	pfree(edata);
-
-	PopActiveSnapshot();
 }
 
 /*
@@ -831,7 +833,7 @@ apply_handle_stream_start(StringInfo s)
 	 * transaction for handling the buffile, used for serializing the
 	 * streaming data and subxact info.
 	 */
-	ensure_transaction();
+	begin_transaction_step();
 
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
@@ -861,6 +863,8 @@ apply_handle_stream_start(StringInfo s)
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	end_transaction_step();
 }
 
 /*
@@ -937,7 +941,7 @@ apply_handle_stream_abort(StringInfo s)
 		StreamXidHash *ent;
 
 		subidx = -1;
-		ensure_transaction();
+		begin_transaction_step();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		for (i = subxact_data.nsubxacts; i > 0; i--)
@@ -958,7 +962,7 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
-
+			end_transaction_step();
 			CommitTransactionCommand();
 			return;
 		}
@@ -986,6 +990,7 @@ apply_handle_stream_abort(StringInfo s)
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
 
+		end_transaction_step();
 		CommitTransactionCommand();
 	}
 }
@@ -1013,7 +1018,8 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	ensure_transaction();
+	/* Make sure we have an open transaction */
+	begin_transaction_step();
 
 	/*
 	 * Allocate file handle and memory required to process all the messages in
@@ -1046,6 +1052,8 @@ apply_handle_stream_commit(StringInfo s)
 	in_remote_transaction = true;
 	pgstat_report_activity(STATE_RUNNING, NULL);
 
+	end_transaction_step();
+
 	/*
 	 * Read the entries one by one and pass them through the same logic as in
 	 * apply_dispatch.
@@ -1227,7 +1235,7 @@ apply_handle_insert(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
-	ensure_transaction();
+	begin_transaction_step();
 
 	relid = logicalrep_read_insert(s, &newtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
@@ -1238,6 +1246,7 @@ apply_handle_insert(StringInfo s)
 		 * transaction so it's safe to unlock it.
 		 */
 		logicalrep_rel_close(rel, RowExclusiveLock);
+		end_transaction_step();
 		return;
 	}
 
@@ -1266,7 +1275,7 @@ apply_handle_insert(StringInfo s)
 
 	logicalrep_rel_close(rel, NoLock);
 
-	CommandCounterIncrement();
+	end_transaction_step();
 }
 
 /*
@@ -1346,7 +1355,7 @@ apply_handle_update(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
-	ensure_transaction();
+	begin_transaction_step();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
 								   &newtup);
@@ -1358,6 +1367,7 @@ apply_handle_update(StringInfo s)
 		 * transaction so it's safe to unlock it.
 		 */
 		logicalrep_rel_close(rel, RowExclusiveLock);
+		end_transaction_step();
 		return;
 	}
 
@@ -1416,7 +1426,7 @@ apply_handle_update(StringInfo s)
 
 	logicalrep_rel_close(rel, NoLock);
 
-	CommandCounterIncrement();
+	end_transaction_step();
 }
 
 /*
@@ -1501,7 +1511,7 @@ apply_handle_delete(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
-	ensure_transaction();
+	begin_transaction_step();
 
 	relid = logicalrep_read_delete(s, &oldtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
@@ -1512,6 +1522,7 @@ apply_handle_delete(StringInfo s)
 		 * transaction so it's safe to unlock it.
 		 */
 		logicalrep_rel_close(rel, RowExclusiveLock);
+		end_transaction_step();
 		return;
 	}
 
@@ -1542,7 +1553,7 @@ apply_handle_delete(StringInfo s)
 
 	logicalrep_rel_close(rel, NoLock);
 
-	CommandCounterIncrement();
+	end_transaction_step();
 }
 
 /*
@@ -1867,7 +1878,7 @@ apply_handle_truncate(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
-	ensure_transaction();
+	begin_transaction_step();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
 
@@ -1958,7 +1969,7 @@ apply_handle_truncate(StringInfo s)
 		table_close(rel, NoLock);
 	}
 
-	CommandCounterIncrement();
+	end_transaction_step();
 }
 
 
#7Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#6)
Re: logical replication of truncate command with trigger causes Assert

On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a draft patch for that. I decided the most sensible way to
organize this is to pair the existing ensure_transaction() subroutine
with a cleanup subroutine. Rather unimaginatively, perhaps, I renamed
it to begin_transaction_step and named the cleanup end_transaction_step.
(Better ideas welcome.)

Thanks! The regression test I posted earlier passes with this patch applied.

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order? That seems many degrees short of acceptable.

Even if you weren't reading that correctly, this bit:

xid = pq_getmsgint(s, 4);

Assert(TransactionIdIsValid(xid));

simply asserts that the sending server didn't send an invalid subtransaction id.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#7)
Re: logical replication of truncate command with trigger causes Assert

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order? That seems many degrees short of acceptable.

Even if you weren't reading that correctly, this bit:

xid = pq_getmsgint(s, 4);

Assert(TransactionIdIsValid(xid));

simply asserts that the sending server didn't send an invalid subtransaction id.

Ugh, yeah. We should never be using Asserts to validate incoming
messages -- a test-and-elog is more appropriate.

regards, tom lane

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#7)
Re: logical replication of truncate command with trigger causes Assert

On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a draft patch for that. I decided the most sensible way to
organize this is to pair the existing ensure_transaction() subroutine
with a cleanup subroutine. Rather unimaginatively, perhaps, I renamed
it to begin_transaction_step and named the cleanup end_transaction_step.
(Better ideas welcome.)

Thanks! The regression test I posted earlier passes with this patch applied.

I have also read the patch and it looks good to me.

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order?

Yes. I think you are talking about Assert(!in_streamed_transaction).
There is no particular reason that such Asserts are required, so we
can change to test-and-elog as you suggested later in your email.

That seems many degrees short of acceptable.

Even if you weren't reading that correctly, this bit:

xid = pq_getmsgint(s, 4);

Assert(TransactionIdIsValid(xid));

simply asserts that the sending server didn't send an invalid subtransaction id.

This also needs to be changed to test-and-elog.

--
With Regards,
Amit Kapila.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#9)
1 attachment(s)
Re: logical replication of truncate command with trigger causes Assert

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order?

This also needs to be changed to test-and-elog.

Here's a proposed patch for this. It looks like pretty much all of the
bogosity is new with the streaming business. You might quibble with
which things I thought deserved elog versus ereport. Another thing
I'm wondering is how many of these messages really need to be
translated. We could use errmsg_internal and avoid burdening the
translators, perhaps.

regards, tom lane

Attachments:

clean-up-dubious-assert-usage.patchtext/x-diff; charset=us-ascii; name=clean-up-dubious-assert-usage.patchDownload
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 98c26002e8..26738d3589 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -177,7 +177,7 @@ bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
 /* fields valid only when processing streamed transaction */
-bool		in_streamed_transaction = false;
+static bool in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
@@ -345,7 +345,8 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
 	 */
 	xid = pq_getmsgint(s, 4);
 
-	Assert(TransactionIdIsValid(xid));
+	if (!TransactionIdIsValid(xid))
+		elog(ERROR, "invalid transaction ID in streamed replication transaction");
 
 	/* Add the new subxact to the array (unless already there). */
 	subxact_info_add(xid);
@@ -785,7 +786,10 @@ apply_handle_commit(StringInfo s)
 
 	logicalrep_read_commit(s, &commit_data);
 
-	Assert(commit_data.commit_lsn == remote_final_lsn);
+	if (commit_data.commit_lsn != remote_final_lsn)
+		elog(ERROR, "incorrect commit LSN %X/%X in commit message (expected %X/%X)",
+			 LSN_FORMAT_ARGS(commit_data.commit_lsn),
+			 LSN_FORMAT_ARGS(remote_final_lsn));
 
 	apply_handle_commit_internal(s, &commit_data);
 
@@ -824,7 +828,10 @@ apply_handle_stream_start(StringInfo s)
 	bool		first_segment;
 	HASHCTL		hash_ctl;
 
-	Assert(!in_streamed_transaction);
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg("duplicate STREAM START message")));
 
 	/*
 	 * Start a transaction on stream start, this transaction will be committed
@@ -873,7 +880,10 @@ apply_handle_stream_start(StringInfo s)
 static void
 apply_handle_stream_stop(StringInfo s)
 {
-	Assert(in_streamed_transaction);
+	if (!in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg("STREAM STOP message without STREAM START")));
 
 	/*
 	 * Close the file with serialized changes, and serialize information about
@@ -905,7 +915,10 @@ apply_handle_stream_abort(StringInfo s)
 	TransactionId xid;
 	TransactionId subxid;
 
-	Assert(!in_streamed_transaction);
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg("STREAM ABORT message without STREAM STOP")));
 
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
@@ -967,13 +980,12 @@ apply_handle_stream_abort(StringInfo s)
 			return;
 		}
 
-		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
-
 		ent = (StreamXidHash *) hash_search(xidhash,
 											(void *) &xid,
 											HASH_FIND,
 											&found);
-		Assert(found);
+		if (!found)
+			elog(ERROR, "transaction %u not found in hash table", xid);
 
 		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
@@ -1012,7 +1024,10 @@ apply_handle_stream_commit(StringInfo s)
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg("STREAM COMMIT message without STREAM STOP")));
 
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
@@ -1031,11 +1046,14 @@ apply_handle_stream_commit(StringInfo s)
 	/* open the spool file for the committed transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+
 	ent = (StreamXidHash *) hash_search(xidhash,
 										(void *) &xid,
 										HASH_FIND,
 										&found);
-	Assert(found);
+	if (!found)
+		elog(ERROR, "transaction %u not found in hash table", xid);
+
 	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
 
 	buffer = palloc(BLCKSZ);
@@ -1080,7 +1098,9 @@ apply_handle_stream_commit(StringInfo s)
 					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
 							path)));
 
-		Assert(len > 0);
+		if (len <= 0)
+			elog(ERROR, "incorrect length %d in streaming transaction's changes file \"%s\"",
+				 len, path);
 
 		/* make sure we have sufficiently large buffer */
 		buffer = repalloc(buffer, len);
@@ -1108,7 +1128,7 @@ apply_handle_stream_commit(StringInfo s)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
 				 nchanges, path);
 	}
 
#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#10)
Re: logical replication of truncate command with trigger causes Assert

On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, Jun 9, 2021 at 8:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jun 9, 2021, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Somewhat unrelated, but ... am I reading the code correctly that
apply_handle_stream_start and related routines are using Asserts
to check that the remote sent stream-control messages in the correct
order?

This also needs to be changed to test-and-elog.

Here's a proposed patch for this. It looks like pretty much all of the
bogosity is new with the streaming business.

Except for the change in change in apply_handle_commit which seems to
be from the time it is introduced in commit 7c4f5240

You might quibble with
which things I thought deserved elog versus report.

I wonder why you used elog in handle_streamed_transaction and
apply_handle_commit? It seems all the other places use ereport for
anything wrong it got from the protocol message.

Another thing
I'm wondering is how many of these messages really need to be
translated. We could use errmsg_internal and avoid burdening the
translators, perhaps.

Not sure but I see all existing similar ereport calls don't use errmsg_internal.

--
With Regards,
Amit Kapila.

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#11)
Re: logical replication of truncate command with trigger causes Assert

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another thing
I'm wondering is how many of these messages really need to be
translated. We could use errmsg_internal and avoid burdening the
translators, perhaps.

Not sure but I see all existing similar ereport calls don't use errmsg_internal.

I was thinking maybe we could mark all these replication protocol
violation errors non-translatable. While we don't want to crash on a
protocol violation, it shouldn't really be a user-facing case either.
Thoughts?

regards, tom lane

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#12)
Re: logical replication of truncate command with trigger causes Assert

On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Jun 11, 2021 at 12:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another thing
I'm wondering is how many of these messages really need to be
translated. We could use errmsg_internal and avoid burdening the
translators, perhaps.

Not sure but I see all existing similar ereport calls don't use errmsg_internal.

I was thinking maybe we could mark all these replication protocol
violation errors non-translatable. While we don't want to crash on a
protocol violation, it shouldn't really be a user-facing case either.

I don't see any problem with that as these are not directly related to
any user operation. So, +1 for making these non-translatable.

--
With Regards,
Amit Kapila.

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#13)
Re: logical replication of truncate command with trigger causes Assert

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was thinking maybe we could mark all these replication protocol
violation errors non-translatable. While we don't want to crash on a
protocol violation, it shouldn't really be a user-facing case either.

I don't see any problem with that as these are not directly related to
any user operation. So, +1 for making these non-translatable.

Done that way. On re-reading the code, there were a bunch more
Asserts that could be triggered by bad input data, so the committed
patch has rather more corrections than I posted before.

regards, tom lane

#15Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#14)
Re: logical replication of truncate command with trigger causes Assert

On 2021-Jun-12, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Jun 11, 2021 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was thinking maybe we could mark all these replication protocol
violation errors non-translatable. While we don't want to crash on a
protocol violation, it shouldn't really be a user-facing case either.

I don't see any problem with that as these are not directly related to
any user operation. So, +1 for making these non-translatable.

Done that way.

Good call, thanks. Not only it's not very useful to translate such
messages, but it's also quite a burden because some of them are
difficult to translate.

--
�lvaro Herrera Valdivia, Chile
"Tiene valor aquel que admite que es un cobarde" (Fernandel)