[BUG] Logical replica crash if there was an error in a function.

Started by Anton A. Melnikovover 3 years ago23 messages

Anton A. Melnikov

aamelnikov@inbox.ru

over 3 years ago

3 attachment(s)

Hello!

Here is a fix for the bug first described in:
/messages/by-id/adf0452f-8c6b-7def-d35e-ab516c80088e@inbox.ru

Reproduction:
1) On master with 'wal_level = logical' execute mascmd.sql attached.

2) On replica substitute the correct port in repcmd.sql and execute it.

3) On master execute command:
INSERT INTO rul_rule_set VALUES ('1', 'name','1','age','true');

Replica will crash with:
FailedAssertion("ActivePortal && ActivePortal->status == PORTAL_ACTIVE", File: "pg_proc.c", Line: 1038, PID: 42894)
in infinite loop.

After applying this patch replica will give the correct error message instead of assertion:

2022-08-21 17:08:39.935 MSK [143171] ERROR: relation "rul_rule_set" does not exist at character 172
2022-08-21 17:08:39.935 MSK [143171] QUERY:
-- Last modified: 2022-08-21 17:08:39.930842+03
with parameters as (
<<--- skipped by me --- >>>
)
2022-08-21 17:08:39.935 MSK [143171] CONTEXT: SQL statement "create or replace function public.rule_set_selector(
<<--- skipped by me --- >>>
SQL statement "call public.rebuild_rule_set_selector()"
PL/pgSQL function public.rul_rule_set_trg() line 4 at CALL
processing remote data for replication origin "pg_16401" during "INSERT"
for replication target relation "public.rul_rule_set" in transaction 741 finished at 0/17BE180

With best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

mascmd.sqlapplication/sql; name=mascmd.sqlDownload

repcmd.sqlapplication/sql; name=repcmd.sqlDownload

v1-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-logical-replica-assert-on-func-error.patchDownload

commit 585d0cd944d952f08f7649d02f1d6b6644e93611
Author: Peter Eisentraut <peter@eisentraut.org>
Date:   Sat Aug 20 20:48:47 2022 +0200

    Remove dummyret definition
    
    This hasn't been used in a while (last use removed by 50d22de932, and
    before that 84b6d5f359), and since we are now preferring inline
    functions over complex macros, it's unlikely to be needed again.
    
    Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
    Discussion: https://www.postgresql.org/message-id/flat/7110ab37-8ddd-437f-905c-6aa6205c6185%40enterprisedb.com

diff --git a/src/include/c.h b/src/include/c.h
index 65e91a6b89..dfc366b026 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -333,16 +333,6 @@
 	_61,_62,_63,  N, ...) \
 	(N)
 
-/*
- * dummyret is used to set return values in macros that use ?: to make
- * assignments.  gcc wants these to be void, other compilers like char
- */
-#ifdef __GNUC__					/* GNU cc */
-#define dummyret	void
-#else
-#define dummyret	char
-#endif
-
 /*
  * Generic function pointer.  This can be used in the rare cases where it's
  * necessary to cast a function pointer to a seemingly incompatible function

Anton A. Melnikov

aamelnikov@inbox.ru

over 3 years ago

In reply to: Anton A. Melnikov (#1)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

On 21.08.2022 17:33, Anton A. Melnikov wrote:

Hello!

Here is a fix for the bug first described in:
/messages/by-id/adf0452f-8c6b-7def-d35e-ab516c80088e@inbox.ru

Sorry, there was a wrong patch in the first letter.
Here is a right version.

With best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v2-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-logical-replica-assert-on-func-error.patchDownload

commit 9bff84bda455609820c259bdc47d200bebcba7ab
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Aug 21 18:27:44 2022 +0300

    Fix logical replica crash if there was an error in a function.

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index a9fe45e347..1381fae575 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -1007,8 +1007,9 @@ sql_function_parse_error_callback(void *arg)
  * anonymous-block handler, not only for SQL-language functions.
  * It is assumed that the syntax error position is initially relative to the
  * function body string (as passed in).  If possible, we adjust the position
- * to reference the original command text; if we can't manage that, we set
- * up an "internal query" syntax error instead.
+ * to reference the original command text; if we can't manage that or
+ * can't get the original command text when ActivePortal is not defined,
+ * we set up an "internal query" syntax error instead.
  *
  * Returns true if a syntax error was processed, false if not.
  */
@@ -1016,7 +1017,7 @@ bool
 function_parse_error_transpose(const char *prosrc)
 {
 	int			origerrposition;
-	int			newerrposition;
+	int			newerrposition = 0;
 	const char *queryText;
 
 	/*
@@ -1034,12 +1035,15 @@ function_parse_error_transpose(const char *prosrc)
 			return false;
 	}
 
-	/* We can get the original query text from the active portal (hack...) */
-	Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
-	queryText = ActivePortal->sourceText;
+	if (ActivePortal)
+	{
+		/* We can get the original query text from the active portal (hack...) */
+		Assert(ActivePortal->status == PORTAL_ACTIVE);
+		queryText = ActivePortal->sourceText;
 
-	/* Try to locate the prosrc in the original text */
-	newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+		/* Try to locate the prosrc in the original text */
+		newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+	}
 
 	if (newerrposition > 0)
 	{
@@ -1052,7 +1056,8 @@ function_parse_error_transpose(const char *prosrc)
 	else
 	{
 		/*
-		 * If unsuccessful, convert the position to an internal position
+		 * If unsuccessful or ActivePortal not defined and original command
+		 * text is unreachable, convert the position to an internal position
 		 * marker and give the function text as the internal query.
 		 */
 		errposition(0);

Anton A. Melnikov

aamelnikov@inbox.ru

over 3 years ago

In reply to: Anton A. Melnikov (#2)

2 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

The patch was rebased on current master.
And here is a simplified crash reproduction:
1) On primary with 'wal_level = logical' execute:
CREATE TABLE public.test (id int NOT NULL, val integer);
CREATE PUBLICATION test_pub FOR TABLE test;

2) On replica replace XXXX in the repcmd.sql attached with primary port and execute it:
psql -f repcmd.sql

3) On master execute command:
INSERT INTO test VALUES ('1');

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v3-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-patch; charset=UTF-8; name=v3-0001-Fix-logical-replica-assert-on-func-error.patchDownload

commit 4de66e1b1ffaeaacdd72f3e72789ca05b114476b
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Aug 21 18:27:44 2022 +0300

    Fix logical replica crash if there was an error in a function.

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index a9fe45e347..1381fae575 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -1007,8 +1007,9 @@ sql_function_parse_error_callback(void *arg)
  * anonymous-block handler, not only for SQL-language functions.
  * It is assumed that the syntax error position is initially relative to the
  * function body string (as passed in).  If possible, we adjust the position
- * to reference the original command text; if we can't manage that, we set
- * up an "internal query" syntax error instead.
+ * to reference the original command text; if we can't manage that or
+ * can't get the original command text when ActivePortal is not defined,
+ * we set up an "internal query" syntax error instead.
  *
  * Returns true if a syntax error was processed, false if not.
  */
@@ -1016,7 +1017,7 @@ bool
 function_parse_error_transpose(const char *prosrc)
 {
 	int			origerrposition;
-	int			newerrposition;
+	int			newerrposition = 0;
 	const char *queryText;
 
 	/*
@@ -1034,12 +1035,15 @@ function_parse_error_transpose(const char *prosrc)
 			return false;
 	}
 
-	/* We can get the original query text from the active portal (hack...) */
-	Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
-	queryText = ActivePortal->sourceText;
+	if (ActivePortal)
+	{
+		/* We can get the original query text from the active portal (hack...) */
+		Assert(ActivePortal->status == PORTAL_ACTIVE);
+		queryText = ActivePortal->sourceText;
 
-	/* Try to locate the prosrc in the original text */
-	newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+		/* Try to locate the prosrc in the original text */
+		newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+	}
 
 	if (newerrposition > 0)
 	{
@@ -1052,7 +1056,8 @@ function_parse_error_transpose(const char *prosrc)
 	else
 	{
 		/*
-		 * If unsuccessful, convert the position to an internal position
+		 * If unsuccessful or ActivePortal not defined and original command
+		 * text is unreachable, convert the position to an internal position
 		 * marker and give the function text as the internal query.
 		 */
 		errposition(0);

repcmd.sqlapplication/sql; name=repcmd.sqlDownload

Anton A. Melnikov

aamelnikov@inbox.ru

over 3 years ago

In reply to: Anton A. Melnikov (#3)

2 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

Added a TAP test for this case.

On 30.08.2022 10:09, Anton A. Melnikov wrote:

Hello!

The patch was rebased on current master.
And here is a simplified crash reproduction:
1) On primary with 'wal_level = logical' execute:
CREATE TABLE public.test (id int NOT NULL, val integer);
CREATE PUBLICATION test_pub FOR TABLE test;

2) On replica replace XXXX in the repcmd.sql attached with primary port and execute it:
psql -f repcmd.sql

3) On master execute command:
INSERT INTO test VALUES ('1');

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v4-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-patch; charset=UTF-8; name=v4-0001-Fix-logical-replica-assert-on-func-error.patchDownload

commit d539e1e36ef7af33e1a89eaee00183739c716797
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Aug 21 18:27:44 2022 +0300

    Fix logical replica crash if there was an error in a function.

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index a9fe45e347..1381fae575 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -1007,8 +1007,9 @@ sql_function_parse_error_callback(void *arg)
  * anonymous-block handler, not only for SQL-language functions.
  * It is assumed that the syntax error position is initially relative to the
  * function body string (as passed in).  If possible, we adjust the position
- * to reference the original command text; if we can't manage that, we set
- * up an "internal query" syntax error instead.
+ * to reference the original command text; if we can't manage that or
+ * can't get the original command text when ActivePortal is not defined,
+ * we set up an "internal query" syntax error instead.
  *
  * Returns true if a syntax error was processed, false if not.
  */
@@ -1016,7 +1017,7 @@ bool
 function_parse_error_transpose(const char *prosrc)
 {
 	int			origerrposition;
-	int			newerrposition;
+	int			newerrposition = 0;
 	const char *queryText;
 
 	/*
@@ -1034,12 +1035,15 @@ function_parse_error_transpose(const char *prosrc)
 			return false;
 	}
 
-	/* We can get the original query text from the active portal (hack...) */
-	Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
-	queryText = ActivePortal->sourceText;
+	if (ActivePortal)
+	{
+		/* We can get the original query text from the active portal (hack...) */
+		Assert(ActivePortal->status == PORTAL_ACTIVE);
+		queryText = ActivePortal->sourceText;
 
-	/* Try to locate the prosrc in the original text */
-	newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+		/* Try to locate the prosrc in the original text */
+		newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+	}
 
 	if (newerrposition > 0)
 	{
@@ -1052,7 +1056,8 @@ function_parse_error_transpose(const char *prosrc)
 	else
 	{
 		/*
-		 * If unsuccessful, convert the position to an internal position
+		 * If unsuccessful or ActivePortal not defined and original command
+		 * text is unreachable, convert the position to an internal position
 		 * marker and give the function text as the internal query.
 		 */
 		errposition(0);
diff --git a/src/test/recovery/t/034_logical_replica_on_error.pl b/src/test/recovery/t/034_logical_replica_on_error.pl
new file mode 100644
index 0000000000..380ad74590
--- /dev/null
+++ b/src/test/recovery/t/034_logical_replica_on_error.pl
@@ -0,0 +1,105 @@
+# Copyright (c) 2022, Postgres Professional
+
+# There was an assertion if function text contains an error. See PGPRO-7025
+# Тhis file has a prefix (120_) to prevent prefix collision with
+# upstream test files.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 2;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test (id int NOT NULL, val integer);');
+
+$node_primary->safe_psql('postgres',
+	'CREATE PUBLICATION test_pub FOR TABLE test;');
+
+# Initialize logical replica node
+my $node_replica = PostgreSQL::Test::Cluster->new('replica');
+$node_replica->init(has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_replica->safe_psql('postgres',
+	'CREATE TABLE test (id int NOT NULL, val integer);');
+
+$node_replica->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.test as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+});
+
+$node_replica->safe_psql('postgres', '
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+');
+
+$node_replica->safe_psql('postgres',
+	'create trigger test_trigger after insert or update or delete on test for each row execute function test_trg();');
+
+$node_replica->safe_psql('postgres',
+	'alter table test enable replica trigger test_trigger;');
+
+my $primary_port   = $node_primary->port;
+my $conn_params = "port=" . $primary_port . " dbname=postgres";
+
+my ($stdin, $stdout, $stderr) = $node_replica->psql('postgres',
+	qq[CREATE SUBSCRIPTION test_sub CONNECTION \'$conn_params\' PUBLICATION test_pub;]
+);
+
+ok($stderr =~ /NOTICE:  created replication slot \"test_sub\" on publisher/,
+	"Logical decoding correctly fails on function error");
+
+my $worker_pid =
+	  $node_replica->safe_psql('postgres', q{
+		select pid from pg_stat_subscription where subname = 'test_sub' limit 1;
+	});
+
+$node_primary->safe_psql('postgres', q{
+	INSERT INTO test VALUES ('1');
+	});
+
+$node_replica->poll_query_until('postgres', qq[
+		select $worker_pid != any(array_agg(pid)) as ready
+			from pg_stat_subscription where subname = \'test_sub\';
+	])
+  or die "Timed out while waiting for new WAL with error from primary";
+
+$node_replica->stop;
+$node_primary->stop;
+
+my $log = $node_replica->logfile();
+
+my $log_contents = slurp_file($log);
+
+ok($log_contents =~ /ERROR:  relation \"error_name\" does not exist at character/,
+	"Logical decoding correctly fails on function error");

repcmd.sqlapplication/sql; name=repcmd.sqlDownload

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Anton A. Melnikov (#4)

Re: [BUG] Logical replica crash if there was an error in a function.

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

[ v4-0001-Fix-logical-replica-assert-on-func-error.patch ]

I took a quick look at this. I think you're solving the
problem in the wrong place. The real issue is why are
we not setting up ActivePortal correctly when running
user-defined code in a logrep worker? There is other code
that expects that to be set, eg EnsurePortalSnapshotExists.

regards, tom lane

Anton A. Melnikov

aamelnikov@inbox.ru

over 3 years ago

In reply to: Tom Lane (#5)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

Thanks for reply!

On 24.09.2022 20:27, Tom Lane wrote:

I think you're solving the
problem in the wrong place. The real issue is why are
we not setting up ActivePortal correctly when running
user-defined code in a logrep worker?

During a common query from the backend ActivePortal becomes defined
in the PortalRun and then AfterTriggerEndQuery starts with
non-NULL ActivePortal after ExecutorFinish.
When the logrep worker is applying messages there are neither
PortalStart nor PortalRun calls. And AfterTriggerEndQuery starts
with undefined ActivePortal after finish-edata().
May be it's normal behavior?

There is other code
that expects that to be set, eg EnsurePortalSnapshotExists.

When the logrep worker is applying message it doesn't have to use
ActivePortal in EnsurePortalSnapshotExists because ActiveSnapshot is already installed.
It is set at the beginning of each transaction in the begin_replication_step call.

On the other hand, function_parse_error_transpose() tries to get
the original query text (INSERT INTO test VALUES ('1') in our case) from
the ActivePortal to clarify the location of the error.
But in the logrep worker there is no way to restore original query text
from the logrep message. There is only 'zipped' query equivalent to the original.
So any function_parse_error_transpose() call seems to be useless
in the logrep worker.

And it looks like we can simply omit match_prosrc_to_query() call there.
The attached patch does it.

Best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v5-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-patch; charset=UTF-8; name=v5-0001-Fix-logical-replica-assert-on-func-error.patchDownload

commit bfaa02ac7a7cbeb793747be71962a7799c60c21c
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Oct 9 11:56:20 2022 +0300

    Fix logical replica crash if there was an error in a user function.

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index a9fe45e347..1e8f1b1097 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -36,6 +36,7 @@
 #include "parser/parse_coerce.h"
 #include "parser/parse_type.h"
 #include "pgstat.h"
+#include "replication/logicalworker.h"
 #include "rewrite/rewriteHandler.h"
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
@@ -1034,12 +1035,20 @@ function_parse_error_transpose(const char *prosrc)
 			return false;
 	}
 
-	/* We can get the original query text from the active portal (hack...) */
-	Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
-	queryText = ActivePortal->sourceText;
+	/*
+	 * In the logical replication worker there is no way to restore original
+	 * query text from the logical replication message. There is only 'zipped'
+	 * query equivalent to the original text.
+	 */
+	if (!IsLogicalWorker())
+	{
+		/* We can get the original query text from the active portal (hack...) */
+		Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
+		queryText = ActivePortal->sourceText;
 
-	/* Try to locate the prosrc in the original text */
-	newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+		/* Try to locate the prosrc in the original text */
+		newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+	}
 
 	if (newerrposition > 0)
 	{

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Tom Lane (#5)

Re: [BUG] Logical replica crash if there was an error in a function.

On 2022-Sep-24, Tom Lane wrote:

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

[ v4-0001-Fix-logical-replica-assert-on-func-error.patch ]

I took a quick look at this. I think you're solving the
problem in the wrong place. The real issue is why are
we not setting up ActivePortal correctly when running
user-defined code in a logrep worker? There is other code
that expects that to be set, eg EnsurePortalSnapshotExists.

Right ... mostly, the logical replication *does* attempt to set up a
transaction and active snapshot when replaying actions (c.f.
begin_replication_step()). Is this firing at an inappropriate time,
perhaps?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Anton A. Melnikov (#6)

Re: [BUG] Logical replica crash if there was an error in a function.

On Sun, Oct 09, 2022 at 12:24:23PM +0300, Anton A. Melnikov wrote:

On the other hand, function_parse_error_transpose() tries to get
the original query text (INSERT INTO test VALUES ('1') in our case) from
the ActivePortal to clarify the location of the error.
But in the logrep worker there is no way to restore original query text
from the logrep message. There is only 'zipped' query equivalent to the original.
So any function_parse_error_transpose() call seems to be useless
in the logrep worker.

Yeah, the query string is not available in this context, but it surely
looks wrong to me to assume that something as low-level as
function_parse_error_transpose() needs to be updated for the sake of a
logical worker, while we have other areas that would expect a portal
to be set up.
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Michael Paquier (#8)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Michael Paquier <michael@paquier.xyz> writes:

Yeah, the query string is not available in this context, but it surely
looks wrong to me to assume that something as low-level as
function_parse_error_transpose() needs to be updated for the sake of a
logical worker, while we have other areas that would expect a portal
to be set up.

After thinking about this some more, I'm withdrawing my opposition to
fixing it by making function_parse_error_transpose() cope with not
having an active portal. I have a few reasons:

* A Portal is intended to contain an executor state. While worker.c
does fake up an EState, there's certainly no plan tree or planstate tree,
and I doubt it'd be sane to create dummy ones. So even if we made a
Portal, it'd be lacking a lot of the stuff one would expect to find there.
I fear that moving the cause of this sort of problem from "there's no
ActivePortal" to "there's an ActivePortal but it lacks field X" wouldn't
be an improvement.

* There is actually just one other consumer of ActivePortal,
namely EnsurePortalSnapshotExists, and that doesn't offer a lot of
support for the idea that ActivePortal must always be set. It says

* Nothing to do if a snapshot is set. (We take it on faith that the
* outermost active snapshot belongs to some Portal; or if there is no
* Portal, it's somebody else's responsibility to manage things.)

and "it's somebody else's responsibility" summarizes the situation
here pretty perfectly. worker.c *does* set up an active snapshot.

* The comment in function_parse_error_transpose() freely admits that
looking at the ActivePortal is a hack. It works, more or less, for
the intended case of reporting a function-body syntax error nicely
during CREATE FUNCTION. But it's capable of getting false-positive
matches, so maybe someday we should replace it with something more
bulletproof.

* There isn't any strong reason why function_parse_error_transpose()
has to succeed at finding the original query text. Its fallback
approach of treating the syntax error position as internal to the
function body text is fine, in fact it's just what we want here.

So I'm now good with the idea of just not failing. I don't like
the patch as presented though. First, the cfbot is quite rightly
complaining about the "uninitialized variable" warning it draws.
Second, I don't see a good reason to tie the change to logical
replication in any way. Let's just change the Assert to an if(),
as attached.

regards, tom lane

Attachments:

v6-0001-Fix-logical-replica-assert-on-func-error.patchtext/x-diff; charset=us-ascii; name=v6-0001-Fix-logical-replica-assert-on-func-error.patchDownload

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index a9fe45e347..e03b98bcd2 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -1017,7 +1017,6 @@ function_parse_error_transpose(const char *prosrc)
 {
 	int			origerrposition;
 	int			newerrposition;
-	const char *queryText;
 
 	/*
 	 * Nothing to do unless we are dealing with a syntax error that has a
@@ -1035,11 +1034,22 @@ function_parse_error_transpose(const char *prosrc)
 	}
 
 	/* We can get the original query text from the active portal (hack...) */
-	Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE);
-	queryText = ActivePortal->sourceText;
+	if (ActivePortal && ActivePortal->status == PORTAL_ACTIVE)
+	{
+		const char *queryText = ActivePortal->sourceText;
 
-	/* Try to locate the prosrc in the original text */
-	newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition);
+		/* Try to locate the prosrc in the original text */
+		newerrposition = match_prosrc_to_query(prosrc, queryText,
+											   origerrposition);
+	}
+	else
+	{
+		/*
+		 * Quietly give up if no ActivePortal.  This is an unusual situation
+		 * but it can happen in, e.g., logical replication workers.
+		 */
+		newerrposition = -1;
+	}
 
 	if (newerrposition > 0)
 	{

#10

Amit Kapila

amit.kapila16@gmail.com

about 3 years ago

In reply to: Tom Lane (#9)

Re: [BUG] Logical replica crash if there was an error in a function.

On Wed, Nov 2, 2022 at 11:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So I'm now good with the idea of just not failing. I don't like
the patch as presented though. First, the cfbot is quite rightly
complaining about the "uninitialized variable" warning it draws.
Second, I don't see a good reason to tie the change to logical
replication in any way. Let's just change the Assert to an if(),
as attached.

LGTM. I don't know if it is a good idea to omit the test case for this
scenario. If required, we can reuse the test case from Sawada-San's
patch in the email [1]/messages/by-id/CAD21AoDKA+MB4M9BOnct_=Zj5bNHbkYn6oKZ2aOQp8m=3x2GhQ@mail.gmail.com.

[1]: /messages/by-id/CAD21AoDKA+MB4M9BOnct_=Zj5bNHbkYn6oKZ2aOQp8m=3x2GhQ@mail.gmail.com

--
With Regards,
Amit Kapila.

#11

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Amit Kapila (#10)

Re: [BUG] Logical replica crash if there was an error in a function.

Amit Kapila <amit.kapila16@gmail.com> writes:

LGTM. I don't know if it is a good idea to omit the test case for this
scenario. If required, we can reuse the test case from Sawada-San's
patch in the email [1].

I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything. If we were just adding a
query or two to an existing scenario, that could be okay; but spinning
up and syncing a whole new primary and standby database is *expensive*
when you multiply it by the number of times developers and buildfarm
animals are going to run the tests in the future.

There's also the little issue that I'm not sure it would actually
detect a problem if we had one. The case is going to fail, and
what we want to know is just how messily it fails, and I think the
TAP infrastructure isn't very sensitive to that ... especially
if the test isn't even checking for specific error messages.

regards, tom lane

#12

Anton A. Melnikov

aamelnikov@inbox.ru

about 3 years ago

In reply to: Tom Lane (#11)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

On 02.11.2022 21:02, Tom Lane wrote:

So I'm now good with the idea of just not failing. I don't like
the patch as presented though. First, the cfbot is quite rightly
complaining about the "uninitialized variable" warning it draws.
Second, I don't see a good reason to tie the change to logical
replication in any way. Let's just change the Assert to an if(),
as attached.

Fully agreed that is the most optimal solution for that case. Thanks!
Surely it's very rare one but there was a real segfault at production server.
Someone came up with the idea to modify function like public.test_selector()
in repcmd.sql (see above) on the fly with adding to it :last_modification:
field from current time and some other parameters with the help of replace()
inside the creation of the rebuild_test() procedure.

On 03.11.2022 18:29, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

LGTM. I don't know if it is a good idea to omit the test case for this
scenario. If required, we can reuse the test case from Sawada-San's
patch in the email [1].

I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything. If we were just adding a
query or two to an existing scenario, that could be okay; but spinning
up and syncing a whole new primary and standby database is *expensive*
when you multiply it by the number of times developers and buildfarm
animals are going to run the tests in the future.

There's also the little issue that I'm not sure it would actually
detect a problem if we had one. The case is going to fail, and
what we want to know is just how messily it fails, and I think the
TAP infrastructure isn't very sensitive to that ... especially
if the test isn't even checking for specific error messages.

Seems it is possible to do a test without these remarks. The attached
test uses existing nodes and checks the specific error message. Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.

Would be glad for comments and remarks.

With best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v1-0001-Test-for-func-error-in-logrep-worker.patchtext/x-patch; charset=UTF-8; name=v1-0001-Test-for-func-error-in-logrep-worker.patchDownload

commit 722fa6d8c629eb83b3d11ea88b49bab1f700b48d
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Mon Nov 14 08:30:26 2022 +0300

    Add test for syntax error in the function in a a logical replication
    worker and combine some test nodes.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6247aa7730..76a6c12cae 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,9 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
@@ -81,9 +84,8 @@ $node_subscriber->stop('fast');
 # identity set before accepting updates.  If it did not it would cause
 # an error when an update was attempted.
 
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher2');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub FOR ALL TABLES");
@@ -102,6 +104,9 @@ is( $node_publisher->psql(
 	'update to unlogged table without replica identity with FOR ALL TABLES publication'
 );
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 
 # Bug #16643 - https://postgr.es/m/16643-eaadeb2a1a58d28c@postgresql.org
@@ -226,13 +231,13 @@ $node_sub->stop('fast');
 # target table's relcache was not being invalidated. This leads to skipping
 # UPDATE/DELETE operations during apply on the subscriber side as the columns
 # required to search corresponding rows won't get logged.
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher3');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
 
-$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber3');
-$node_subscriber->init(allows_streaming => 'logical');
-$node_subscriber->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
+
+$node_subscriber->rotate_logfile();
+$node_subscriber->rotate_logfile(); # call twice to synchronize lognames between master and replica
+$node_subscriber->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE TABLE tab_replidentity_index(a int not null, b int not null)");
@@ -296,7 +301,89 @@ is( $node_subscriber->safe_psql(
 	qq(-1|1),
 	"update works with REPLICA IDENTITY");
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_publisher->rotate_logfile();
+$node_publisher->start();
+
+$node_subscriber->rotate_logfile();
+$node_subscriber->start();
+
+$node_publisher->safe_psql('postgres',
+	'CREATE TABLE public.test (id int NOT NULL, val integer);');
+
+$node_publisher->safe_psql('postgres',
+	'CREATE PUBLICATION test_pub FOR TABLE test;');
+
+$node_subscriber->safe_psql('postgres',
+	'CREATE TABLE test (id int NOT NULL, val integer);');
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.test as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+});
+
+$node_subscriber->safe_psql('postgres', '
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+');
+
+$node_subscriber->safe_psql('postgres',
+	'create trigger test_trigger after insert or update or delete on test for each row execute function test_trg();');
+
+$node_subscriber->safe_psql('postgres',
+	'alter table test enable replica trigger test_trigger;');
+
+my $primary_port   = $node_publisher->port;
+my $conn_params = "port=" . $primary_port . " dbname=postgres";
+
+my ($stdin, $stdout, $stderr) = $node_subscriber->psql('postgres',
+	qq[CREATE SUBSCRIPTION test_sub CONNECTION \'$conn_params\' PUBLICATION test_pub;]
+);
+
+$node_publisher->wait_for_catchup('test_sub');
+
+ok($stderr =~ /NOTICE:  created replication slot \"test_sub\" on publisher/,
+	"ERROR: Subscription on replica or replication slot on publisher weren't created.");
+
+$node_publisher->safe_psql('postgres', q{
+	INSERT INTO test VALUES ('1');
+	});
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 done_testing();

#13

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Anton A. Melnikov (#12)

Re: [BUG] Logical replica crash if there was an error in a function.

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

On 02.11.2022 21:02, Tom Lane wrote:

I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything.

Seems it is possible to do a test without these remarks. The attached
test uses existing nodes and checks the specific error message.

My opinion remains unchanged: this would be a very poor use of test
cycles.

Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.

Optimizing existing tests isn't an answer to that. We could
install those optimizations without adding a new test case.

regards, tom lane

#14

Anton A. Melnikov

aamelnikov@inbox.ru

about 3 years ago

In reply to: Tom Lane (#13)

2 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Thanks a lot for the fast reply!

On 03.11.2022 18:29, Tom Lane wrote:

If we were just adding a
query or two to an existing scenario, that could be okay; but spinning
up and syncing a whole new primary and standby database is *expensive*
when you multiply it by the number of times developers and buildfarm
animals are going to run the tests in the future.

On 15.11.2022 04:59, Tom Lane wrote:

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

On 02.11.2022 21:02, Tom Lane wrote:

I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything.

Seems it is possible to do a test without these remarks. The attached
test uses existing nodes and checks the specific error message.

My opinion remains unchanged: this would be a very poor use of test
cycles.

Sorry, i didn't fully understand what is required and
added some functions to the test that spend extra cpu time. But i found
that it is possible to make a test according to previous remarks by adding
only a few extra queries to an existent test without any additional syncing.

Experimentally, i could not observe any significant difference in
CPU usage due to this test addition.
The difference in the CPU usage was less than standard error.
See 100_bugs-CPU-time.txt attached.

Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.

Optimizing existing tests isn't an answer to that. We could
install those optimizations without adding a new test case.

Oh sure! Here is a separate patch for this optimization:
/messages/by-id/eb7aa992-c2d7-6ce7-4942-0c784231a362@inbox.ru
On the contrary with previous case in that one the difference in the CPU usage
during 100_bugs.pl is essential; it decreases approximately by 30%.

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v2-0001-Test-for-func-error-in-logrep-worker.patchtext/x-patch; charset=UTF-8; name=v2-0001-Test-for-func-error-in-logrep-worker.patchDownload

commit 2449158ba576d7c6d97852d14f85dadb3aced262
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Wed Nov 16 11:46:54 2022 +0300

    Add test for syntax error in the function in a a logical replication
    worker. See dea834938.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6247aa7730..eb4ab6d359 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.tab1 as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+	create trigger test_trigger after insert on tab1 for each row
+								execute function test_trg();
+	alter table tab1 enable replica trigger test_trigger;
+});
+
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');

100_bugs-CPU-time.txttext/plain; charset=UTF-8; name=100_bugs-CPU-time.txtDownload

#15

Andres Freund

andres@anarazel.de

about 3 years ago

In reply to: Anton A. Melnikov (#14)

Re: [BUG] Logical replica crash if there was an error in a function.

Hi,

On 2022-11-16 17:52:50 +0300, Anton A. Melnikov wrote:

Sorry, i didn't fully understand what is required and
added some functions to the test that spend extra cpu time. But i found
that it is possible to make a test according to previous remarks by adding
only a few extra queries to an existent test without any additional syncing.

Experimentally, i could not observe any significant difference in
CPU usage due to this test addition.
The difference in the CPU usage was less than standard error.
See 100_bugs-CPU-time.txt attached.

Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.

Optimizing existing tests isn't an answer to that. We could
install those optimizations without adding a new test case.

Oh sure! Here is a separate patch for this optimization:
/messages/by-id/eb7aa992-c2d7-6ce7-4942-0c784231a362@inbox.ru
On the contrary with previous case in that one the difference in the CPU usage
during 100_bugs.pl is essential; it decreases approximately by 30%.

This CF entry causes tests to fail on all platforms:
https://cirrus-ci.com/build/5755408111894528

E.g.
https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs

#### Begin standard error
psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher
#### End standard error
timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.pl line 115.

Greetings,

Andres Freund

#16

Anton A. Melnikov

aamelnikov@inbox.ru

about 3 years ago

In reply to: Andres Freund (#15)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

On 07.12.2022 21:03, Andres Freund wrote:

This CF entry causes tests to fail on all platforms:
https://cirrus-ci.com/build/5755408111894528

E.g.
https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs

#### Begin standard error
psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher
#### End standard error
timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.pl line 115.

Greetings,

Andres Freund

Thank you for reminding!

There was a conflict when applying v2 on current master.
Rebased v3 is attached.

Best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v3-0001-Test-for-func-error-in-logrep-worker.patchtext/x-patch; charset=UTF-8; name=v3-0001-Test-for-func-error-in-logrep-worker.patchDownload

commit 07eaa674953ed700a53174410a6e1eb81151d7e8
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Dec 11 06:26:03 2022 +0300

    Add test for syntax error in the function in a a logical replication
        worker. See dea834938.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 7b3cd66be5..0bf4fdfa47 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.tab1 as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+	create trigger test_trigger after insert on tab1 for each row
+								execute function test_trg();
+	alter table tab1 enable replica trigger test_trigger;
+});
+
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 # We'll re-use these nodes below, so drop their replication state.
 # We don't bother to drop the tables though.
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");

#17

vignesh C

vignesh21@gmail.com

about 3 years ago

In reply to: Anton A. Melnikov (#16)

Re: [BUG] Logical replica crash if there was an error in a function.

On Sun, 11 Dec 2022 at 09:21, Anton A. Melnikov <aamelnikov@inbox.ru> wrote:

On 07.12.2022 21:03, Andres Freund wrote:

This CF entry causes tests to fail on all platforms:
https://cirrus-ci.com/build/5755408111894528

E.g.
https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs

#### Begin standard error
psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher
#### End standard error
timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.pl line 115.

Greetings,

Andres Freund

Thank you for reminding!

There was a conflict when applying v2 on current master.
Rebased v3 is attached.

Few suggestions:
1) There is a warning:
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+       "ERROR:  relation \"error_name\" does not exist at character"
+);

"my" variable $result masks earlier declaration in same scope at
t/100_bugs.pl line 400.

You can change:
my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
to
$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");

2) Now that the crash is fixed, you could change it to a better message:
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+       "ERROR:  relation \"error_name\" does not exist at character"
+);

Regards,
Vignesh

#18

Anton A. Melnikov

aamelnikov@inbox.ru

about 3 years ago

In reply to: vignesh C (#17)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Thanks for your remarks.

On 07.01.2023 15:27, vignesh C wrote:

Few suggestions:
1) There is a warning:
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+       "ERROR:  relation \"error_name\" does not exist at character"
+);
"my" variable $result masks earlier declaration in same scope at
t/100_bugs.pl line 400.

You can change:
my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
to
$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");

The reason is that the patch fell behind the master.
Fixed in v4 together with rebasing on current master.

2) Now that the crash is fixed, you could change it to a better message:
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+       "ERROR:  relation \"error_name\" does not exist at character"
+);

Tried to make this comment more clear.

Best wishes for the new year!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v4-0001-Test-for-func-error-in-logrep-worker.patchtext/x-patch; charset=UTF-8; name=v4-0001-Test-for-func-error-in-logrep-worker.patchDownload

commit 16beb574708e0e2e993eb2fa1b093be97b8e53cf
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Dec 11 06:26:03 2022 +0300

    Add test for syntax error in the function in a a logical replication
        worker. See dea834938.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 036576acab..b241a7d498 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.tab1 as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+	create trigger test_trigger after insert on tab1 for each row
+								execute function test_trg();
+	alter table tab1 enable replica trigger test_trigger;
+});
+
+# This command will cause a crash on the replica if that bug hasn't been fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 # We'll re-use these nodes below, so drop their replication state.
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
@@ -352,7 +397,7 @@ $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub_sch');
 
 # Subscriber's sch1.t1 should receive the row inserted into the new sch1.t1,
 # but not the row inserted into the old sch1.t1 post-rename.
-my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
 is( $result, qq(1
 2), 'check data in subscriber sch1.t1 after schema rename');

#19

Anton A. Melnikov

aamelnikov@inbox.ru

almost 3 years ago

In reply to: Anton A. Melnikov (#18)

1 attachment(s)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

On 15.03.2023 21:29, Gregory Stark (as CFM) wrote:

These patches that are "Needs Review" and have received no comments at
all since before March 1st are these. If your patch is amongst this
list I would suggest any of:

1) Move it yourself to the next CF (or withdraw it)
2) Post to the list with any pending questions asking for specific
feedback -- it's much more likely to get feedback than just a generic
"here's a patch plz review"...
3) Mark it Ready for Committer and possibly post explaining the
resolution to any earlier questions to make it easier for a committer
to understand the state

There were some remarks:
1) very poor use of test cycles (by Tom Lane)
Solved in v2 by adding few extra queries to an existent test without any additional syncing.
2) the patch-tester fails on all platforms (by Andres Freund)
Fixed in v3.
3) warning with "my" variable $result and suggestion to correct comment (by vignesh C)
Both fixed in v4.

Now there are no any pending questions, so moved it to RFC.

With the best regards!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v4-0001-Test-for-func-error-in-logrep-worker.patchtext/x-patch; charset=UTF-8; name=v4-0001-Test-for-func-error-in-logrep-worker.patchDownload

commit 16beb574708e0e2e993eb2fa1b093be97b8e53cf
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Date:   Sun Dec 11 06:26:03 2022 +0300

    Add test for syntax error in the function in a a logical replication
        worker. See dea834938.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 036576acab..b241a7d498 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.tab1 as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+	create trigger test_trigger after insert on tab1 for each row
+								execute function test_trg();
+	alter table tab1 enable replica trigger test_trigger;
+});
+
+# This command will cause a crash on the replica if that bug hasn't been fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 # We'll re-use these nodes below, so drop their replication state.
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
@@ -352,7 +397,7 @@ $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub_sch');
 
 # Subscriber's sch1.t1 should receive the row inserted into the new sch1.t1,
 # but not the row inserted into the old sch1.t1 post-rename.
-my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
 is( $result, qq(1
 2), 'check data in subscriber sch1.t1 after schema rename');

#20

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Anton A. Melnikov (#19)

Re: [BUG] Logical replica crash if there was an error in a function.

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

Now there are no any pending questions, so moved it to RFC.

I did not think this case was worth memorializing in a test before,
and I still do not. I'm inclined to reject this patch.

regards, tom lane

#21

Anton A. Melnikov

aamelnikov@inbox.ru

almost 3 years ago

In reply to: Tom Lane (#20)

Re: [BUG] Logical replica crash if there was an error in a function.

Hello!

On 03.04.2023 21:49, Tom Lane wrote:

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

Now there are no any pending questions, so moved it to RFC.

I did not think this case was worth memorializing in a test before,
and I still do not. I'm inclined to reject this patch.

Earlier, when discussing this test, there was a suggestion like this:

If we were just adding a
query or two to an existing scenario, that could be okay;

The current version of the test seems to be satisfies this condition.
The queries added do not affect the total test time within the measurement error.
This case is rare, of cause, but it really took place in practice.

So either there are some more reasons why this test should not be accepted that
i do not understand, or i misunderstood something from the above.

Could you help me to figure out, please.

Would be very grateful.

Sincerely yours,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#22

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Anton A. Melnikov (#21)

Re: [BUG] Logical replica crash if there was an error in a function.

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

On 03.04.2023 21:49, Tom Lane wrote:

I did not think this case was worth memorializing in a test before,
and I still do not. I'm inclined to reject this patch.

Could you help me to figure out, please.

The problem was an Assert that was speculative when it went in,
and which we eventually found was wrong in the context of logical
replication. We removed the Assert. I don't think we need a test
case to keep us from putting back the Assert. That line of thinking
leads to test suites that run for fourteen hours and are near useless
because developers can't run them easily.

regards, tom lane

#23

Anton A. Melnikov

aamelnikov@inbox.ru

almost 3 years ago

In reply to: Tom Lane (#22)

Re: [BUG] Logical replica crash if there was an error in a function.

On 05.04.2023 17:35, Tom Lane wrote:

"Anton A. Melnikov" <aamelnikov@inbox.ru> writes:

On 03.04.2023 21:49, Tom Lane wrote:

I did not think this case was worth memorializing in a test before,
and I still do not. I'm inclined to reject this patch.

Could you help me to figure out, please.

The problem was an Assert that was speculative when it went in,
and which we eventually found was wrong in the context of logical
replication. We removed the Assert. I don't think we need a test
case to keep us from putting back the Assert. That line of thinking
leads to test suites that run for fourteen hours and are near useless
because developers can't run them easily.

regards, tom lane

Ok, i understand! Thanks a lot for the clarification. A rather important point,
i'll take it into account for the future.
Let's do that. Revoked the patch.

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company