Continuing instability in insert-conflict-specconflict test

Started by Tom Laneover 5 years ago15 messages

tgl@sss.pgh.pa.us

over 5 years ago

1 attachment(s)

We've seen repeated failures in the tests added by commit 43e084197:

I dug into this a bit today, and found that I can reproduce the failure
reliably by adding a short delay in the right place, as attached.

However, after studying the test awhile I have to admit that I do not
understand why all these failures look the same, because it seems to
me that this test is a house of cards. It *repeatedly* expects that
issuing a command to session X will result in session Y reporting
some notice before X's command terminates, even though X's command will
never meet the conditions for isolationtester to think it's blocked.
AFAICS that is nothing but wishful thinking. Why is it that only one of
those places has failed so far?

regards, tom lane

Attachments:

sleep-in-specconflict-test.patchtext/x-diff; charset=us-ascii; name=sleep-in-specconflict-test.patchDownload

diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec b/src/test/isolation/specs/insert-conflict-specconflict.spec
index 6028397491..92459047b6 100644
--- a/src/test/isolation/specs/insert-conflict-specconflict.spec
+++ b/src/test/isolation/specs/insert-conflict-specconflict.spec
@@ -28,6 +28,7 @@ setup
         RAISE NOTICE 'blurt_and_lock_4() called for % in session %', $1, current_setting('spec.session')::int;
         RAISE NOTICE 'acquiring advisory lock on 4';
         PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 4);
+        PERFORM pg_sleep(0.1);
     RETURN $1;
     END;$$;

Andres Freund

andres@anarazel.de

over 5 years ago

In reply to: Tom Lane (#1)

Re: Continuing instability in insert-conflict-specconflict test

Hi,

On 2020-08-23 22:53:18 -0400, Tom Lane wrote:

We've seen repeated failures in the tests added by commit 43e084197:
...
I dug into this a bit today, and found that I can reproduce the failure
reliably by adding a short delay in the right place, as attached.

However, after studying the test awhile I have to admit that I do not
understand why all these failures look the same, because it seems to
me that this test is a house of cards. It *repeatedly* expects that
issuing a command to session X will result in session Y reporting
some notice before X's command terminates, even though X's command will
never meet the conditions for isolationtester to think it's blocked.

AFAICS that is nothing but wishful thinking. Why is it that only one of
those places has failed so far?

Are there really that many places expecting that? I've not gone through
this again exhaustively by any means, but most places seem to print
something only before waiting for a lock.

This test is really hairy, which isn't great. But until we have a proper
framework for controlling server side execution, I am not sure how we
better can achieve test coverage for this stuff. And there've been bugs,
so it's worth testing.

Greetings,

Andres Freund

andres@anarazel.de

over 5 years ago

In reply to: Andres Freund (#2)

Re: Continuing instability in insert-conflict-specconflict test

On 2020-08-24 13:42:35 -0700, Andres Freund wrote:

Hi,

On 2020-08-23 22:53:18 -0400, Tom Lane wrote:

We've seen repeated failures in the tests added by commit 43e084197:
...
I dug into this a bit today, and found that I can reproduce the failure
reliably by adding a short delay in the right place, as attached.

However, after studying the test awhile I have to admit that I do not
understand why all these failures look the same, because it seems to
me that this test is a house of cards. It *repeatedly* expects that
issuing a command to session X will result in session Y reporting
some notice before X's command terminates, even though X's command will
never meet the conditions for isolationtester to think it's blocked.

AFAICS that is nothing but wishful thinking. Why is it that only one of
those places has failed so far?

Are there really that many places expecting that? I've not gone through
this again exhaustively by any means, but most places seem to print
something only before waiting for a lock.

ISTM the issue at hand isn't so much that X expects something to be
printed by Y before it terminates, but that it expects the next step to
not be executed before Y unlocks. If I understand the wrong output
correctly, what happens is that "controller_print_speculative_locks" is
executed, even though s1 hasn't yet acquired the next lock. Note how the
+s1: NOTICE:  blurt_and_lock_123() called for k1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
is *after* "step controller_print_speculative_locks", not just after
"step s2_upsert: <... completed>"

To be clear, there'd still be an issue about whether the NOTICE is
printed before/after the "s2_upsert: <... completed>", but it looks to
me the issue is bigger than that. It's easy enough to add another wait
in s2_upsert, but that doesn't help if the entire scheduling just
continues regardless of there not really being a waiter.

Am I missing something here?

Greetings,

Andres Freund

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Andres Freund (#3)

Re: Continuing instability in insert-conflict-specconflict test

Andres Freund <andres@anarazel.de> writes:

ISTM the issue at hand isn't so much that X expects something to be
printed by Y before it terminates, but that it expects the next step to
not be executed before Y unlocks. If I understand the wrong output
correctly, what happens is that "controller_print_speculative_locks" is
executed, even though s1 hasn't yet acquired the next lock.

That's one way to look at it perhaps.

I've spent the day fooling around with a re-implementation of
isolationtester that waits for all its controlled sessions to quiesce
(either wait for client input, or block on a lock held by another
session) before moving on to the next step. That was not a feasible
approach before we had the wait_event infrastructure, but it's
seeming like it might be workable now. Still have a few issues to
sort out though ...

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Tom Lane (#4)

Re: Continuing instability in insert-conflict-specconflict test

I wrote:

I've spent the day fooling around with a re-implementation of
isolationtester that waits for all its controlled sessions to quiesce
(either wait for client input, or block on a lock held by another
session) before moving on to the next step. That was not a feasible
approach before we had the wait_event infrastructure, but it's
seeming like it might be workable now. Still have a few issues to
sort out though ...

I wasted a good deal of time on this idea, and eventually concluded
that it's a dead end, because there is an unremovable race condition.
Namely, that even if the isolationtester's observer backend has
observed that test session X has quiesced according to its
wait_event_info, it is possible for the report of that fact to arrive
at the isolationtester client process before test session X's output
does.

It's quite obvious how that might happen if the isolationtester is
on a different machine than the PG server --- just imagine a dropped
packet in X's output that has to be retransmitted. You might think
it shouldn't happen within a single machine, but I'm seeing results
that I cannot explain any other way (on an 8-core RHEL8 box).
It appears to not be particularly rare, either.

Andres Freund <andres@anarazel.de> writes:

ISTM the issue at hand isn't so much that X expects something to be
printed by Y before it terminates, but that it expects the next step to
not be executed before Y unlocks. If I understand the wrong output
correctly, what happens is that "controller_print_speculative_locks" is
executed, even though s1 hasn't yet acquired the next lock.

The problem as I'm now understanding it is that
insert-conflict-specconflict.spec issues multiple commands in sequence
to its "controller" session, and expects that NOTICE outputs from a
different test session will arrive at a determinate point in that
sequence. In practice that's not guaranteed, because (a) the other
test session might not send the NOTICE soon enough --- as my modified
specfile proves --- and (b) even if the NOTICE is timely sent, the
kernel will not guarantee timely receipt. We could fix (a) by
introducing some explicit interlock between the controller and test
sessions, but (b) is a killer.

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.

regards, tom lane

Andrew Dunstan

andrew.dunstan@2ndquadrant.com

over 5 years ago

In reply to: Andres Freund (#2)

Re: Continuing instability in insert-conflict-specconflict test

On 8/24/20 4:42 PM, Andres Freund wrote:

This test is really hairy, which isn't great. But until we have a proper
framework for controlling server side execution, I am not sure how we
better can achieve test coverage for this stuff. And there've been bugs,
so it's worth testing.

What would the framework look like?

cheers

andrew

Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Asim Praveen

pasim@vmware.com

over 5 years ago

In reply to: Tom Lane (#5)

2 attachment(s)

Re: Continuing instability in insert-conflict-specconflict test

Let me (rather shamelessly) extract a couple of patches from the
patch set that was already shared in the fault injection framework
proposal [1]/messages/by-id/CANXE4Tc+RYRC48=dKYn1PvAjE26Ew4hh=XUjBRGj=J9eob-S6g@mail.gmail.com.

The first patch incorporates a new syntax in isolation spec grammar to
explicitly mark a step that is expected to block (due to reasons other
than locks). E.g.

permutation step1 step2& step3

The “&” suffix indicates that step2 is expected to block and isolation
tester should move on to step3 without waiting for step2 to finish.

The second patch implements the insert-conflict scenario that is being
discussed here - one session waits (using a “suspend” fault) after
inserting a tuple into the heap relation but before updating the
index. Another session concurrently inserts a conflicting tuple in
the heap and the index, and commits. Then the fault is reset so that
the blocked session resumes and detects conflict when updating the
index.

On 25-Aug-2020, at 9:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I've spent the day fooling around with a re-implementation of
isolationtester that waits for all its controlled sessions to quiesce
(either wait for client input, or block on a lock held by another
session) before moving on to the next step. That was not a feasible
approach before we had the wait_event infrastructure, but it's
seeming like it might be workable now. Still have a few issues to
sort out though ...

I wasted a good deal of time on this idea, and eventually concluded
that it's a dead end, because there is an unremovable race condition.
Namely, that even if the isolationtester's observer backend has
observed that test session X has quiesced according to its
wait_event_info, it is possible for the report of that fact to arrive
at the isolationtester client process before test session X's output
does.

The attached test evades this race condition by not depending on any
output from the blocked session X. It queries status of the injected
fault to ascertain that a specific point in the code was reached
during execution.

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.

+1 to getting rid of NOTICE outputs.

Please refer to https://github.com/asimrp/postgres/tree/faultinjector
for the full patch set proposed in [1]/messages/by-id/CANXE4Tc+RYRC48=dKYn1PvAjE26Ew4hh=XUjBRGj=J9eob-S6g@mail.gmail.com that is now rebased against the
latest master.

Asim

[1]: /messages/by-id/CANXE4Tc+RYRC48=dKYn1PvAjE26Ew4hh=XUjBRGj=J9eob-S6g@mail.gmail.com

Attachments:

0001-Add-syntax-to-declare-a-step-that-is-expected-to-blo.patchapplication/octet-stream; name=0001-Add-syntax-to-declare-a-step-that-is-expected-to-blo.patchDownload

From d5353fa452d342e970a3aa96d8c6d3ad2d15a089 Mon Sep 17 00:00:00 2001
From: Asim R P <pasim@vmware.com>
Date: Mon, 31 Aug 2020 19:41:01 +0530
Subject: [PATCH 3/6] Add syntax to declare a step that is expected to block

The syntax is (note the "&" suffix):

    permutation "step_1" "step_2"& "step_3"

This is useful for defining steps that are expected to block for reasons
other than waiting on a lock.  For example, an injected fault may cause
a backend to suspend until another event occurs.  Isolation tester will
move on to execute step_3 without waiting for step_2 to finish, because
it is expected to block.

I've incorporated review feedback from Alvaro Herrera and
Michael Paquier.
---
 src/test/isolation/isolationtester.c | 31 ++++++++++++++++-----------
 src/test/isolation/isolationtester.h | 11 ++++++++--
 src/test/isolation/specparse.y       | 32 ++++++++++++++++++++++------
 src/test/isolation/specscanner.l     |  8 +++++++
 4 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index f80261c0229..896d150189a 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -345,7 +345,7 @@ run_named_permutations(TestSpec *testspec)
 		/* Find all the named steps using the lookup table */
 		for (j = 0; j < p->nsteps; j++)
 		{
-			Step	  **this = (Step **) bsearch(p->stepnames[j],
+			Step	  **this = (Step **) bsearch(p->steps[j].name,
 												 testspec->allsteps,
 												 testspec->nallsteps,
 												 sizeof(Step *),
@@ -354,10 +354,11 @@ run_named_permutations(TestSpec *testspec)
 			if (this == NULL)
 			{
 				fprintf(stderr, "undefined step \"%s\" specified in permutation\n",
-						p->stepnames[j]);
+						p->steps[j].name);
 				exit(1);
 			}
 			steps[j] = *this;
+			steps[j]->blocks = p->steps[j].blocks;
 		}
 
 		/* And run them */
@@ -726,19 +727,23 @@ try_complete_step(TestSpec *testspec, Step *step, int flags)
 			if (flags & STEP_NONBLOCK)
 			{
 				bool		waiting;
-
-				res = PQexecPrepared(conns[0], PREP_WAITING, 1,
-									 &backend_pid_strs[step->session + 1],
-									 NULL, NULL, 0);
-				if (PQresultStatus(res) != PGRES_TUPLES_OK ||
-					PQntuples(res) != 1)
+				if (step->blocks)
+					waiting = true;
+				else
 				{
-					fprintf(stderr, "lock wait query failed: %s",
-							PQerrorMessage(conns[0]));
-					exit(1);
+					res = PQexecPrepared(conns[0], PREP_WAITING, 1,
+										 &backend_pid_strs[step->session + 1],
+										 NULL, NULL, 0);
+					if (PQresultStatus(res) != PGRES_TUPLES_OK ||
+						PQntuples(res) != 1)
+					{
+						fprintf(stderr, "lock wait query failed: %s",
+								PQerrorMessage(conns[0]));
+						exit(1);
+					}
+					waiting = ((PQgetvalue(res, 0, 0))[0] == 't');
+					PQclear(res);
 				}
-				waiting = ((PQgetvalue(res, 0, 0))[0] == 't');
-				PQclear(res);
 
 				if (waiting)	/* waiting to acquire a lock */
 				{
diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h
index 9cf50124168..4b39b421276 100644
--- a/src/test/isolation/isolationtester.h
+++ b/src/test/isolation/isolationtester.h
@@ -30,6 +30,7 @@ struct Step
 {
 	int			session;
 	bool		used;
+	bool		blocks;
 	char	   *name;
 	char	   *sql;
 	char	   *errormsg;
@@ -37,8 +38,14 @@ struct Step
 
 typedef struct
 {
-	int			nsteps;
-	char	  **stepnames;
+	char	   *name;
+	bool		blocks;
+} PermutationStep;
+
+typedef struct
+{
+	int nsteps;
+	PermutationStep	*steps;
 } Permutation;
 
 typedef struct
diff --git a/src/test/isolation/specparse.y b/src/test/isolation/specparse.y
index 5e007e1bf09..7c6832cec54 100644
--- a/src/test/isolation/specparse.y
+++ b/src/test/isolation/specparse.y
@@ -44,8 +44,8 @@ TestSpec		parseresult;			/* result of parsing is left here */
 %type <step> step
 %type <permutation> permutation
 
-%token <str> sqlblock string_literal
-%token PERMUTATION SESSION SETUP STEP TEARDOWN TEST
+%token <str> sqlblock string_literal string_with_blocks
+%token BLOCKING PERMUTATION SESSION SETUP STEP TEARDOWN TEST
 
 %%
 
@@ -143,6 +143,7 @@ step:
 			STEP string_literal sqlblock
 			{
 				$$ = pg_malloc(sizeof(Step));
+				$$->blocks = false;
 				$$->name = $2;
 				$$->sql = $3;
 				$$->used = false;
@@ -183,7 +184,7 @@ permutation:
 			PERMUTATION string_literal_list
 			{
 				$$ = pg_malloc(sizeof(Permutation));
-				$$->stepnames = (char **) $2.elements;
+				$$->steps = (PermutationStep *) $2.elements;
 				$$->nsteps = $2.nelements;
 			}
 		;
@@ -192,15 +193,32 @@ string_literal_list:
 			string_literal_list string_literal
 			{
 				$$.elements = pg_realloc($1.elements,
-										 ($1.nelements + 1) * sizeof(void *));
-				$$.elements[$1.nelements] = $2;
+										 ($1.nelements + 1) * sizeof(PermutationStep));
+				((PermutationStep *) ($$.elements))[$1.nelements].name = $2;
+				((PermutationStep *) ($$.elements))[$1.nelements].blocks = false;
 				$$.nelements = $1.nelements + 1;
 			}
 			| string_literal
 			{
 				$$.nelements = 1;
-				$$.elements = pg_malloc(sizeof(void *));
-				$$.elements[0] = $1;
+				$$.elements = pg_malloc(sizeof(PermutationStep));
+				((PermutationStep *) ($$.elements))[0].name = $1;
+				((PermutationStep *) ($$.elements))[0].blocks = false;
+			}
+			| string_literal_list string_with_blocks
+			{
+				$$.elements = pg_realloc($1.elements,
+										 ($1.nelements + 1) * sizeof(PermutationStep));
+				((PermutationStep *) ($$.elements))[$1.nelements].name = $2;
+				((PermutationStep *) ($$.elements))[$1.nelements].blocks = true;
+				$$.nelements = $1.nelements + 1;
+			}
+			| string_with_blocks
+			{
+				$$.nelements = 1;
+				$$.elements = pg_malloc(sizeof(PermutationStep));
+				((PermutationStep *) ($$.elements))[0].name = $1;
+				((PermutationStep *) ($$.elements))[0].blocks = true;
 			}
 		;
 
diff --git a/src/test/isolation/specscanner.l b/src/test/isolation/specscanner.l
index 410f17727e1..a3848db7674 100644
--- a/src/test/isolation/specscanner.l
+++ b/src/test/isolation/specscanner.l
@@ -39,6 +39,8 @@ static void addlitchar(char c);
 non_newline		[^\n\r]
 space			[ \t\r\f]
 
+blocks			[&]
+
 comment			("#"{non_newline}*)
 
 %%
@@ -69,6 +71,12 @@ teardown		{ return TEARDOWN; }
 					BEGIN(INITIAL);
 					return(string_literal);
 				}
+<qstr>\"{blocks}		{
+					litbuf[litbufpos] = '\0';
+					yylval.str = pg_strdup(litbuf);
+					BEGIN(INITIAL);
+					return(string_with_blocks);
+				}
 <qstr>.			{ addlitchar(yytext[0]); }
 <qstr>\n		{ yyerror("unexpected newline in quoted string"); }
 <qstr><<EOF>>	{ yyerror("unterminated quoted string"); }
-- 
2.24.3 (Apple Git-128)

0002-Speculative-insert-isolation-test-spec-using-fault-i.patchapplication/octet-stream; name=0002-Speculative-insert-isolation-test-spec-using-fault-i.patchDownload

From 7b52ced5c87b94c9a78570f9112c54fda937d1d9 Mon Sep 17 00:00:00 2001
From: Asim R P <pasim@vmware.com>
Date: Mon, 31 Aug 2020 19:41:02 +0530
Subject: [PATCH 4/6] Speculative insert isolation test spec using fault
 injector

This spec simulates a specific interleaving of concurrent speculative
inserts that is rather cumbersome to achieve without injecting faults.
The interleaving is such that one speculative insert should encounter
conflict only after inserting a tuple into heap relation but before
inserting it into index relation.

Discussion: https://www.postgresql.org/message-id/CAAKRu_a7hbyrk%3DwveHYhr4LbcRnRCG%3DyPUVoQYB9YO1CdUBE9Q%40mail.gmail.com
---
 src/backend/access/heap/heapam.c              |  2 +
 src/backend/executor/execIndexing.c           |  2 +
 .../expected/insert-conflict-with-faults.out  | 25 +++++++++
 src/test/isolation/faultinjector_schedule     |  1 +
 .../specs/insert-conflict-with-faults.spec    | 53 +++++++++++++++++++
 5 files changed, 83 insertions(+)
 create mode 100644 src/test/isolation/expected/insert-conflict-with-faults.out
 create mode 100644 src/test/isolation/faultinjector_schedule
 create mode 100644 src/test/isolation/specs/insert-conflict-with-faults.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9b5f417eac4..aaa2dec3aba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -64,6 +64,7 @@
 #include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
+#include "utils/faultinjector.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/relcache.h"
@@ -5576,6 +5577,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 	TransactionId prune_xid;
 
 	Assert(ItemPointerIsValid(tid));
+	SIMPLE_FAULT_INJECTOR("heap_abort_speculative");
 
 	block = ItemPointerGetBlockNumber(tid);
 	buffer = ReadBuffer(relation, block);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1862af621be..294c5eb5ffc 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -114,6 +114,7 @@
 #include "executor/executor.h"
 #include "nodes/nodeFuncs.h"
 #include "storage/lmgr.h"
+#include "utils/faultinjector.h"
 #include "utils/snapmgr.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
@@ -289,6 +290,7 @@ ExecInsertIndexTuples(TupleTableSlot *slot,
 	bool		isnull[INDEX_MAX_KEYS];
 
 	Assert(ItemPointerIsValid(tupleid));
+	SIMPLE_FAULT_INJECTOR("insert_index_tuples");
 
 	/*
 	 * Get information from the result relation info structure.
diff --git a/src/test/isolation/expected/insert-conflict-with-faults.out b/src/test/isolation/expected/insert-conflict-with-faults.out
new file mode 100644
index 00000000000..6ab71bab1f1
--- /dev/null
+++ b/src/test/isolation/expected/insert-conflict-with-faults.out
@@ -0,0 +1,25 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_upsert s2_upsert s2_unblock_s1 s1_fault_status s1_select
+inject_fault   
+
+Success:       
+step s1_upsert: INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s1') ON CONFLICT (key) DO UPDATE SET data = upserttest.data || ' with conflict update s1'; <waiting ...>
+step s2_upsert: INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s2') ON CONFLICT (key) DO UPDATE SET data = upserttest.data || ' with conflict update s2';
+step s2_unblock_s1: SELECT * FROM inject_fault('insert_index_tuples', 'resume');
+inject_fault   
+
+Success:       
+step s1_upsert: <... completed>
+step s1_fault_status: SELECT * FROM inject_fault('heap_abort_speculative', 'status');
+inject_fault   
+
+Success: fault name:'heap_abort_speculative' fault type:'skip' database name:'' table name:'' start occurrence:'1' end occurrence:'1' extra arg:'0' fault injection state:'completed' num times hit:'1' 
+
+step s1_select: SELECT * FROM upserttest;
+key            data           
+
+k1             inserted s2 with conflict update s1
+inject_fault   
+
+Success:       
diff --git a/src/test/isolation/faultinjector_schedule b/src/test/isolation/faultinjector_schedule
new file mode 100644
index 00000000000..c658196e614
--- /dev/null
+++ b/src/test/isolation/faultinjector_schedule
@@ -0,0 +1 @@
+test: insert-conflict-with-faults
diff --git a/src/test/isolation/specs/insert-conflict-with-faults.spec b/src/test/isolation/specs/insert-conflict-with-faults.spec
new file mode 100644
index 00000000000..32b11d28de5
--- /dev/null
+++ b/src/test/isolation/specs/insert-conflict-with-faults.spec
@@ -0,0 +1,53 @@
+# INSERT ... ON CONFLICT test verifying that speculative insertion
+# failures are handled
+
+setup
+{
+    CREATE TABLE upserttest(key text, data text);
+    CREATE UNIQUE INDEX ON upserttest(key);
+
+    CREATE EXTENSION faultinjector;
+    -- start with a clean slate
+    SELECT * FROM inject_fault('all', 'reset');
+
+    -- inject fault to suspend insert transaction after a tuple has
+    -- been inserted into the heap but before it is inserted into the
+    -- index.
+    SELECT * FROM inject_fault('insert_index_tuples', 'suspend');
+    SELECT * FROM inject_fault('heap_abort_speculative', 'skip');
+}
+
+teardown
+{
+    DROP TABLE upserttest;
+    SELECT * FROM inject_fault('all', 'reset');
+}
+
+
+session "s1"
+setup
+{
+  SET default_transaction_isolation = 'read committed';
+}
+step "s1_upsert" { INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s1') ON CONFLICT (key) DO UPDATE SET data = upserttest.data || ' with conflict update s1'; }
+step "s1_fault_status" { SELECT * FROM inject_fault('heap_abort_speculative', 'status'); }
+step "s1_select" { SELECT * FROM upserttest; }
+
+session "s2"
+setup
+{
+  SET default_transaction_isolation = 'read committed';
+}
+step "s2_upsert" { INSERT INTO upserttest(key, data) VALUES('k1', 'inserted s2') ON CONFLICT (key) DO UPDATE SET data = upserttest.data || ' with conflict update s2'; }
+step "s2_unblock_s1" { SELECT * FROM inject_fault('insert_index_tuples', 'resume'); }
+
+# Test that speculative locks are correctly acquired and released, s2
+# inserts, s1 updates.
+permutation
+   # S1 should hit the fault and block
+   "s1_upsert"&
+   # S2 should insert without conflict
+   "s2_upsert"
+   "s2_unblock_s1"
+   "s1_fault_status"
+   "s1_select"
-- 
2.24.3 (Apple Git-128)

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Tom Lane (#5)

Re: Continuing instability in insert-conflict-specconflict test

The test material added in commit 43e0841 continues to yield buildfarm
failures. Failures new since the rest of this thread:

On Tue, Aug 25, 2020 at 12:04:41PM -0400, Tom Lane wrote:

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.

That sounds promising. Are those messages important for observing server
bugs, or are they for debugging/modifying the test itself? If the latter, one
could just change the messages to LOG. Any of the above won't solve things
completely, because 3 of the 21 failures have diffs in the pg_locks output:

dory │ 2020-03-14 19:35:31 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-03-14%2019%3A35%3A31
walleye │ 2021-03-25 05:00:44 │ REL_13_STABLE │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=walleye&dt=2021-03-25%2005%3A00%3A44
walleye │ 2021-05-05 17:00:41 │ REL_13_STABLE │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=walleye&dt=2021-05-05%2017%3A00%3A41

Perhaps the pg_locks query should poll until pg_locks has the expected rows.
Or else poll until all test sessions are waiting or idle.

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Noah Misch (#8)

Re: Continuing instability in insert-conflict-specconflict test

Noah Misch <noah@leadboat.com> writes:

The test material added in commit 43e0841 continues to yield buildfarm
failures.

Yeah. It's only a relatively small fraction of the total volume of
isolation-test failures, so I'm not sure it's worth expending a
whole lot of effort on this issue by itself.

On Tue, Aug 25, 2020 at 12:04:41PM -0400, Tom Lane wrote:

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.

That sounds promising. Are those messages important for observing server
bugs, or are they for debugging/modifying the test itself? If the latter, one
could just change the messages to LOG.

I think they are important, because they show that the things we expect
to happen occur when we expect them to happen.

I experimented with replacing the RAISE NOTICEs with INSERTs, and ran
into two problems:

1. You can't do an INSERT in an IMMUTABLE function. This is easily
worked around by putting the INSERT in a separate, volatile function.
That's cheating like mad of course, but so is the rest of the stuff
this test does in "immutable" functions.

2. The inserted events don't become visible from the outside until the
respective session commits. This seems like an absolute show-stopper.
After the fact, we can see that the events happened in the expected
relative order; but we don't have proof that they happened in the right
order relative to the actions visible in the test output file.

... Any of the above won't solve things
completely, because 3 of the 21 failures have diffs in the pg_locks output:

Yeah, it looks like the issue there is that session 2 reports completion
of its step before session 1 has a chance to make progress after obtaining
the lock. This seems to me to be closely related to the race conditions
I described upthread.

[ thinks for awhile ... ]

I wonder whether we could do better with something along these lines:

* Adjust the test script's functions to emit a NOTICE *after* acquiring
a lock, not before.

* Annotate permutations with something along the lines of "expect N
NOTICE outputs before allowing this step to be considered complete",
which we'd attach to the unlock steps.

This idea is only half baked at present, but maybe there's something
to work with there. If it works, maybe we could improve the other
test cases that are always pseudo-failing in a similar way. For
example, in the deadlock tests, annotate steps with "expect step
Y to finish before step X".

regards, tom lane

#10

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Tom Lane (#9)

Re: Continuing instability in insert-conflict-specconflict test

On Sun, Jun 13, 2021 at 04:48:48PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Tue, Aug 25, 2020 at 12:04:41PM -0400, Tom Lane wrote:

I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert
log records into some table, which we can inspect after the fact
to verify that things happened as we expect.

That sounds promising. Are those messages important for observing server
bugs, or are they for debugging/modifying the test itself? If the latter, one
could just change the messages to LOG.

I think they are important, because they show that the things we expect
to happen occur when we expect them to happen.

I experimented with replacing the RAISE NOTICEs with INSERTs, and ran
into two problems:

1. You can't do an INSERT in an IMMUTABLE function. This is easily
worked around by putting the INSERT in a separate, volatile function.
That's cheating like mad of course, but so is the rest of the stuff
this test does in "immutable" functions.

2. The inserted events don't become visible from the outside until the
respective session commits. This seems like an absolute show-stopper.
After the fact, we can see that the events happened in the expected
relative order; but we don't have proof that they happened in the right
order relative to the actions visible in the test output file.

One could send the inserts over dblink, I suppose.

... Any of the above won't solve things
completely, because 3 of the 21 failures have diffs in the pg_locks output:

* Adjust the test script's functions to emit a NOTICE *after* acquiring
a lock, not before.

I suspect that particular lock acquisition merely unblocks the processing that
reaches the final lock state expected by the test. So, ...

* Annotate permutations with something along the lines of "expect N
NOTICE outputs before allowing this step to be considered complete",
which we'd attach to the unlock steps.

... I don't expect this to solve $SUBJECT. It could be a separately-useful
feature, though.

This idea is only half baked at present, but maybe there's something
to work with there. If it works, maybe we could improve the other
test cases that are always pseudo-failing in a similar way. For
example, in the deadlock tests, annotate steps with "expect step
Y to finish before step X".

Yeah, a special permutation list entry like PQgetResult(s8) could solve
failures like
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-11%2017%3A13%3A44

Incidentally, I have a different idle patch relevant to deadlock test failures
like that. Let me see if it has anything useful.

#11

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Noah Misch (#10)

Re: Continuing instability in insert-conflict-specconflict test

Noah Misch <noah@leadboat.com> writes:

On Sun, Jun 13, 2021 at 04:48:48PM -0400, Tom Lane wrote:

* Adjust the test script's functions to emit a NOTICE *after* acquiring
a lock, not before.

I suspect that particular lock acquisition merely unblocks the processing that
reaches the final lock state expected by the test. So, ...

Ah, you're probably right.

* Annotate permutations with something along the lines of "expect N
NOTICE outputs before allowing this step to be considered complete",
which we'd attach to the unlock steps.

... I don't expect this to solve $SUBJECT. It could be a separately-useful
feature, though.

I think it would solve it. In the examples at hand, where we have

@@ -377,8 +377,6 @@
pg_advisory_unlock

t
-s1: NOTICE: blurt_and_lock_123() called for k1 in session 1
-s1: NOTICE: acquiring advisory lock on 2
step s2_upsert: <... completed>
step controller_print_speculative_locks:
SELECT pa.application_name, locktype, mode, granted

and then those notices show up sometime later, I'm hypothesizing
that the actions did happen timely, but the actual delivery of
those packets to the isolationtester client did not. If we
annotated step s2_upsert with a marker to the effect of "wait
for 2 NOTICEs from session 1 before considering this step done",
we could resolve that race condition. Admittedly, this is putting
a thumb on the scales a little bit, but it's hard to see how to
deal with inconsistent TCP delivery delays without that.

(BTW, I find that removing the pq_flush() call at the bottom of
send_message_to_frontend produces this failure and a bunch of
other similar ones.)

Yeah, a special permutation list entry like PQgetResult(s8) could solve
failures like
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-11%2017%3A13%3A44

Right. I'm visualizing it more like annotating s7a8 as requiring
s8a1 to complete first -- or vice versa, either would stabilize
that test result I think.

We might be able to get rid of the stuff about concurrent step
completion in isolationtester.c if we required the spec files
to use annotations to force a deterministic step completion
order in all such cases.

regards, tom lane

#12

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Tom Lane (#11)

Re: Continuing instability in insert-conflict-specconflict test

On Sun, Jun 13, 2021 at 06:09:20PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Sun, Jun 13, 2021 at 04:48:48PM -0400, Tom Lane wrote:

* Adjust the test script's functions to emit a NOTICE *after* acquiring
a lock, not before.

I suspect that particular lock acquisition merely unblocks the processing that
reaches the final lock state expected by the test. So, ...

Ah, you're probably right.

* Annotate permutations with something along the lines of "expect N
NOTICE outputs before allowing this step to be considered complete",
which we'd attach to the unlock steps.

... I don't expect this to solve $SUBJECT. It could be a separately-useful
feature, though.

I think it would solve it. In the examples at hand, where we have

@@ -377,8 +377,6 @@
pg_advisory_unlock

t
-s1: NOTICE: blurt_and_lock_123() called for k1 in session 1
-s1: NOTICE: acquiring advisory lock on 2
step s2_upsert: <... completed>
step controller_print_speculative_locks:
SELECT pa.application_name, locktype, mode, granted

It would solve that one particular diff. I meant that it wouldn't solve the
aforementioned pg_locks diffs:

We might be able to get rid of the stuff about concurrent step
completion in isolationtester.c if we required the spec files
to use annotations to force a deterministic step completion
order in all such cases.

Yeah. If we're willing to task spec authors with that, the test program can't
then guess wrong under unusual timing.

#13

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Noah Misch (#12)

Re: Continuing instability in insert-conflict-specconflict test

Hi,

On 2021-06-13 15:22:12 -0700, Noah Misch wrote:

On Sun, Jun 13, 2021 at 06:09:20PM -0400, Tom Lane wrote:

We might be able to get rid of the stuff about concurrent step
completion in isolationtester.c if we required the spec files
to use annotations to force a deterministic step completion
order in all such cases.

Yeah. If we're willing to task spec authors with that, the test program can't
then guess wrong under unusual timing.

I think it'd make it *easier* for spec authors. Right now one needs to
find some way to get a consistent ordering, which is often hard and
complicates tests way more than specifying an explicit ordering
would. And it's often unreliable, as evidenced here and in plenty other
tests.

Greetings,

Andres Freund

#14

Gavin Flower

GavinFlower@archidevsys.co.nz

over 4 years ago

In reply to: Andres Freund (#13)

Re: Continuing instability in insert-conflict-specconflict test

On 14/06/21 11:49 am, Andres Freund wrote:

Hi,

On 2021-06-13 15:22:12 -0700, Noah Misch wrote:

On Sun, Jun 13, 2021 at 06:09:20PM -0400, Tom Lane wrote:

We might be able to get rid of the stuff about concurrent step
completion in isolationtester.c if we required the spec files
to use annotations to force a deterministic step completion
order in all such cases.

Yeah. If we're willing to task spec authors with that, the test program can't
then guess wrong under unusual timing.

I think it'd make it *easier* for spec authors. Right now one needs to
find some way to get a consistent ordering, which is often hard and
complicates tests way more than specifying an explicit ordering
would. And it's often unreliable, as evidenced here and in plenty other
tests.

Greetings,

Andres Freund

How about adding a keyword like 'DETERMINISTIC' to the top level SELECT,
the idea being the output would be deterministic (given the same table
values after filtering etc), but not necessarily in any particular
order?ï¿œ So pg could decide the optimum way to achieve that which may not
necessarily need a sort.

Cheers,
Gavin

#15

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Andres Freund (#13)

Re: Continuing instability in insert-conflict-specconflict test

On Sun, Jun 13, 2021 at 04:49:04PM -0700, Andres Freund wrote:

On 2021-06-13 15:22:12 -0700, Noah Misch wrote:

On Sun, Jun 13, 2021 at 06:09:20PM -0400, Tom Lane wrote:

We might be able to get rid of the stuff about concurrent step
completion in isolationtester.c if we required the spec files
to use annotations to force a deterministic step completion
order in all such cases.

Yeah. If we're willing to task spec authors with that, the test program can't
then guess wrong under unusual timing.

I think it'd make it *easier* for spec authors. Right now one needs to
find some way to get a consistent ordering, which is often hard and
complicates tests way more than specifying an explicit ordering
would. And it's often unreliable, as evidenced here and in plenty other
tests.

Fine with me. Even if it weren't easier for spec authors, it shifts efforts
to spec authors and away from buildfarm observers, which is a good thing.