Test 031_recovery_conflict fails when a conflict is counted twice

Started by Alexander Lakhin over 1 year ago, 2 messages
#1 Alexander Lakhin
exclusion@gmail.com
1 attachment(s)

Hello hackers,

As a recent buildfarm test failure on olingo (which builds postgres with
-O1 and address sanitizer) [1] shows:
...
[23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby
[23:12:02.130](0.003s) #   Failed test 'snapshot conflict: stats show conflict on standby'
#   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
[23:12:02.130](0.000s) #          got: '2'
#     expected: '1'
...
[23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database
[23:12:06.887](0.040s) #   Failed test '5 recovery conflicts shown in pg_stat_database'
#   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286.
[23:12:06.887](0.000s) #          got: '6'
#     expected: '5'
Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
done
...

pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still waiting after 15.289 ms: recovery conflict on
snapshot
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981.
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT:  WAL redo at 0/344F468 for Heap2/PRUNE_VACUUM_SCAN:
snapshotConflictHorizon: 746, isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{ xmax: 0,
infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, { xmax: 0, infomask: 11008, infomask2: 32770, ntuples:
18, offsets: [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }], redirected: [23->41, 24->42,
25->43, 26->44, 27->45, 28->46, 29->47, 30->48, 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57,
40->58]; blkref #0: rel 1663/16385/16386, blk 0
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions
that must be removed.
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the
database and repeat your command.
vvv an important difference from a successful test run
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions
that must be removed.
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the
database and repeat your command.
^^^

Test 031_recovery_conflict may fail in the following scenario:

031_recovery_conflict.pl:
    executes a query that produces a conflict:
    ## RECOVERY CONFLICT 2: Snapshot conflict
    ...
    $psql_standby->query_safe(...)

startup process:
        detects a snapshot conflict and sends
        PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
        (ResolveRecoveryConflictWithVirtualXIDs ->
        CancelVirtualTransaction) to the client backend

client backend:
            receives and processes the signal:
            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
            ProcessRecoveryConflictInterrupt,

            reports the recovery conflict:
            pgstat_report_recovery_conflict(reason);

            and reports the error:
            ereport(FATAL, ... "terminating connection due to conflict with
            recovery" ...)
            sends the message to the server log:
            errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
    # finds the message in the log and resets psql connection:
    check_conflict_log(
      "User query might have needed to see row versions that must
       be removed");
    $psql_standby->reconnect_and_clear();

startup process:
        keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client
        backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs

client backend:
            tries to send the message to the client:
            send_message_to_frontend -> socket_flush ->
            internal_flush_buffer,
            gets an error (EPIPE) from secure_write, and calls
            ereport(COMMERROR,
              (errcode_for_socket_access(),
               errmsg("could not send data to client: %m")));

            receives the next PROCSIG_RECOVERY_CONFLICT_SNAPSHOT
            signal and processes it the same way:

            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
            ProcessRecoveryConflictInterrupt,

            reports the recovery conflict:
            pgstat_report_recovery_conflict(reason);
            // now the conflict is counted twice

            and reports the error:
            ereport(FATAL, ... "terminating connection due to conflict with
            recovery" ...)
            sends the message to the server log:
            errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
    calls
    check_conflict_stat("snapshot");
    and gets 2 instead of 1.

The patch adding delays to reproduce the issue is attached.

With the patch applied, I run the test (against an "-O0" build) in a loop:
for i in `seq 20`; do echo "I $i"; make check -s -C \
src/test/recovery/ PROVE_TESTS="t/031*"; grep ' not ok 6 ' \
src/test/recovery/tmp_check/log/regress_log_031_recovery_conflict && break;
done

and get exactly the same failure as on olingo:
I 1
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl .. 6/?
#   Failed test 'snapshot conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 333.
#          got: '2'
#     expected: '1'
t/031_recovery_conflict.pl .. 13/?
#   Failed test '5 recovery conflicts shown in pg_stat_database'
#   at t/031_recovery_conflict.pl line 287.
#          got: '6'
#     expected: '5'
# Looks like you failed 2 tests of 18.
t/031_recovery_conflict.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/18 subtests

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 512 Tests: 18 Failed: 2)
  Failed tests:  6, 17
  Non-zero exit status: 2

(Similar failures can be seen with other sections of 031_recovery_conflict.)

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-05-15%2023%3A03%3A30

Best regards,
Alexander

Attachments:

reproduce-031_recovery_conflict-failure.patch (text/x-patch)
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d91a85cb2d..303301ddc6 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -465,6 +465,8 @@ errstart(int elevel, const char *domain)
 	return true;
 }
 
+static bool do_sleep = false;
+
 /*
  * errfinish --- end an error-reporting cycle
  *
@@ -542,6 +544,12 @@ errfinish(const char *filename, int lineno, const char *funcname)
 		PG_RE_THROW();
 	}
 
+if (elevel == FATAL && rand() % 3 == 0)
+{
+long r = 0; for (long i = 0; i < 10000000L; i++) r += i % 2;
+fprintf(stderr, "!!!errfinish [%d]| r: %ld,  edata->funcname: %s, sleep() before EmitErrorReport()\n", getpid(), r, edata->funcname);
+do_sleep = true;
+}
 	/* Emit the message to the right places */
 	EmitErrorReport();
 
@@ -1710,6 +1718,12 @@ EmitErrorReport(void)
 	if (edata->output_to_server)
 		send_message_to_server_log(edata);
 
+if (do_sleep)
+{
+long r = 0; for (long i = 0; i < 30000000L; i++) r += i % 2;
+fprintf(stderr, "!!!EmitErrorReport [%d]| r: %ld, sleep() before send_message_to_frontend()...\n", getpid(), r);
+do_sleep = false;
+}
 	/* Send to client, if enabled */
 	if (edata->output_to_client)
 		send_message_to_frontend(edata);
diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index d87efa823f..a59bd13c38 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -153,6 +153,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_conflict_log(
 	"User query might have needed to see row versions that must be removed");
 $psql_standby->reconnect_and_clear();
+sleep(1);
 check_conflict_stat("snapshot");
 
 
#2 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Alexander Lakhin (#1)
Re: Test 031_recovery_conflict fails when a conflict is counted twice

On Mon, May 27, 2024 at 06:00:00PM GMT, Alexander Lakhin wrote:

Thanks for the breakdown of the steps, a very interesting case. If I
understand everything correctly, it could be summarized as: the startup
process tries to stop the client backend, which struggles with error
reporting just long enough to receive a second signal. The issue seems
rare; even on olingo it has appeared only once. Yet it raises the
question of whether this is incorrect behaviour or the test is simply
not robust enough.

From what I see in ProcessRecoveryConflictInterrupt, there is no defence
against a duplicated recovery conflict getting reported, except the
expectation that an ereport(FATAL) will terminate the session (the
limitations on recursive error reporting do not seem to stop that).
I guess this points in the direction that it's expected, and
check_conflict_stat might need to check for a number of conflicts >= 1.

If that's not the case, and recovery conflicts have to be reported only
once, then either the startup process or the client backend has to take
care of that. My first idea was to experiment with longer waiting
intervals for an unresponsive client backend (currently a constant of
5000 us); that seems to reduce the chances of reproducing the error
(with the attached patch applied), but I'm not sure it's an appropriate
solution.