Race condition in FetchTableStates() breaks synchronization of subscription tables

Started by Alexander Lakhin almost 2 years ago · 25 messages
#1 Alexander Lakhin
exclusion@gmail.com

Hello hackers,

After determining a possible cause for intermittent failures of the test
subscription/031_column_list [1], I was wondering what makes another
subscription test (014_binary) fail on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39

All those failures were caused by a timeout when waiting for a message expected
in _subscriber.log. For example, in snakefly's case:
[01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at
/home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

_subscriber.log contains:
2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data from WAL stream: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical replication apply worker" (PID 409999) exited with
exit code 1
2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker for subscription "tsub" has started
...

While _subscriber.log from a successful test run contains:
2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres
database=postgres host=[local]
2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker for subscription "tsub" has started
2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table synchronization worker for subscription "tsub",
table "test_mismatching_types" has started
2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, line 1, column a

In this case, "logical replication apply worker for subscription "tsub" has
started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.

I've managed to reproduce this failure locally by running multiple tests in
parallel, and my analysis shows that it is caused by a race condition when
accessing the variable table_states_valid inside tablesync.c.

tablesync.c does the following with table_states_valid:
/*
 * Callback from syscache invalidation.
 */
void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
    table_states_valid = false;
}
...
static bool
FetchTableStates(bool *started_tx)
{
...
    if (!table_states_valid)
    {
...
        /* Fetch all non-ready tables. */
        rstates = GetSubscriptionRelations(MySubscription->oid, true);
...
        table_states_valid = true;
    }

So, when syscache invalidation occurs inside the code block
"if (!table_states_valid)", that invalidation is effectively ignored.

In a failed case I observe the following events:
1. logical replication apply worker performs
 LogicalRepApplyLoop() -> process_syncing_tables() ->
 process_syncing_tables_for_apply() -> FetchTableStates() periodically.

2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
 for SUBSCRIPTIONRELMAP, and that leads to calling
 invalidate_syncing_table_states().

3. If the apply worker happens to fetch no non-ready tables in
 FetchTableStates() and overwrites the "table_states_valid = false" set by
 invalidate_syncing_table_states(), it simply misses the invalidation
 event, so it keeps working without noticing the non-ready tables that
 appeared as the result of ALTER SUBSCRIPTION
 (process_syncing_tables_for_apply() skips the loop
 "foreach(lc, table_states_not_ready) ..." until some other event occurs).

A pg_usleep(100000) added just below GetSubscriptionRelations(...) confirms my
analysis -- without it, I need tens of iterations (with 50 tests running in
parallel) to catch the failure, but with it, I get the failure on the first
iteration.

(Reproduced on REL_16_STABLE..master, where the test 014_binary tries to
"Refresh the publication to trigger the tablesync", but I think the race
condition exists in the previous versions as well.)

[1]: /messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com

Best regards,
Alexander

#2 vignesh C
vignesh21@gmail.com
In reply to: Alexander Lakhin (#1)
1 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:

Thanks for the analysis, I was able to reproduce the issue with the
steps you had shared. I agree with your analysis. I added some logs to
verify that the invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used a
boolean variable here. How about changing it slightly so that
table_states_invalid gets incremented for every invalidation and then
decremented after fetching the non-ready tables, as in the attached patch?
I was able to verify that the test passes with the attached patch.

Regards,
Vignesh

Attachments:

defect_fix.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4207b9356c..92eb86a249 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,7 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+static unsigned int table_states_invalid = 1;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +273,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_invalid++;
 }
 
 /*
@@ -1568,7 +1568,7 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_invalid)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
@@ -1608,7 +1608,7 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		table_states_invalid--;
 	}
 
 	return has_subrels;
#3 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#2)
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the analysis, I was able to reproduce the issue with the steps you had
shared. I agree with your analysis. I added some logs to verify that the
invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used a boolean
variable here, how about changing it slightly so that table_states_invalid gets
incremented for every invalidation and then decrementing table_states_invalid
after getting the non-ready tables like in the attached patch. I was able to
verify that the test passes with the attached patch.

Thanks for the patch.

I am not sure that counting the invalidations is needed: even if there are
hundreds of invalidations outside of FetchTableStates, one FetchTableStates
call should reset the count to 0, as it reads the latest catalog state. I
think a better approach is to make table_states_valid a tristate
{TABLE_STATE_NEEDS_REBUILD, TABLE_STATE_REBUILD_START, TABLE_STATE_VALID}. The
cache invalidation callback would set the state to TABLE_STATE_NEEDS_REBUILD,
and FetchTableStates would set the state to TABLE_STATE_VALID only if the
state is still TABLE_STATE_REBUILD_START. See the existing
EventTriggerCacheStateType for reference.

Best Regards,
Hou zj

#4 vignesh C
vignesh21@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#3)
1 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thanks for the patch.

I am not sure if counting the invalidation number is needed, as even if there
are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
call should reset the count to 0 as it is checking the latest catalog.

Another approach I was thinking of is to set table_states_valid at the very
beginning of the rebuild in FetchTableStates, so any new invalidation
arriving during the rebuild will clear table_states_valid again and be
handled in the next iteration of fetching non-ready tables, as in the
attached patch.

Regards,
Vignesh

Attachments:

defect_fix_v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4207b9356c..708c402e16 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1575,6 +1575,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = true;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1607,8 +1609,6 @@ FetchTableStates(bool *started_tx)
 		 */
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
-
-		table_states_valid = true;
 	}
 
 	return has_subrels;
#5 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#4)
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tuesday, January 30, 2024 11:21 AM vignesh C <vignesh21@gmail.com> wrote:

Another approach I was thinking of is to reset table_states_valid immediately in
the beginning of FetchTableStates, so any new invalidations will take care of
setting table_states_valid again it again which will be handled in the next
iteration of fetching non-ready tables like in the attached patch.

I think this is not the standard approach for cache building, because if any
ERROR happens during the rebuild, we end up in a situation where
table_states_valid = true while the cached data is invalid. Even though we
currently don't access this cache after erroring out, I think we'd better
avoid that risk.

Best Regards,
Hou zj

#6 vignesh C
vignesh21@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#5)
1 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, January 30, 2024 11:21 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com>

wrote:

On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com>

wrote:

Hello hackers,

After determining a possible cause for intermittent failures of
the test subscription/031_column_list [1], I was wondering what
makes another subscription test (014_binary) fail on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&amp;d
t=20
24-01-22%2001%3A19%3A03

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&amp;dt=2

02

4-01-14%2018%3A19%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&amp;dt
=202
3-12-21%2001%3A11%3A52

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&amp;dt=2

02

3-11-27%2001%3A42%3A39

All those failures caused by a timeout when waiting for a message
expected in _subscriber.log. For example, in the snakefly's case:
[01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
incorrect binary data format) at

/home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin
ary.pl
line 269.

_subscriber.log contains:
2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement:
ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-12-21 01:17:46.756 UTC [409999] ERROR: could not receive
data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2023-12-21 01:17:46.760 UTC [405057] LOG: background worker
"logical replication apply worker" (PID 409999) exited with exit
code 1
2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication
apply worker for subscription "tsub" has started ...

While _subscriber.log from a successful test run contains:
2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement:
ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
session time: 0:00:00.014 user=postgres database=postgres
host=[local]
2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication
apply worker for subscription "tsub" has started
2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication
table synchronization worker for subscription "tsub", table
"test_mismatching_types" has started
2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data
format
2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY
test_mismatching_types, line 1, column a

In this case, "logical replication apply worker for subscription
"tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3
minutes later.

I've managed to reproduce this failure locally by running multiple
tests in parallel, and my analysis shows that it is caused by a
race condition when accessing variable table_states_valid inside

tablesync.c.

tablesync.c does the following with table_states_valid:
/*
 * Callback from syscache invalidation.
 */
void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
	table_states_valid = false;
}
...
static bool
FetchTableStates(bool *started_tx)
{
	...
	if (!table_states_valid)
	{
		...
		/* Fetch all non-ready tables. */
		rstates = GetSubscriptionRelations(MySubscription->oid, true);
		...
		table_states_valid = true;
	}

So, when syscache invalidation occurs inside the code block "if
(!table_states_valid)", that invalidation is effectively ignored.

In a failed case I observe the following events:
1. logical replication apply worker performs
LogicalRepApplyLoop() -> process_syncing_tables() ->
process_syncing_tables_for_apply() -> FetchTableStates() periodically.

2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
for SUBSCRIPTIONRELMAP, and that leads to calling
invalidate_syncing_table_states().

3. If the apply worker manages to fetch no non-ready tables in
FetchTableStates() and ignore "table_states_valid = false" from
invalidate_syncing_table_states(), then it just misses the invalidation
event, so it keeps working without noticing non-ready tables appeared as
the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
a loop "foreach(lc, table_states_not_ready) ..." until some other event
occurs).

pg_usleep(100000) added just below GetSubscriptionRelations(...)
proves my analysis -- without it, I need tens of iterations (with
50 tests running in parallel) to catch the failure, but with it, I get
the failure on the first iteration.

Thanks for the analysis, I was able to reproduce the issue with the
steps you had shared. I agree with your analysis. I added some logs
to verify that the invalidation was getting missed.

I felt that this invalidation is getting ignored because we have
used a boolean variable here. How about changing it slightly so that
table_states_invalid gets incremented for every invalidation and
then decremented after fetching the non-ready
tables, like in the attached patch? I was able to verify that the test
passes with the attached patch.

Thanks for the patch.

I am not sure if counting the invalidation number is needed, as even
if there are hundreds of invalidations outside of FetchTableStates,
one FetchTableStates call should reset the count to 0, as it is checking
the latest catalog.

Another approach I was thinking of is to reset table_states_valid immediately
at the beginning of FetchTableStates, so any new invalidation will clear
table_states_valid again and will be handled in the next
iteration of fetching non-ready tables, like in the attached patch.

I think this is not the standard approach for cache building, because if any
ERROR happens during the cache building, then we will end up in a situation
where table_states_valid = true while the cache data is invalid. Even though we
currently don't access this cache after erroring out, I think we'd better
avoid this risk.

Here is an updated patch which changes the boolean variable to a
tri-state enum and sets the table state to valid only if no invalidations
have occurred while the list is being prepared.

Regards,
Vignesh

Attachments:

v3-0001-Table-sync-missed-for-newly-added-tables-because-.patch (text/x-patch)
From cec60bbd7a6d1b66672c6969a11b89a8e18190c9 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v3] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if an invalidation of the table sync state occurred
while the list of non-ready tables was being prepared. This occurred because
the table state was being set to valid at the end of non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations have occurred while the list is being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 5acab3f3e2..b38b0e3ba1 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

#7 Alexander Lakhin
exclusion@gmail.com
In reply to: vignesh C (#6)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:

Here is an updated patch which changes the boolean variable to a
tri-state enum and sets the table state to valid only if no invalidations
have occurred while the list is being prepared.

While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49      All tests successful.
49      Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  0.30 csys =  0.77 CPU)
49      Result: PASS
12      All tests successful.
12      Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  0.40 csys =  0.89 CPU)
12      Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does
not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
            a int PRIMARY KEY
        );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for
subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with
exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE was not
executed on the subscriber yet, and a new replication worker started because a
timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
(180 sec)).

But in a normal (fast) case the same WaitLatch exits due to MyLatch being set
as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
  ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }

With pg_usleep(300000); added just before ResetLatch and
 $node_subscriber->safe_psql(
     'postgres', qq(
+SELECT pg_sleep(0.5);
     CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following events:
1. logicalrep worker launcher launches replication worker and waits for it
 to attach:
 ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target
 relation does not exist, in our case) and sends a signal to set the latch
 for launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
 it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain() where it waits for the latch
 (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.
And we can find such abnormal execution on the buildfarm too:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check

So, if we had PG_TEST_TIMEOUT_DEFAULT less than DEFAULT_NAPTIME_PER_CYCLE,
we would notice this situation as a test failure, but those timeouts are
equal (accidentally?), hence, say, wait_for_log() just waits long enough
for an expected replication worker to start eventually.

As to the initial issue fixes, v3 and v2 look good to me, but v2 is more
appealing for back-patching (if this fix is going to be back-patched), IMHO.

Best regards,
Alexander

#8 vignesh C
vignesh21@gmail.com
In reply to: Alexander Lakhin (#7)
2 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:

Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:

Here is an updated patch which changes the boolean variable to a
tri-state enum and sets the table state to valid only if no invalidations
have occurred while the list is being prepared.

While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49 All tests successful.
49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr 0.30 csys = 0.77 CPU)
49 Result: PASS
12 All tests successful.
12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr 0.40 csys = 0.89 CPU)
12 Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message
type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target relation "public.test_mismatching_types" does
not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message
type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE TABLE public.test_mismatching_types (
a int PRIMARY KEY
);
2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker "logical replication apply worker for
subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical replication apply worker" (PID 949255) exited with
exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE was not
executed on the subscriber yet, and a new replication worker started because a
timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
(180 sec)).

But in a normal (fast) case the same WaitLatch exits due to MyLatch being set
as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
rc = WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}

With pg_usleep(300000); added just before ResetLatch and
$node_subscriber->safe_psql(
'postgres', qq(
+SELECT pg_sleep(0.5);
CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 csys = 0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 csys = 0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following events:
1. logicalrep worker launcher launches replication worker and waits for it
to attach:
ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target
relation does not exist, in our case) and sends a signal to set the latch
for launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain() where it waits for the latch
(not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.

Thanks for the steps to reproduce the issue; I was able to reproduce it
in my environment with the steps provided. The attached patch has a
proposed fix where the latch will not be reset in case the apply
worker exits immediately after starting.

Regards,
Vignesh

Attachments:

v4-0001-Table-sync-missed-for-newly-added-tables-because-.patch (text/x-patch)
From 87b290a6209f078b3029e27b0b61d3644026dae3 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v4 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if an invalidation of the table sync state occurred
while the list of non-ready tables was being prepared. This occurred because
the table state was being set to valid at the end of non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations have occurred while the list is being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..9e24fb608c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

v4-0002-Apply-worker-will-get-started-after-180-seconds-b.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Apply-worker-will-get-started-after-180-seconds-b.patchDownload
From 817dd4089d7c9c3e4481d0718a0a269f7f0f30c5 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v4 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's 180-second
timeout in case the apply worker exits immediately after startup. This
was happening because the launcher process's latch was getting reset
after the apply worker was started, which resulted in the launcher
waiting for the next 180-second timeout before restarting the apply worker.
Fixed this issue by not resetting the latch in case the apply worker has
exited.
---
 src/backend/replication/logical/launcher.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..2c750e2cd6 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -233,6 +233,19 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		if (rc & WL_LATCH_SET)
 		{
+			bool worker_in_use;
+
+			LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+			worker_in_use = worker->in_use;
+			LWLockRelease(LogicalRepWorkerLock);
+
+			/*
+			 * Latch must not be reset in case of the apply worker exited
+			 * immediately after start.
+			 */
+			if (!worker_in_use)
+				return worker_in_use;
+
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
-- 
2.34.1

#9 Alexander Lakhin
exclusion@gmail.com
In reply to: vignesh C (#8)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

05.02.2024 13:13, vignesh C wrote:

Thanks for the steps to reproduce the issue; I was able to reproduce it
in my environment with the steps provided. The attached patch has a
proposed fix where the latch will not be reset in case the apply
worker exits immediately after starting.

It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
called by a backend executing a CREATE SUBSCRIPTION command.
That is, with the v4-0002 patch applied and pg_usleep(300000L); added
just below
            if (!worker_in_use)
                return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 csys =  9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
regression_sub ...", gets missed while the launcher waits for the start of
another worker (the logical replication worker for subscription "admin_sub"),
launched just before that command.

Best regards,
Alexander

#10 vignesh C
vignesh21@gmail.com
In reply to: Alexander Lakhin (#9)
2 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:

05.02.2024 13:13, vignesh C wrote:

Thanks for the steps to reproduce the issue; I was able to reproduce it
in my environment with the steps provided. The attached patch has a
proposed fix where the latch will not be reset in case the apply
worker exits immediately after starting.

It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
called by a backend executing a CREATE SUBSCRIPTION command.
That is, with the v4-0002 patch applied and pg_usleep(300000L); added
just below
if (!worker_in_use)
return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
regression_sub ...", gets missed when launcher waits for start of another
worker (logical replication worker for subscription "admin_sub"), launched
just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
ends up delaying the apply worker's start until the 180-second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset
the latch and instead lets ApplyLauncherMain reset it and check whether
any new or missing worker needs to be started.

Regards,
Vignesh

Attachments:

v5-0002-Apply-worker-will-get-started-after-180-seconds-b.patch (text/x-patch)
From f04db050a583c9c01eb77766f830a0cf77b0a6c7 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v5 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's 180-second
timeout in case the apply worker exits immediately after startup. This
was happening because the launcher process's latch was getting reset
after the apply worker was started, which resulted in the launcher
waiting for the next 180-second timeout before restarting the apply worker.
Fixed this issue by not resetting the latch, as this latch is also set for
subscription modifications and apply worker exit. We should check whether a
new worker needs to be started and reset the latch in ApplyLauncherMain.
---
 src/backend/replication/logical/launcher.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..a754f2c757 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -191,7 +191,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -226,16 +225,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+						 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
-- 
2.34.1

v5-0001-Table-sync-missed-for-newly-added-tables-because-.patch (text/x-patch)
From 385c96a71396ea8efa86d6136bbf0bfe5282f1d1 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v5 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if an invalidation of the table sync state occurred
while the list of non-ready tables was being prepared. This occurred because
the table state was being set to valid at the end of non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations have occurred while the list is being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..9e24fb608c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

#11Alexander Lakhin
exclusion@gmail.com
In reply to: vignesh C (#10)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

08.02.2024 12:25, vignesh C wrote:

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
would delay the apply worker's start until the 180-second timeout
(DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset
the latch and lets ApplyLauncherMain reset the latch and check whether
any new or missing worker needs to be started.

Thank you for the updated patch!
I ran all the subscription tests in a loop (with the sleeps added as
before) and observed no failures and 180+ seconds duration.

Best regards,
Alexander

#12vignesh C
vignesh21@gmail.com
In reply to: Alexander Lakhin (#11)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Thu, 8 Feb 2024 at 23:30, Alexander Lakhin <exclusion@gmail.com> wrote:

08.02.2024 12:25, vignesh C wrote:

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
would delay the apply worker's start until the 180-second timeout
(DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset
the latch and lets ApplyLauncherMain reset the latch and check whether
any new or missing worker needs to be started.

Thank you for the updated patch!
I ran all the subscription tests in a loop (with the sleeps added as
before) and observed no failures and 180+ seconds duration.

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

#13Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#12)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following
the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense
when it was a bool, but now that it is a tri-state enum, it doesn't fit
well.

regards,
Ajin Cherian
Fujitsu Australia

#14vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#13)
2 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

Regards,
Vignesh

Attachments:

v6-0002-Apply-worker-will-get-started-after-180-seconds-b.patch (text/x-patch)
From ab68e343b4b3f7ec09f758089a9f90a16f21de42 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v6 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was only getting restarted after the launcher's
180-second timeout when the apply worker exited immediately after
startup. This happened because the launcher process's latch was reset
after the apply worker was started, which made the launcher wait for
the next 180-second timeout before starting the apply worker again.
Fixed this issue by not resetting the latch there, as this latch is also
set for subscription modifications and apply worker exit; ApplyLauncherMain
should check whether a new worker needs to be started and reset the latch.
---
 src/backend/replication/logical/launcher.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 66070e9131..f2545482c8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -185,7 +185,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -220,16 +219,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+						 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
-- 
2.34.1

v6-0001-Table-sync-missed-for-newly-added-tables-because-.patch (text/x-patch)
From 47c6cd886902693074ece4b2a1188dce6f066bf8 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v6 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if the invalidation of table sync occurred while
the list of non-ready tables was being prepared. This happened because
the table state was set to valid at the end of the non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..eefe620e5e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_validity != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_validity == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_validity = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8..bd551fde3f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2711,6 +2711,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

#15Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#14)
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the
steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense when
it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes for the
same.

Thanks for the patches.

I saw a recent similar BF error[1] which seems related to the issue the 0001
patch is trying to solve, i.e. the table sync worker is somehow not started
after refreshing the publication on the subscriber. I didn't see other related
ERRORs in the log, so I think the reason is the same as the one being discussed
in this thread: the table state invalidation got lost. The 0001 patch
looks good to me.

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2024-03-11%2000%3A52%3A42

Best Regards,
Hou zj

#16vignesh C
vignesh21@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#15)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the
steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense when
it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes for the
same.

Thanks for the patches.

I saw a recent similar BF error[1] which seems related to the issue that 0001
patch is trying to solve. i.e. The table sync worker is somehow not started
after refreshing the publication on the subscriber. I didn't see other related ERRORs in
the log, so I think the reason is the same as the one being discussed in this
thread, which is the table state invalidation got lost. And the 0001 patch
looks good to me.

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case, as
currently we will not be able to differentiate between worker
attach, subscription creation/modification and apply worker process
exit.

Regards,
Vignesh

#17Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#14)
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the
steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense when
it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes for the
same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to
the fix in this thread.

--
Intermittent failures in 040_standby_failover_slots_sync test
Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue in PG17, so I am
thinking of moving the item to the "Older bugs affecting stable
branches" section if there are no objections.

Best Regards,
Hou zj

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#16)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case, as
currently we will not be able to differentiate between worker
attach, subscription creation/modification and apply worker process
exit.

IIUC, even if we set the latch once the worker attaches, the other
set_latch by subscription creation/modification or apply_worker_exit
could also be consumed due to the reset of the latch in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable? The current
proposal can fix the issue but looks a bit ad hoc.

--
With Regards,
Amit Kapila.

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#14)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

--
With Regards,
Amit Kapila.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#18)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case, as
currently we will not be able to differentiate between worker
attach, subscription creation/modification and apply worker process
exit.

IIUC, even if we set the latch once the worker attaches, the other
set_latch by subscription creation/modification or apply_worker_exit
could also be consumed due to the reset of the latch in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable?

The other possibility is to have a GUC launcher_retry_time or
something like that instead of using DEFAULT_NAPTIME_PER_CYCLE. This
still may not resolve the issue if launcher_retry_time is long, but
at least users would be able to configure it. I am not sure if this is
a good idea; I am just trying to brainstorm different ideas to
solve this problem.

BTW, as far as I understand, this is an improvement in the existing
code, so should be done only for HEAD (probably PG18) and should be
discussed in a separate thread.

--
With Regards,
Amit Kapila.

#21vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#19)
2 attachment(s)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

This patch needs to be committed to master, PG16 and PG15.
It is not required in PG14 and earlier because those branches don't
have the HasSubscriptionRelations call before updating table_states_valid:
    /*
     * Does the subscription have tables?
     *
     * If there were not-READY relations found then we know it does. But
     * if table_states_not_ready was empty we still need to check again to
     * see if there are 0 tables.
     */
    has_subrels = (table_states_not_ready != NIL) ||
      HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called here.

Whereas for PG15 and newer versions this is applicable:
HasSubscriptionRelations calls the table_open function, which can invoke
the invalidation callback via:
HasSubscriptionRelations -> table_open -> relation_open ->
LockRelationOid -> AcceptInvalidationMessages ->
ReceiveSharedInvalidMessages -> LocalExecuteInvalidationMessage ->
CallSyscacheCallbacks -> invalidate_syncing_table_states

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Regards,
Vignesh

Attachments:

v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch (text/x-patch)
From 93116320bd9ffbcddc83a1fc253cba90c56cb47e Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v7] Table sync missed due to race condition in subscription
 cache handling.

Table sync was missed if the invalidation of table sync occurred while
the list of non-ready tables was being prepared. This happened because
the table state was set to valid at the end of the non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 81fff1919d..7d4ee48206 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -122,7 +122,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -272,7 +279,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1556,13 +1563,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_validity != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1596,7 +1605,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_validity == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_validity = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4ec490e362..5a9353c493 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2676,6 +2676,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch (text/x-patch)
From 461cfc99ac1d85d22629ec4880febb307460e4d5 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 24 Apr 2024 11:27:17 +0530
Subject: [PATCH v7] Table sync missed due to race condition in subscription
 cache handling.

Table sync was missed if the invalidation of table sync occurred while
the list of non-ready tables was being prepared. This happened because
the table state was set to valid at the end of the non-ready table
list preparation irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c3dd902f8c..e6159acba0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -120,7 +120,14 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -270,7 +277,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1465,13 +1472,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_validity != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1505,7 +1514,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (list_length(table_states_not_ready) > 0) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild it again on next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_validity == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_validity = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d030c485fd..78f3ec0caf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2635,6 +2635,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

#22Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: vignesh C (#21)
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com>

wrote:

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh

Thanks for the patch, I have verified that the fix works well by following
the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense when
it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the
changes for the same.

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

This patch needs to be committed to master, PG16 and PG15.
It is not required in PG14 and earlier because those branches don't
have the HasSubscriptionRelations call before updating table_states_valid:
    /*
     * Does the subscription have tables?
     *
     * If there were not-READY relations found then we know it does. But
     * if table_states_not_ready was empty we still need to check again to
     * see if there are 0 tables.
     */
    has_subrels = (table_states_not_ready != NIL) ||
      HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called here.

Whereas PG15 and newer versions are affected:
HasSubscriptionRelations calls the table_open function, which can invoke
the invalidation callback through this chain:
HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid
-> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages ->
LocalExecuteInvalidationMessage -> CallSyscacheCallbacks ->
invalidate_syncing_table_states

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Thanks, I have verified that the patches apply cleanly and fix the
issue on each branch. The regression tests also pass after applying the patch
on my machine.

Best Regards,
Hou zj

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#22)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Thanks, I have verified that the patches apply cleanly and fix the
issue on each branch. The regression tests also pass after applying the patch
on my machine.

Pushed.

--
With Regards,
Amit Kapila.

#24vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#20)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Wed, 24 Apr 2024 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and for apply worker exit. As
the handling of this latch for subscription creation/modification and
worker exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case, as we
currently cannot differentiate between a worker attaching, a
subscription creation/modification, and an apply worker exiting.

IIUC, even if we set the latch once the worker attaches, the other
SetLatch calls from subscription creation/modification or apply worker exit
could also be consumed by the latch reset in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable?

The other possibility is to have a GUC launcher_retry_time or
something like that instead of using DEFAULT_NAPTIME_PER_CYCLE. This
still may not resolve the issue if launcher_retry_time is long, but
at least users would be able to configure it. I am not sure if this is
a good idea or not; I am just brainstorming different ways to
solve this problem.

BTW, as far as I understand, this is an improvement in the existing
code, so should be done only for HEAD (probably PG18) and should be
discussed in a separate thread.

I have started a new thread at [1] to discuss this.
[1]: /messages/by-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_=eA@mail.gmail.com

Regards,
Vignesh

#25vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#23)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

On Thu, 25 Apr 2024 at 16:20, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Thanks, I have verified that the patches apply cleanly and fix the
issue on each branch. The regression tests also pass after applying the patch
on my machine.

Pushed.

Thanks for pushing this, I have marked the commitfest entry at [1] as committed.
[1]: https://commitfest.postgresql.org/48/4816/

Regards,
Vignesh