Failure of subscription tests with topminnow
Hi all,
topminnow has just failed in a weird fashion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
# SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR: syntax error at or near "FROM"
# LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na...
Looking at the logs, it seems like the problem boils down to an active
slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL
streaming: ERROR: replication slot "tap_sub" is active for PID 16336
There is only one place in 001_rep_changes.pl where this is used.
Thoughts?
--
Michael
On Mon, Aug 16, 2021 at 9:24 AM Michael Paquier <michael@paquier.xyz> wrote:
Hi all,
topminnow has just failed in a weird fashion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
# SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR: syntax error at or near "FROM"
# LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na...
Looking at the logs, it seems like the problem boils down to an active
slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL
streaming: ERROR: replication slot "tap_sub" is active for PID 16336
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.
2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.
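In TAP terms, a minimal sketch of that idea could look like the
following (assuming the test's usual $node_subscriber/$node_publisher
objects; $new_connstr is a placeholder for the new connection string,
not a variable from the actual test):

$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION tap_sub DISABLE");
# Wait until the old walsender has released the slot and disappeared
# from pg_stat_replication before changing the connection string.
$node_publisher->poll_query_until('postgres',
    "SELECT count(*) = 0 FROM pg_stat_replication WHERE application_name = 'tap_sub';"
) or die "Timed out while waiting for walsender to exit";
$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION tap_sub CONNECTION '$new_connstr'");
$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION tap_sub ENABLE");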
--
With Regards,
Amit Kapila.
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.
I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error in the logs that the
replication slot is active for the previous PID, the test case does not
fail in the way seen here.
The new walsender tries to acquire the slot but, seeing that another
PID is active on the slot, it errors out and exits. At no point does
this new failed walsender update its pid for the slot. As a result, the
polled query below would not return a true value:
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";
In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot. I am not able to explain why
this query returned true in the topminnow tap test; that would imply
that a walsender was able to acquire the slot and update its pid, but I
don't see how.
We also know that if this polled query returns true (implying a pid
other than $oldpid), then the next query in the test tries to identify
the pid:
SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';
From the topminnow logs we know that this query returned NULL, and the
value extracted from it is used to formulate the next query, which
errored out and eventually caused the test case to fail.
[16482:3] 001_rep_changes.pl ERROR: syntax error at or near "FROM"
at character 16
[16482:4] 001_rep_changes.pl STATEMENT: SELECT pid != FROM
pg_stat_replication WHERE application_name = 'tap_sub';
I am not an expert in perl, but I looked at the perl function used in
the tap test, poll_query_until(), which polls until the query returns
'true' (unless specified otherwise). I don't see in my tests that the
polled function exits if the query returns a NULL. I don't know if
differences in the installed perl can cause this difference in
behaviour. Can an empty (NULL) result set be construed as true and
cause the poll to exit?
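For reference, a simplified sketch of what poll_query_until() does
(the retry count, psql invocation, and error handling are abbreviated;
the real routine runs psql via IPC::Run and tolerates query errors):

sub poll_query_until
{
    my ($self, $dbname, $query, $expected) = @_;
    $expected = 't' unless defined($expected);    # default: one boolean row

    my $max_attempts = 180;
    foreach my $attempt (1 .. $max_attempts)
    {
        my $stdout = $self->safe_psql($dbname, $query);    # sketch only
        chomp($stdout);
        return 1 if ($stdout eq $expected);    # '' (zero rows) never equals 't'
        sleep 1;
    }
    return 0;    # timed out; callers typically "or die"
}

Since an empty result set yields '' rather than 't', it should never
match the expected value, which agrees with what I see in my runs.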
Any suggestions?
regards,
Ajin Cherian
Fujitsu Australia
On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.
I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error in the logs that the
replication slot is active for the previous PID, the test case does not
fail in the way seen here.
The new walsender tries to acquire the slot but, seeing that another
PID is active on the slot, it errors out and exits. At no point does
this new failed walsender update its pid for the slot. As a result, the
polled query below would not return a true value:
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot.
What happens when neither a new walsender nor an old walsender is
present? That is, run the above statement after a new walsender has
exited due to the error "... slot is active ..." and before another new
walsender could start. Also, allow the old walsender (due to which the
error occurs) to exit before executing the statement.
Also, it seems this failure happens on the REL_11_STABLE branch; not
sure if that matters, but it is better to use the same branch if you
are using a different branch to reproduce the issue.
--
With Regards,
Amit Kapila.
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
What happens when neither a new walsender nor an old walsender is
present? That is, run the above statement after a new walsender has
exited due to the error "... slot is active ..." and before another new
walsender could start. Also, allow the old walsender (due to which the
error occurs) to exit before executing the statement.
I tried this; the query then returns null instead of 'false' because
there is no entry for that application_name.
postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
?column?
----------
(0 rows)
Also, it seems this failure happens on the REL_11_STABLE branch; not
sure if that matters, but it is better to use the same branch if you
are using a different branch to reproduce the issue.
Ok, I didn't realise this. I will try this.
regards,
Ajin Cherian
Fujitsu Australia
On Tue, Aug 24, 2021 at 6:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
What happens when neither a new walsender nor an old walsender is
present? That is, run the above statement after a new walsender has
exited due to the error "... slot is active ..." and before another new
walsender could start. Also, allow the old walsender (due to which the
error occurs) to exit before executing the statement.
I tried this; the query then returns null instead of 'false' because
there is no entry for that application_name.
postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
?column?
----------
(0 rows)
But will the poll function still poll or exit? Have you tried that?
Because it is not clear from your explanation how in the first
statement it returns a valid value, which leads the poll to exit, and
then in the second statement it returns NULL or maybe nothing. Can you
share the log also when you are getting "replication slot "tap_sub" is
active for ..."? If you see in the below log [1], the STATEMENT is
printed twice; I want to see if you also get prints in a similar way
or is it something different? Do you know why it is printed twice?
[1]:
2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
--
With Regards,
Amit Kapila.
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
Because it is not clear from your explanation how in the first
statement it returns a valid value, which leads the poll to exit, and
then in the second statement it returns NULL or maybe nothing. Can you share
I don't have an explanation for it either. Maybe things are different
in REL_11_STABLE.
the log also when you are getting "replication slot "tap_sub" is
active for ..."? If you see in the below log [1], the STATEMENT is
printed twice; I want to see if you also get prints in a similar way
or is it something different? Do you know why it is printed twice?
Yes, I get the same. For every LOG or ERROR message, there is an
associated STATEMENT message with it.
First there is a LOG "received replication command" followed by the
STATEMENT and then the ERROR "replication slot .. is active.."
followed by the STATEMENT.
I will try with REL_11_STABLE and see if the behaviour is different.
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
I will try with REL_11_STABLE and see if the behaviour is different.
Okay, but I still suspect some timing issue here.
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
replication slot. I have checked this and explained it in my first
email.
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
replication slot. I have checked this and explained it in my first
email.
Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication, hence the
confusion. Amit is correct: just prior to the walsender erroring out,
it briefly appears in pg_stat_replication, and that is why this error
happens. Sorry for the confusion.
I just confirmed it; I got both walsenders stopped in the debugger:
postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
ajin 7896 3326 0 05:22 pts/2 00:00:00 psql -p 6972 postgres
ajin 7897 7882 0 05:22 ? 00:00:00 postgres: ajin
postgres [local] idle
ajin 7899 7882 0 05:22 ? 00:00:00 postgres: walsender
ajin ::1(37719) START_REPLICATION
ajin 7992 3647 0 05:24 ? 00:00:00 postgres: logical
replication worker for subscription 16426
ajin 7993 7882 0 05:24 ? 00:00:00 postgres: walsender
ajin ::1(37720) START_REPLICATION
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
replication slot. I have checked this and explained it in my first
email.
Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication, hence the
confusion.
Amit is correct: just prior to the walsender erroring out, it briefly
appears in pg_stat_replication, and that is why this error happens.
Sorry for the confusion.
I just confirmed it; I got both walsenders stopped in the debugger:
postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
IIUC the query[1]"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:
1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
If the fact that two walsenders with the same application_name could be
present in the pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.
[1]: "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
application_name = '$appname';"
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
replication slot. I have checked this and explained it in my first
email.
Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication, hence the
confusion.
Amit is correct: just prior to the walsender erroring out, it briefly
appears in pg_stat_replication, and that is why this error happens.
Sorry for the confusion.
I just confirmed it; I got both walsenders stopped in the debugger:
postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
IIUC the query [1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() return OK in this case even
though we expected one row of 't'? My guess of how this issue happened
is:
Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.
1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
passed (for some reason).
I think the reason for exit is that we get two rows with the same
application_name in pg_stat_replication.
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
Your understanding of the steps is the same as mine.
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
IIUC the query [1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() return OK in this case even
though we expected one row of 't'? My guess of how this issue happened
is:
1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
If the fact that two walsenders with the same application_name could be
present in the pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.
[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"
Yes, the query [1]"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" returns OK with a {f,t} or {t,f}
[1]: "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
application_name = '$appname';"
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will the poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get the error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't both the old and new walsenders be present at that stage, so the
query might return true? You can check that via a debugger by stopping
just before this error occurs and then checking the
pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
replication slot. I have checked this and explained it in my first
email.
Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication, hence the
confusion.
Amit is correct: just prior to the walsender erroring out, it briefly
appears in pg_stat_replication, and that is why this error happens.
Sorry for the confusion.
I just confirmed it; I got both walsenders stopped in the debugger:
postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
IIUC the query [1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() return OK in this case even
though we expected one row of 't'? My guess of how this issue happened
is:
Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
IIUC the query [1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() return OK in this case even
though we expected one row of 't'? My guess of how this issue happened
is:
1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
If the fact that two walsenders with the same application_name could be
present in the pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.
[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"
Yes, the query [1] returns OK with a {f,t} or {t,f}.
[1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
This means something different is going on in Ajin's setup. Ajin, can
you please share how you confirmed your findings about poll_query?
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how you confirmed your findings about poll_query?
Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and, attempting to acquire the slot,
finds that the slot is active for pid a.
3. Now both walsenders are active, so the poll query does not succeed.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsenders have exited.
8. This null return value results in the next query erroring out and
the test failing.
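To make step 5 concrete with the pids from the topminnow log (an
assumed reconstruction, not an actual run):

# At this instant only walsender 'b' (16475) is left in
# pg_stat_replication, so the comparison against the old pid
# evaluates to true and the poll exits.
my $res = $node_publisher->safe_psql('postgres',
    "SELECT pid != 16336 FROM pg_stat_replication WHERE application_name = 'tap_sub';");
# $res eq 't' here; moments later 'b' errors out as well, and the
# follow-up pid lookup returns no rows, leaving the new $oldpid empty.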
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?
Yes, will try this and post a patch tomorrow.
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how you confirmed your findings about poll_query?
Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and, attempting to acquire the slot,
finds that the slot is active for pid a.
3. Now both walsenders are active, so the poll query does not succeed.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsenders have exited.
8. This null return value results in the next query erroring out and
the test failing.
So this is slightly different from what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?
Yes, will try this and post a patch tomorrow.
Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".
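For reference, a check along these lines (pid and state are both
columns of pg_stat_replication) should show both sessions and their
states while the window is open:

# Sketch: while the race window is open this may show two rows, e.g.
#   16336|streaming   (old walsender, about to exit)
#   16475|startup     (new walsender, about to fail on the slot)
my $rows = $node_publisher->safe_psql('postgres',
    "SELECT pid, state FROM pg_stat_replication WHERE application_name = 'tap_sub';");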
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how you confirmed your findings about poll_query?
Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and, attempting to acquire the slot,
finds that the slot is active for pid a.
3. Now both walsenders are active, so the poll query does not succeed.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsenders have exited.
8. This null return value results in the next query erroring out and
the test failing.
So this is slightly different from what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).
Luckily these logs have the disconnection times of the tap test client
sessions as well (I am not sure why I don't see these when I run these
tests).
Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]
When the query starts, both walsenders are present, but when the query
completes, both walsenders are gone; the actual query evaluation could
have happened at any time in between. This is the rare timing window
that causes this problem.
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Luckily these logs have the disconnection times of the tap test client
sessions as well (I am not sure why I don't see these when I run these
tests).
Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]
When the query starts, both walsenders are present, but when the query
completes, both walsenders are gone; the actual query evaluation could
have happened at any time in between. This is the rare timing window
that causes this problem.
You have a point, but if we look at the logs below, it seems the second
walsender (step #6) exited before the first walsender (step #4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that the pid is cleared in the other order, due to
which we are seeing this problem?
--
With Regards,
Amit Kapila.
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
You have a point, but if we look at the logs below, it seems the second
walsender (step #6) exited before the first walsender (step #4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that the pid is cleared in the other order, due to
which we are seeing this problem?
If the pid is cleared in the other order, wouldn't the query [1] return
a false?
[1]: "SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
You have a point, but if we look at the logs below, it seems the second
walsender (step #6) exited before the first walsender (step #4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that the pid is cleared in the other order, due to
which we are seeing this problem?
If the pid is cleared in the other order, wouldn't the query [1] return
a false?
[1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
I think it should return true because the pid for 16336 is cleared first
and the remaining one will be 16475.
--
With Regards,
Amit Kapila.
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
You have a point, but if we look at the logs below, it seems the second
walsender (step #6) exited before the first walsender (step #4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that the pid is cleared in the other order, due to
which we are seeing this problem?
If the pid is cleared in the other order, wouldn't the query [1] return
a false?
[1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
I think it should return true because the pid for 16336 is cleared
first and the remaining one will be 16475.
Yes, that was what I explained as well. 16336 is PID 'a' (first
walsender) in my explanation. The first walsender should
be cleared first for this theory to work.
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
You have a point, but if we look at the logs below, it seems the second
walsender (step #6) exited before the first walsender (step #4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that the pid is cleared in the other order, due to
which we are seeing this problem?
If the pid is cleared in the other order, wouldn't the query [1] return
a false?
[1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
I think it should return true because the pid for 16336 is cleared
first and the remaining one will be 16475.
Yes, that was what I explained as well. 16336 is PID 'a' (first
walsender) in my explanation. The first walsender should be cleared
first for this theory to work.
I think it's possible that the order of the processes writing their
disconnection logs and the order of them setting the walsender's pid
to 0 are mismatched. We set the walsender's pid to 0 in WalSndKill(),
which is called from an on_shmem_exit callback. Once we set 0,
pg_stat_replication doesn't show the entry. On the other hand,
disconnection logs are written by log_disconnections(), which is
called from an on_proc_exit callback. That is, the following sequence
could happen:
1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid in the shmem.
3. the polling query checks the walsender's pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid in the shmem.
5. the second walsender writes its disconnection log.
6. the first walsender writes its disconnection log.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think it's possible that the order of the processes writing their
disconnection logs and the order of them setting the walsender's pid
to 0 are mismatched. We set the walsender's pid to 0 in WalSndKill(),
which is called from an on_shmem_exit callback. Once we set 0,
pg_stat_replication doesn't show the entry. On the other hand,
disconnection logs are written by log_disconnections(), which is
called from an on_proc_exit callback. That is, the following sequence
could happen:
1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid in the shmem.
3. the polling query checks the walsender's pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid in the shmem.
5. the second walsender writes its disconnection log.
6. the first walsender writes its disconnection log.
I agree with this.
Attaching a patch on HEAD that modifies this particular script to also
consider the state of the walsender.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From 021e7b92fab08b626c254d75df11a66513232019 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 26 Aug 2021 06:37:36 -0400
Subject: [PATCH v1] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0c84d87..03cf3e3 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -424,7 +424,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -434,7 +434,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -488,7 +488,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
--
1.8.3.1
On Thu, Aug 26, 2021 at 4:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attaching a patch on HEAD that modifies this particular script to also
consider the state of the walsender.
I think the fix is correct but similar changes are required in
022_twophase_cascade.pl as well (search for $oldpid in tests). I am
not completely sure but I think it is better to make this test change
in back branches as well to make it stable and reduce such random
build farm failures.
--
With Regards,
Amit Kapila.
On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think the fix is correct but similar changes are required in
022_twophase_cascade.pl as well (search for $oldpid in tests). I am
not completely sure but I think it is better to make this test change
in back branches as well to make it stable and reduce such random
build farm failures.
I have made the changes in 022_twophase_cascade.pl for HEAD as well as
patches for older branches.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
HEAD-v2-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From 7b7db873ac9520b31000f0662ca389779a355aa5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 06:21:00 -0400
Subject: [PATCH v2] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
src/test/subscription/t/022_twophase_cascade.pl | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0c84d87..03cf3e3 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -424,7 +424,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -434,7 +434,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -488,7 +488,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index adda874..a38c5e7 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -235,10 +235,10 @@ is($result, qq(21), 'Rows committed are present on subscriber C');
my $oldpid_B = $node_A->safe_psql('postgres', "
SELECT pid FROM pg_stat_replication
- WHERE application_name = '$appname_B';");
+ WHERE application_name = '$appname_B' AND state = 'streaming';");
my $oldpid_C = $node_B->safe_psql('postgres', "
SELECT pid FROM pg_stat_replication
- WHERE application_name = '$appname_C';");
+ WHERE application_name = '$appname_C' AND state = 'streaming';");
# Setup logical replication (streaming = on)
@@ -253,11 +253,11 @@ $node_C->safe_psql('postgres', "
$node_A->poll_query_until('postgres', "
SELECT pid != $oldpid_B FROM pg_stat_replication
- WHERE application_name = '$appname_B';"
+ WHERE application_name = '$appname_B' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_B->poll_query_until('postgres', "
SELECT pid != $oldpid_C FROM pg_stat_replication
- WHERE application_name = '$appname_C';"
+ WHERE application_name = '$appname_C' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
###############################
--
1.8.3.1
REL-13-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From f78bc4bb91563675047c7a817804ad61ac89a953 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 06:53:33 -0400
Subject: [PATCH v1] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..0b61d39 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -265,7 +265,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -275,7 +275,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -329,7 +329,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
--
1.8.3.1
REL-12-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From f78bc4bb91563675047c7a817804ad61ac89a953 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 06:53:33 -0400
Subject: [PATCH v1] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..0b61d39 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -265,7 +265,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -275,7 +275,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -329,7 +329,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
--
1.8.3.1
REL-10-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From 629680dac0c8ea2ad34dedc3b32d627a64080029 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 03:35:05 -0400
Subject: [PATCH v1] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 5ac50e1..9c4e77b 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -258,7 +258,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION 'application_name=$appname $publisher_connstr'"
);
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -268,7 +268,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -319,7 +319,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
--
1.8.3.1
Attachment: REL-11-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
From 6ca0d97db504a13f8e9d4debccc6021d7dff48da Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 05:09:14 -0400
Subject: [PATCH v1] fix for tap test failure seen in 001_rep_changes
When a walsender restarts, the new walsender might start before the previous walsender
relinquishes the slot. This will cause the new walsender to fail to acquire the slot
and fail immediately. This can cause the script to wrongly assume that the new walsender
has started. Then when the script tries to get the pid of the walsender and gets a NULL,
further queries will fail with an error and the test fails. This fix adds a check
to confirm that the new walsender reaches a 'streaming' state before continuing.
---
src/test/subscription/t/001_rep_changes.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 9ca0acb..aa46fd1 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -266,7 +266,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION 'application_name=$appname $publisher_connstr'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
@@ -276,7 +276,7 @@ $node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -330,7 +330,7 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
--
1.8.3.1
Attachment: REL-14-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
On Fri, Aug 27, 2021 at 4:33 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think the fix is correct but similar changes are required in
022_twophase_cascade.pl as well (search for $oldpid in tests). I am
not completely sure but I think it is better to make this test change
in back branches as well to make it stable and reduce such random
build farm failures.
I have made the changes in 022_twophase_cascade.pl for HEAD as well as
patches for older branches.
Isn't it better to check the streaming state when we are fetching
oldpid? If we don't add it, then I suspect that the next time someone
adds tests along similar lines they might get confused about where to
check the state and where not. Also, if you agree, add some comments
before the test on why it is important to check states.
For example, in the queries below, the ones used to fetch $oldpid.
my $oldpid = $node_publisher->safe_psql('postgres',
"SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
"SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only
WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
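In other words, the $oldpid fetches above would gain the same state
filter as the polls. A sketch of that change (the v3 patch below ends
up doing exactly this):

my $oldpid = $node_publisher->safe_psql('postgres',
    "SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub' AND state = 'streaming';"
);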
--
With Regards,
Amit Kapila.
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Isn't it better to check the streaming state when we are fetching
oldpid? If we don't add it, then I suspect that the next time someone
adds tests along similar lines they might get confused about where to
check the state and where not. Also, if you agree, add some comments
before the test on why it is important to check states.
For example, in the queries below, the ones used to fetch $oldpid.
my $oldpid = $node_publisher->safe_psql('postgres',
"SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
"SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only
WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
I have made the above changes on HEAD.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
Attachment: v3-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From ed1c90cf2cd328be9fb775783961d29d5d450ff5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 06:46:33 -0400
Subject: [PATCH v3] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 20 +++++++++++---------
src/test/subscription/t/022_twophase_cascade.pl | 8 ++++----
2 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0c84d87..f19e4d2 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -414,27 +414,29 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with application_name
+# to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';;"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_publisher->safe_psql('postgres',
@@ -483,12 +485,12 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
# check all the cleanup
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index adda874..a38c5e7 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -235,10 +235,10 @@ is($result, qq(21), 'Rows committed are present on subscriber C');
my $oldpid_B = $node_A->safe_psql('postgres', "
SELECT pid FROM pg_stat_replication
- WHERE application_name = '$appname_B';");
+ WHERE application_name = '$appname_B' AND state = 'streaming';");
my $oldpid_C = $node_B->safe_psql('postgres', "
SELECT pid FROM pg_stat_replication
- WHERE application_name = '$appname_C';");
+ WHERE application_name = '$appname_C' AND state = 'streaming';");
# Setup logical replication (streaming = on)
@@ -253,11 +253,11 @@ $node_C->safe_psql('postgres', "
$node_A->poll_query_until('postgres', "
SELECT pid != $oldpid_B FROM pg_stat_replication
- WHERE application_name = '$appname_B';"
+ WHERE application_name = '$appname_B' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
$node_B->poll_query_until('postgres', "
SELECT pid != $oldpid_C FROM pg_stat_replication
- WHERE application_name = '$appname_C';"
+ WHERE application_name = '$appname_C' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";
###############################
--
1.8.3.1
On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I have made the above changes on HEAD.
Thanks, this looks mostly good to me. I'll push and backpatch this
tomorrow unless you or someone else thinks otherwise.
Minor comments
==============
1.
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub' AND state = 'streaming';;"
);
An extra semicolon at the end of the statement.
2.
+# restart of subscription workers. We check the state along with application_name
+# to ensure that the walsender is (re)started.
It is better to move application_name to the second line of the comment
above, as that will make the line look a bit more consistent with the
other comments (a sketch follows these comments).
3. In the commit message, the text: "The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state
and The check to test whether the subscription workers were restarting
after a change in the subscription was failing." seems to be
repeated/redundant.
4. Kindly submit the patches for back-branches.
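For comment 2, a sketch of the rewrapped comment (the layout the v4
patch below uses):

# check that change of connection string and/or publication list causes
# restart of subscription workers. We check the state along with
# application_name to ensure that the walsender is (re)started.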
--
With Regards,
Amit Kapila.
On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I have made the above changes on HEAD.
Thanks, this looks mostly good to me. I'll push and backpatch this
tomorrow unless you or someone else thinks otherwise.

Minor comments
==============
1.
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';;"
);

An extra semicolon at the end of the statement.

2.
+# restart of subscription workers. We check the state along with application_name
+# to ensure that the walsender is (re)started.

It is better to move application_name to the second line of the comment
above, as that will make the line look a bit more consistent with the
other comments.

3. In the commit message, the text: "The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state
and The check to test whether the subscription workers were restarting
after a change in the subscription was failing." seems to be
repeated/redundant.

4. Kindly submit the patches for back-branches.
The patch, with the above comments addressed, looks good to me. One
minor suggestion is to change the two "die" messages to make the
investigation a bit easier. For example,
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
CONNECTION";
and
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I have made the above changes on HEAD.
Thanks, this looks mostly good to me. I'll push and backpatch this
tomorrow unless you or someone else thinks otherwise.

Minor comments
==============
1.
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';;"
);

An extra semicolon at the end of the statement.

2.
+# restart of subscription workers. We check the state along with application_name
+# to ensure that the walsender is (re)started.

It is better to move application_name to the second line of the comment
above, as that will make the line look a bit more consistent with the
other comments.

3. In the commit message, the text: "The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state
and The check to test whether the subscription workers were restarting
after a change in the subscription was failing." seems to be
repeated/redundant.

4. Kindly submit the patches for back-branches.

The patch, with the above comments addressed, looks good to me. One
minor suggestion is to change the two "die" messages to make the
investigation a bit easier. For example,

$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
CONNECTION";

and

$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Regards,
Thanks Masahiko-san. I have included this change and made a new patch-set.
Hi Amit,
I have included your comments as well and also attached the patches
for the back-branches.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
Attachment: head-v4-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From 38ac7093104cebca40248c2c9c807584489184fd Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 31 Aug 2021 08:05:16 -0400
Subject: [PATCH v4] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0c84d87..d89875a 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -414,28 +414,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -483,13 +485,13 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
Attachment: REL-10-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From afcf5345b0540b5e6e8476eee1c9263151a753a7 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 31 Aug 2021 06:43:26 -0400
Subject: [PATCH v2] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 5ac50e1..5decd8d 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -248,28 +248,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION 'application_name=$appname $publisher_connstr'"
);
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -314,13 +316,13 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
-"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
Attachment: REL-11-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From a2ad6bf2317786a30c537adedaf2723eae89e278 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 31 Aug 2021 07:34:40 -0400
Subject: [PATCH v2] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 9ca0acb..47c9580 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -256,28 +256,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION 'application_name=$appname $publisher_connstr'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -330,8 +332,8 @@ $oldpid = $node_publisher->safe_psql('postgres',
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
Attachment: REL-12-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From df4c57747fd53245369c74675aa5dda1ae976628 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 27 Aug 2021 06:53:33 -0400
Subject: [PATCH v2] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..79f1db5 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -255,28 +255,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -324,13 +326,13 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
Attachment: REL-13-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From d107ae5513a19265777e53d4441724f4b4e5b61a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 31 Aug 2021 06:15:16 -0400
Subject: [PATCH v2] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 311cca8..c3be32e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -292,28 +292,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -361,13 +363,13 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
Attachment: REL-14-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
From 56df57e40d1f94f8cdd029ef725b9a657641b7cf Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 31 Aug 2021 08:34:31 -0400
Subject: [PATCH v2] Fix the random test failure in 001_rep_changes.
The check to test whether the subscription workers were restarting after a
change in the subscription was failing. The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state and
the walsender was exiting due to an error before that. Now, the walsender
was erroring out before reaching the 'streaming' state because it tries to
acquire the slot before the previous walsender has exited.
Reported-by: Michael Paquier, as per buildfarm
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Backpatch-through: 10, where this test was introduced
Discussion: https://postgr.es/m/YRnhFxa9bo73wfpV@paquier.xyz
---
src/test/subscription/t/001_rep_changes.pl | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index dee5f5c..7dd69ca 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -414,28 +414,30 @@ is( $result, qq(11.11|baz|1
'update works with dropped subscriber column');
# check that change of connection string and/or publication list causes
-# restart of subscription workers. Not all of these are registered as tests
-# as we need to poll for a change but the test suite will fail none the less
-# when something goes wrong.
+# restart of subscription workers. We check the state along with
+# application_name to ensure that the walsender is (re)started.
+#
+# Not all of these are registered as tests as we need to poll for a change
+# but the test suite will fail none the less when something goes wrong.
my $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing CONNECTION";
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing PUBLICATION";
$node_publisher->safe_psql('postgres',
"INSERT INTO tab_ins SELECT generate_series(1001,1100)");
@@ -483,13 +485,13 @@ is($result, qq(21|0|100), 'check replicated insert after alter publication');
# check restart on rename
$oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub RENAME TO tap_sub_renamed");
$node_publisher->poll_query_until('postgres',
- "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed';"
-) or die "Timed out while waiting for apply to restart";
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
# check all the cleanup
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
--
1.8.3.1
On Tue, Aug 31, 2021 at 6:13 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks Masahiko-san. I have included this change and made a new patch-set.
Hi Amit,
I have included your comments as well and also attached the patches
for the back-branches.
You forgot to make the changes in 022_twophase_cascade.pl in the HEAD
version of the patch. I have made the required changes and pushed it.
Thanks to you and Sawada-San for diagnosing and fixing this problem.
--
With Regards,
Amit Kapila.