Failure of subscription tests with topminnow
Hi all,
topminnow has just failed in a weird fashion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
# SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR: syntax error at or near "FROM"
# LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na...
Looking at the logs, it seems like the problem boils down to an active
slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL
streaming: ERROR: replication slot "tap_sub" is active for PID 16336
There is only one place in 001_rep_changes.pl where this is used.
Thoughts?
--
Michael
On Mon, Aug 16, 2021 at 9:24 AM Michael Paquier <michael@paquier.xyz> wrote:
Hi all,
topminnow has just failed in a weird fashion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
# SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR: syntax error at or near "FROM"
# LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na...Looking at the logs, it seems like the problem boils down to an active
slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL
streaming: ERROR: replication slot "tap_sub" is active for PID 16336
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.
2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.
--
With Regards,
Amit Kapila.
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.
I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error
in the logs that the replication slot is active for the previous PID,
the test case does not fail in the way seen in this case here..
The new walsender tries to acquire the slot but seeing that there is
another PID that is active on the slot, it
errors and exits. At no point does this new failed walsender update
its pid for the slot. As a result, the below polled query
would not return a true value
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";
In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot.
I am not able to explain why this query returned true in the topminnow
tap test. This would imply that a walsender
was able to acquire the slot and update its pid but I don't see how.
We also know that if this polled query returns
a true (implying a pid other than $oldpid), then the next query in the
test is to try and identify the pid:
SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';
From the topminnow logs we know that this query returned a "NULL", as
the value extracted from this is used
to formulate the next query which errored out and eventually caused
the test case to fail.
[16482:3] 001_rep_changes.pl ERROR: syntax error at or near "FROM"
at character 16
[16482:4] 001_rep_changes.pl STATEMENT: SELECT pid != FROM
pg_stat_replication WHERE application_name = 'tap_sub';
I am not an expert in perl but I looked at the perl function used in
the tap test poll_query_until(), this would poll until the query
returns a 'true' (unless specified otherwise).
I don't see in my tests that the polled function exits if the query
returns a NULL. I don't know if differences in the installed perl can
cause this difference in
behaviour. Can a NULL set be construed as a true and cause the poll to exit?
Any suggestions?
regards,
Ajin Cherian
Fujitsu Australia
On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error
in the logs that the replication slot is active for the previous PID,
the test case does not fail in the way seen in this case here..The new walsender tries to acquire the slot but seeing that there is
another PID that is active on the slot, it
errors and exits. At no point does this new failed walsender update
its pid for the slot. As a result, the below polled query
would not return a true value$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot.
What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.
Also, it seems this failure happens on REL_11_STABLE branch, not sure
if that matters, but it is better to use the same branch if you are
using a different branch to reproduce the issue.
--
With Regards,
Amit Kapila.
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.
I tried this, then the query returns a null instead of 'false' because
there is no entry for that application_name.
postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
?column?
----------
(0 rows)
Also, it seems this failure happens on REL_11_STABLE branch, not sure
if that matters, but it is better to use the same branch if you are
using a different branch to reproduce the issue.
Ok, I didn't realise this. I will try this.
regards,
Ajin Cherian
Fujitsu Australia
On Tue, Aug 24, 2021 at 6:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.I tried this, then the query returns a null instead of 'false' because
there is no entry for that application_name.postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
?column?
----------
(0 rows)
But will poll function still poll or exit? Have you tried that?
Because it is not clear from your explanation how in the first
statement it returns a valid value which leads poll to exit and then
in second statement it returns NULL or maybe nothing. Can you share
the log also when you are getting "replication slot "tap_sub" is
active for ..."? If you see in the below log [1]2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication slot "tap_sub" is active for PID 16336 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"'), the STATEMENT is
printed twice, I want to see if you also get prints in a similar way
or is it something different? Do you know why it is printed twice?
[1]: 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication slot "tap_sub" is active for PID 16336 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
--
With Regards,
Amit Kapila.
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
Because it is not clear from your explanation how in the first
statement it returns a valid value which leads poll to exit and then
in second statement it returns NULL or maybe nothing. Can you share
I don't have an explanation for it either. Maybe things are different
in REL_11_STABLE
the log also when you are getting "replication slot "tap_sub" is
active for ..."? If you see in the below log [1], the STATEMENT is
printed twice, I want to see if you also get prints in a similar way
or is it something different? Do you know why it is printed twice?
Yes, I get the same. For every "LOG or ERROR" message, there is the
associated STATEMENT message with it.
First there is a LOG "received replication command" followed by the
STATEMENT and then the ERROR "replication slot .. is active.."
followed by the STATEMENT.
I will try with REL_11_STABLE and see if the behaviour is different.
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.
What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.
I will try with REL_11_STABLE and see if the behaviour is different.
Okay, but I still suspect some timing issue here.
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.
If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itself
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itself
Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:
postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
ajin 7896 3326 0 05:22 pts/2 00:00:00 psql -p 6972 postgres
ajin 7897 7882 0 05:22 ? 00:00:00 postgres: ajin
postgres [local] idle
ajin 7899 7882 0 05:22 ? 00:00:00 postgres: walsender
ajin ::1(37719) START_REPLICATION
ajin 7992 3647 0 05:24 ? 00:00:00 postgres: logical
replication worker for subscription 16426
ajin 7993 7882 0 05:24 ? 00:00:00 postgres: walsender
ajin ::1(37720) START_REPLICATION
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itselfSorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)
IIUC the query[1]"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:
1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
If the fact that two walsender with the same application_name could
present in pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.
[1]: "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
application_name = '$appname';"
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itselfSorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:
Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.
1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
passed (for some reason).
I think the reason for exit is that we get two rows with the same
application_name in pg_stat_replication.
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.
Your understanding of steps is the same as mine.
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.If the fact that two walsender with the same application_name could
present in pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"
Yes, the query [1]"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" returns OK with a {f,t} or {t,f}
[1]: "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';"
application_name = '$appname';"
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
But will poll function still poll or exit? Have you tried that?
I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.What if the query in a poll is fired just before we get an error
"tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itselfSorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:postgres=# select pid from pg_stat_replication where application_name = 'sub';
pid
------
7899
7993
(2 rows)IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));
The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.If the fact that two walsender with the same application_name could
present in pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"Yes, the query [1] returns OK with a {f,t} or {t,f}
[1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
This means something different is going on in Ajin's setup. Ajin, can
you please share how did you confirm your findings about poll_query?
--
With Regards,
Amit Kapila.
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.
Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how did you confirm your findings about poll_query?
Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and attempts at acquiring the slot
finds that the slot is active for pid a.
3. Now both walsenders are active, the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsender exits.
8. This null return value results in the next query erroring out and
the test failing.
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?
Yes, will try this and post a patch tomorrow.
regards,
Ajin Cherian
Fujitsu Australia
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how did you confirm your findings about poll_query?Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and attempts at acquiring the slot
finds that the slot is active for pid a.
3. Now both walsenders are active, the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsender exits.
8. This null return value results in the next query erroring out and
the test failing.
So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).
Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?Yes, will try this and post a patch tomorrow.
Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I did a quick check with the following tap test code:
$node_publisher->poll_query_until('postgres',
qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.Yes, this is true, I also see the same behaviour.
This means something different is going on in Ajin's setup. Ajin, can
you please share how did you confirm your findings about poll_query?Relooking at my logs, I think what happens is this:
1. First walsender 'a' is running.
2. Second walsender 'b' starts and attempts at acquiring the slot
finds that the slot is active for pid a.
3. Now both walsenders are active, the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsender exits.
8. This null return value results in the next query erroring out and
the test failing.So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).
Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).
Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]
When the query starts both walsenders are present but when the query
completes both walsenders are gone, the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]When the query starts both walsenders are present but when the query
completes both walsenders are gone, the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.
You have a point but if we see the below logs, it seems the second
walsender (#step6) seemed to exited before the first walsender
(#step4).
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]
Isn't it possible that pid is cleared in the other order due to which
we are seeing this problem?
--
With Regards,
Amit Kapila.