Race conditions in 019_replslot_limit.pl
While looking at recent failures in the new 028_pitr_timelines.pl
recovery test, I noticed that there have been a few failures in the
buildfarm in the recoveryCheck phase even before that, in the
019_replslot_limit.pl test.
For example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00
usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
The latter looks like there are two walsenders active, which confuses
the test. Not sure what's happening in the first case, but looks like
some kind of a race condition at a quick glance.
Has anyone looked into these yet?
- Heikki
Hi,
On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr
0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
The relevant bit from the log is:
2022-02-14 07:42:27.817 CET [6209f9d3.68bab:3] LOG: database system is ready to accept read-only connections
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot-based conflict. I think
that should be fixable by reserving the slot at creation time?
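A minimal sketch of that fix (hypothetical; the slot name 'rep3' is taken from the failing test, and the second argument of pg_create_physical_replication_slot, immediately_reserve, is what makes the slot hold onto WAL from creation time):

```sql
-- hedged sketch: create the slot with WAL reserved immediately, so the
-- primary cannot recycle segments before the standby's first connection
SELECT pg_create_physical_replication_slot('rep3', true);
```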
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
The latter looks like there are two walsenders active, which confuses the
test.
Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...
%c apparently is hex(process startup time).hex(pid), so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
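As a quick sanity check (a throwaway sketch, not part of any test; the session id 620b8a1c.382f58 is copied from the log lines above), the %c value decodes like this:

```python
# Decode a log_line_prefix %c session id: hex(backend start time).hex(pid).
# The example value is taken from the serinus log above.
from datetime import datetime, timezone

session_id = "620b8a1c.382f58"
start_hex, pid_hex = session_id.split(".")

start = datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc)
pid = int(pid_hex, 16)

print(start.isoformat())  # 2022-02-15T11:10:20+00:00, i.e. 12:10:20 CET
print(pid)                # 3682136, matching pg_basebackup_3682136
```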
The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Has anyone looked into these yet?
Hadn't yet...
Greetings,
Andres Freund
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?
Agreed. Doing this at all slot creations seems fine.
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
...
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
It seems to me so, too.
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Has anyone looked into these yet?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot-based conflict. I think
that should be fixable by reserving the slot at creation time?
Agreed. Doing this at all slot creations seems fine.
...
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Or it's possible that the process took a time to clean up the
temporary replication slot?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Agreed. Doing this at all slot creations seems fine.
Done in the attached. The first slot is deliberately created
unreserved so I changed the code to re-create the slot with "reserved"
before taking backup.
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
It seems to me so, too.
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
s/ since//;
Whatever is happening at that time, I can make sure that the walsender is
gone before making a new replication connection, even though it
doesn't "fix" any of the observed issues.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
make_019_replslot_limit_more_robust.txt (text/plain; charset=us-ascii), +15 -2
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Or it's possible that the process took a time to clean up the
temporary replication slot?
Ugh, it's an immediate shutdown. So postmaster kills the walsender no
matter what state the walsender is in, and leaves no log about the
end of the walsender.
Sorry for the confusion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
Or a missing CV broadcast? Anyway, I haven't found a scenario of
interference between checkpointer and walsender.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote:
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
A deadlock requires some form of incorrect lock (or lock-like) nesting. Do
you have an idea what that could be?
That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.
I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.
It'd be good try to reproduce this locally...
- Andres
Andres Freund <andres@anarazel.de> writes:
I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.
I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331. The two PIDs reported in the failure
correspond to this postmaster log trace:
2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG: received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
and this one:
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG: replication connection authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG: received replication command: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that? But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?
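A sketch of that more selective query (hypothetical; application_name=standby_3 is the value seen in the standby's connection log above):

```sql
-- hedged sketch: restrict the lookup to the standby's walsender so a
-- lingering pg_basebackup walsender doesn't produce a second pid
SELECT pid
FROM pg_stat_activity
WHERE backend_type = 'walsender'
  AND application_name = 'standby_3';
```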
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04
Hi,
On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
The thing is, we actually *do* log disconnection for walsenders:
2022-02-16 17:53:55.780 PST [2459806][walsender][:0] LOG: disconnection: session time: 0:00:01.321 user=andres database= host=[local]
The only way I see for the disconnection not to be logged is an immediate
shutdown.
It'd be one thing if the missing walsender disconnect wasn't logged because we
shut down just after. But unless I misunderstand something, between the
basebackup and the failing test, we actually start standby_3:
# Running: pg_basebackup -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_primary3_data/backup/my_backup -h /tmp/hyrtnyd2RU -p 53774 --checkpoint fast --no-sync
# Backup finished
...
# Running: pg_ctl -w -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_standby_3_data/pgdata -l /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/log/019_replslot_limit_standby_3.log -o --cluster-name=standby_3 start
and also wait till replication has caught up:
Waiting for replication conn standby_3's replay_lsn to pass 0/700000 on primary3
done
and only then we have
not ok 17 - have walsender pid 3682154
# 3682136
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
The thing is, we actually *do* log disconnection for walsenders:
Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.
Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.
Fair point. Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started. I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is
sysname | branch | snapshot | stage | l
--------------+--------+---------------------+---------------+---------------------------------------------
desmoxytes | HEAD | 2022-02-15 04:42:05 | recoveryCheck | # Failed test 'have walsender pid 1685516
idiacanthus | HEAD | 2022-02-15 07:24:05 | recoveryCheck | # Failed test 'have walsender pid 2758549
serinus | HEAD | 2022-02-15 11:00:08 | recoveryCheck | # Failed test 'have walsender pid 3682154
desmoxytes | HEAD | 2022-02-15 11:04:05 | recoveryCheck | # Failed test 'have walsender pid 3775359
flaviventris | HEAD | 2022-02-15 18:03:48 | recoveryCheck | # Failed test 'have walsender pid 1517077
idiacanthus | HEAD | 2022-02-15 22:48:05 | recoveryCheck | # Failed test 'have walsender pid 2494972
desmoxytes | HEAD | 2022-02-15 23:48:04 | recoveryCheck | # Failed test 'have walsender pid 3055399
desmoxytes | HEAD | 2022-02-16 10:48:05 | recoveryCheck | # Failed test 'have walsender pid 1593461
komodoensis | HEAD | 2022-02-16 21:16:04 | recoveryCheck | # Failed test 'have walsender pid 3726703
serinus | HEAD | 2022-02-17 01:18:17 | recoveryCheck | # Failed test 'have walsender pid 208363
So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
regards, tom lane
I wrote:
So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
Eyeballing the commit log for potential causes in that time frame,
I can't help noticing
2 days ago Andres Freund Move replication slot release to before_shmem_exit().
regards, tom lane
Hi,
On 2022-02-16 22:11:30 -0500, Tom Lane wrote:
So (a) it broke around 48 hours ago, which is already a useful
bit of info
Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit
to blame.
We document before_shmem_exit hooks as
/*
* Call before_shmem_exit callbacks.
*
* These should be things that need most of the system to still be up and
* working, such as cleanup of temp relations, which requires catalog
* access; or things that need to be completed because later cleanup steps
* depend on them, such as releasing lwlocks.
*/
and several before_shmem_exit callbacks use lwlocks.
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
Flaviventris, serinus use the newest snapshot of gcc available in debian, one
with -O0, the other with O3.
desmoxytes, idiacanthus, komodoensis all have JIT forced for every query.
They all run on the same host - looking at stats it doesn't look crazily
overcommitted or such. But it might have more scheduler variance than most?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks. I wonder
though if an exiting walsender would always take that path.
Maybe it'd be good if we released all LWLocks even if we
don't think we're in a transaction?
More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?
regards, tom lane
Hi,
On 2022-02-16 19:46:26 -0800, Andres Freund wrote:
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to
take care of that.
I checked and walsenders do have ShutdownPostgres() and I don't see anything
else that's missing it either.
I'll run the test in a loop, perhaps I can reproduce...
Greetings,
Andres Freund
Hi,
On 2022-02-16 22:58:19 -0500, Tom Lane wrote:
I wonder though if an exiting walsender would always take that path.
You're right - the CFI() in PostgresMain() and several replication commands are
not at all guaranteed to be in a transaction when throwing a FATAL.
I don't quite see where we could be throwing an error with a relevant lwlock
held yet though.
Maybe it'd be good if we released all LWLocks even if we don't think we're
in a transaction?
Yea, probably a good idea. I think there are a few hooks that could
conceivably be in trouble otherwise.
More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?
Yea, I agree, it probably could be done later. It's not super obvious when
though. If e.g. somebody wrote a non-database connected bgworker to
synchronize slot state with another node (for failover, there's even patches
for it), it'd be just after BaseInit(). Additionally, it was previously in
ProcKill(), so moving it to something registered early-ish seemed sensible.
Greetings,
Andres Freund
Hi,
On 2022-02-16 20:03:00 -0800, Andres Freund wrote:
I'll run the test in a loop, perhaps I can reproduce...
I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.
There've not been any new reports in the last 18 hours, but that's probably
just due to lower commit activity triggering fewer runs.
Greetings,
Andres Freund
Hi,
On 2022-02-17 16:34:34 -0800, Andres Freund wrote:
I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.
Playing around with this further I did get into one interesting state:
I started psql with replication=1, created a slot
SELECT pg_drop_replication_slot('t');
SELECT pg_create_physical_replication_slot('t', true);
which was at 0/200002C8 at this point
START_REPLICATION SLOT "t" 0/200002C8 TIMELINE 1
then did
SELECT pg_switch_wal();
checkpoint;
in another session, which tried to invalidate that slot because of
max_slot_wal_keep_size=1MB
At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.
At this point checkpointer is stuck, because it is waiting for the connection
to end.
Quitting the "psql walsender" or terminating it again resolves the issue.
Not sure how this could be related, but it seems interesting.
"psql walsender":
(gdb) bt
#0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834
#14 0x00007faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 "START_REPLICATION SLOT \"t\" 0/1B000000 TIMELINE 1\n;")
at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771
#15 0x00007faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", username=0x7faf4c22fcc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494
#16 0x00007faf4b830e7e in BackendRun (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593
#17 0x00007faf4b83075f in BackendStartup (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321
#18 0x00007faf4b82c55b in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801
#19 0x00007faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473
checkpointer:
(gdb) bt
#0 0x00007f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472
#4 0x00007f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:163
#5 0x00007f0d5018fd6c in ConditionVariableSleep (cv=0x7f0d4c342268, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:100
#6 0x00007f0d50132525 in InvalidatePossiblyObsoleteSlot (s=0x7f0d4c342188, oldestLSN=553648128, invalidated=0x7fffe7c5f577)
at /home/andres/src/postgresql/src/backend/replication/slot.c:1290
#7 0x00007f0d501326af in InvalidateObsoleteReplicationSlots (oldestSegno=33) at /home/andres/src/postgresql/src/backend/replication/slot.c:1359
#8 0x00007f0d4fd84072 in CreateCheckPoint (flags=108) at /home/andres/src/postgresql/src/backend/access/transam/xlog.c:6620
#9 0x00007f0d500cd278 in CheckpointerMain () at /home/andres/src/postgresql/src/backend/postmaster/checkpointer.c:445
#10 0x00007f0d500cad7a in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/auxprocess.c:153
#11 0x00007f0d500e034c in StartChildProcess (type=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5531
#12 0x00007f0d500dacf1 in PostmasterMain (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1458
#13 0x00007f0d4ffcfc19 in main (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/main/main.c:202
Greetings,
Andres Freund