Race conditions in 019_replslot_limit.pl

Started by Heikki Linnakangas almost 4 years ago. 42 messages.
#1 Heikki Linnakangas <hlinnaka@iki.fi>

While looking at recent failures in the new 028_pitr_timelines.pl
recovery test, I noticed that there have been a few failures in the
buildfarm in the recoveryCheck phase even before that, in the
019_replslot_limit.pl test.

For example:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04

[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00
usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses
the test. Not sure what's happening in the first case, but looks like
some kind of a race condition at a quick glance.
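
For reference, the timed-out wait corresponds to a poll along these lines in the test (rough sketch, not the exact test code):

# Wait until the primary reports the slot as lost after WAL removal.
$node_primary3->poll_query_until('postgres',
	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
	'lost')
  or die "timed out waiting for slot rep3 to become lost";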

Has anyone looked into these yet?

- Heikki

#2 Andres Freund <andres@anarazel.de>
In reply to: Heikki Linnakangas (#1)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04

[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr
0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved

The relevant bit from the log is:
2022-02-14 07:42:27.817 CET [6209f9d3.68bab:3] LOG: database system is ready to accept read-only connections
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed

I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);

Because the slot is created without reserving WAL that allows the primary to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?
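
I.e. roughly the following in the test setup (untested sketch; the second argument to pg_create_physical_replication_slot() makes it reserve WAL immediately):

# Create the slot with immediately_reserve = true so the primary keeps
# the needed WAL segments around until the standby connects.
$node_primary3->safe_psql('postgres',
	"SELECT pg_create_physical_replication_slot('rep3', true)");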

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses the
test.

Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...

%c apparently is hex(process startup time).hex(pid) in hex, so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
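
A quick way to double check that mapping (hypothetical one-liner, not part of the test):

# hex("382f58") == 3682136, i.e. the pid in the pg_basebackup_3682136 slot name
printf "%d\n", hex("382f58");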

The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1

Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?

Has anyone looked into these yet?

Hadn't yet...

Greetings,

Andres Freund

#3 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Andres Freund (#2)
Re: Race conditions in 019_replslot_limit.pl

At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in

I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);

Because the slot is created without reserving WAL that allows the primary to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?

Agreed. Doing this at all slot creations seems fine.

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses the
test.

Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...

%c apparently is hex(process startup time).hex(pid) in hex, so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));

The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1

Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?

It seems to me so, too.

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.

Has anyone looked into these yet?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Kyotaro Horiguchi (#3)
Re: Race conditions in 019_replslot_limit.pl

On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in

I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);

Because the slot is created without reserving WAL that allows the primary to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?

Agreed. Doing this at all slot creations seems fine.

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses the
test.

Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...

%c apparently is hex(process startup time).hex(pid) in hex, so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));

The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1

Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?

It seems to me so, too.

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.

Or it's possible that the process took a time to clean up the
temporary replication slot?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#5 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Kyotaro Horiguchi (#3)
1 attachment(s)
Re: Race conditions in 019_replslot_limit.pl

At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Agreed. Doing this at all slot creations seems fine.

Done in the attached. The first slot is deliberately created
unreserved so I changed the code to re-create the slot with "reserved"
before taking backup.
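
In shape it's roughly the following (sketch only, slot and node names assumed; the real change is in the attached patch):

# Drop the deliberately-unreserved slot and re-create it with WAL reserved
# before the base backup is taken.
$node_primary->safe_psql('postgres',
	"SELECT pg_drop_replication_slot('rep1')");
$node_primary->safe_psql('postgres',
	"SELECT pg_create_physical_replication_slot('rep1', true)");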

Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?

It seems to me so, too.

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.

s/ since//;

Whatever is happening at that time, I can make sure that walsender is
gone before making a new replication connection, even though it
doesn't "fix" any of the observed issues.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

make_019_replslot_limit_more_robust.txt (text/plain; charset=us-ascii)
#6 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Masahiko Sawada (#4)
Re: Race conditions in 019_replslot_limit.pl

At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.

Or it's possible that the process took a time to clean up the
temporary replication slot?

Ugh, it's an immediate shutdown. So postmaster kills the walsender no
matter what state the walsender is in, and leaves no log about the
end of the walsender.

Sorry for the confusion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Masahiko Sawada (#4)
Re: Race conditions in 019_replslot_limit.pl

At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Or it's possible that the process took a time to clean up the
temporary replication slot?

Checkpointer may take ReplicationSlotControlLock. Dead lock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Kyotaro Horiguchi (#7)
Re: Race conditions in 019_replslot_limit.pl

At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Or it's possible that the process took a time to clean up the
temporary replication slot?

Checkpointer may take ReplicationSlotControlLock. Dead lock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?

Or a missing CV broadcast? Anyway, I haven't found a scenario of
interference between checkpointer and walsender.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Kyotaro Horiguchi (#7)
Re: Race conditions in 019_replslot_limit.pl

On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Or it's possible that the process took a time to clean up the
temporary replication slot?

Checkpointer may take ReplicationSlotControlLock. Dead lock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?

That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.
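
For example, something like this (sketch; assumes the standby connects with application_name=standby_3):

# Only look at the walsender serving the standby, not any leftover
# walsender from pg_basebackup.
my $senderpid = $node_primary3->safe_psql('postgres',
	"SELECT pid FROM pg_stat_activity"
	. " WHERE backend_type = 'walsender' AND application_name = 'standby_3'");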

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#10 Andres Freund <andres@anarazel.de>
In reply to: Masahiko Sawada (#9)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote:

On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in

Or it's possible that the process took a time to clean up the
temporary replication slot?

Checkpointer may take ReplicationSlotControlLock. Dead lock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?

A deadlock requires some form of incorrect lock (or lock-like) nesting. Do
you have an idea what that could be?

That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.

I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.

It'd be good to try to reproduce this locally...

- Andres

#11 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#10)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.

I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331. The two PIDs reported in the failure
correspond to this postmaster log trace:

2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG: received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1

and this one:

2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG: replication connection authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG: received replication command: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that? But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04

#12 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#11)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 20:22:23 -0500, Tom Lane wrote:

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...

The thing is, we actually *do* log disconnection for walsenders:

2022-02-16 17:53:55.780 PST [2459806][walsender][:0] LOG: disconnection: session time: 0:00:01.321 user=andres database= host=[local]

The only way I see for the disconnection not to be logged is an immediate
shutdown.

It'd be one thing if the missing walsender disconnect wasn't logged because we
shut down just after. But unless I misunderstand something, between the
basebackup and the failing test, we actually start standby_3:

# Running: pg_basebackup -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_primary3_data/backup/my_backup -h /tmp/hyrtnyd2RU -p 53774 --checkpoint fast --no-sync
# Backup finished

...

# Running: pg_ctl -w -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_standby_3_data/pgdata -l /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/log/019_replslot_limit_standby_3.log -o --cluster-name=standby_3 start

and also wait till replication has caught up:

Waiting for replication conn standby_3's replay_lsn to pass 0/700000 on primary3
done

and only then we have

not ok 17 - have walsender pid 3682154
# 3682136

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.

Greetings,

Andres Freund

#13 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#12)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

On 2022-02-16 20:22:23 -0500, Tom Lane wrote:

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...

The thing is, we actually *do* log disconnection for walsenders:

Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.

Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.

Fair point. Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started. I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is

sysname | branch | snapshot | stage | l
--------------+--------+---------------------+---------------+---------------------------------------------
desmoxytes | HEAD | 2022-02-15 04:42:05 | recoveryCheck | # Failed test 'have walsender pid 1685516
idiacanthus | HEAD | 2022-02-15 07:24:05 | recoveryCheck | # Failed test 'have walsender pid 2758549
serinus | HEAD | 2022-02-15 11:00:08 | recoveryCheck | # Failed test 'have walsender pid 3682154
desmoxytes | HEAD | 2022-02-15 11:04:05 | recoveryCheck | # Failed test 'have walsender pid 3775359
flaviventris | HEAD | 2022-02-15 18:03:48 | recoveryCheck | # Failed test 'have walsender pid 1517077
idiacanthus | HEAD | 2022-02-15 22:48:05 | recoveryCheck | # Failed test 'have walsender pid 2494972
desmoxytes | HEAD | 2022-02-15 23:48:04 | recoveryCheck | # Failed test 'have walsender pid 3055399
desmoxytes | HEAD | 2022-02-16 10:48:05 | recoveryCheck | # Failed test 'have walsender pid 1593461
komodoensis | HEAD | 2022-02-16 21:16:04 | recoveryCheck | # Failed test 'have walsender pid 3726703
serinus | HEAD | 2022-02-17 01:18:17 | recoveryCheck | # Failed test 'have walsender pid 208363

So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?

regards, tom lane

#14 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Tom Lane (#13)
Re: Race conditions in 019_replslot_limit.pl

I wrote:

So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?

Eyeballing the commit log for potential causes in that time frame,
I can't help noticing

2 days ago Andres Freund Move replication slot release to before_shmem_exit().

regards, tom lane

#15 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#13)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 22:11:30 -0500, Tom Lane wrote:

So (a) it broke around 48 hours ago, which is already a useful
bit of info

Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit
to blame.

We document before_shmem_exit hooks as
/*
* Call before_shmem_exit callbacks.
*
* These should be things that need most of the system to still be up and
* working, such as cleanup of temp relations, which requires catalog
* access; or things that need to be completed because later cleanup steps
* depend on them, such as releasing lwlocks.
*/
and several before_shmem_exit callbacks use lwlocks.

But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?

, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?

Flaviventris, serinus use the newest snapshot of gcc available in debian, one
with -O0, the other with O3.

desmoxytes, idiacanthus, komodoensis all have JIT forced for every query.

They all run on the same host - looking at stats it doesn't look crazily
overcommitted or such. But it might have more scheduler variance than most?

Greetings,

Andres Freund

#16 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#15)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?

If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks. I wonder
though if an exiting walsender would always take that path.
Maybe it'd be good if we released all LWLocks even if we
don't think we're in a transaction?

More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?

regards, tom lane

#17 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#15)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 19:46:26 -0800, Andres Freund wrote:

But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?

Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to
take care of that.

I checked and walsenders do have ShutdownPostgres() and I don't see anything
else that's missing it either.

I'll run the test in a loop, perhaps I can reproduce...

Greetings,

Andres Freund

#18 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#16)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 22:58:19 -0500, Tom Lane wrote:

I wonder though if an exiting walsender would always take that path.

You're right - the CFI()s in PostgresMain() and in several replication commands are
not at all guaranteed to be in a transaction when throwing a FATAL.

I don't quite see where we could be throwing an error with a relevant lwlock
held yet though.

Maybe it'd be good if we released all LWLocks even if we don't think we're
in a transaction?

Yea, probably a good idea. I think there's a few hooks that could
conceivably be in trouble otherwise.

More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.

It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?

Yea, I agree, it probably could be done later. It's not super obvious when
though. If e.g. somebody wrote a non-database connected bgworker to
synchronize slot state with another node (for failover, there's even patches
for it), it'd be just after BaseInit(). Additionally, it was previously in
ProcKill(), so moving it to something registered early-ish seemed sensible.

Greetings,

Andres Freund

#19 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#17)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-16 20:03:00 -0800, Andres Freund wrote:

I'll run the test in a loop, perhaps I can reproduce...

I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.

There've not been any new reports in the last 18 hours, but that's probably
just due to lower commit activity triggering fewer runs.

Greetings,

Andres Freund

#20 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#19)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-17 16:34:34 -0800, Andres Freund wrote:

I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.

Playing around with this further I did get into one interesting state:

I started psql with replication=1, created a slot
SELECT pg_drop_replication_slot('t'); SELECT pg_create_physical_replication_slot('t', true);
which was at 0/200002C8 at this point
START_REPLICATION SLOT "t" 0/200002C8 TIMELINE 1
then did
SELECT pg_switch_wal(); CHECKPOINT;
in another session, which tried to invalidate that slot because of
max_slot_wal_keep_size=1MB

At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.

At this point checkpointer is stuck, because it is waiting for the connection
to end.

Quitting the "psql walsender" or terminating it again resolves the issue.

Not sure how this could be related, but it seems interesting.

"psql walsender":

(gdb) bt
#0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834
#14 0x00007faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 "START_REPLICATION SLOT \"t\" 0/1B000000 TIMELINE 1\n;")
at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771
#15 0x00007faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", username=0x7faf4c22fcc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494
#16 0x00007faf4b830e7e in BackendRun (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593
#17 0x00007faf4b83075f in BackendStartup (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321
#18 0x00007faf4b82c55b in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801
#19 0x00007faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473

checkpointer:
(gdb) bt
#0 0x00007f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472
#4 0x00007f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:163
#5 0x00007f0d5018fd6c in ConditionVariableSleep (cv=0x7f0d4c342268, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:100
#6 0x00007f0d50132525 in InvalidatePossiblyObsoleteSlot (s=0x7f0d4c342188, oldestLSN=553648128, invalidated=0x7fffe7c5f577)
at /home/andres/src/postgresql/src/backend/replication/slot.c:1290
#7 0x00007f0d501326af in InvalidateObsoleteReplicationSlots (oldestSegno=33) at /home/andres/src/postgresql/src/backend/replication/slot.c:1359
#8 0x00007f0d4fd84072 in CreateCheckPoint (flags=108) at /home/andres/src/postgresql/src/backend/access/transam/xlog.c:6620
#9 0x00007f0d500cd278 in CheckpointerMain () at /home/andres/src/postgresql/src/backend/postmaster/checkpointer.c:445
#10 0x00007f0d500cad7a in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/auxprocess.c:153
#11 0x00007f0d500e034c in StartChildProcess (type=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5531
#12 0x00007f0d500dacf1 in PostmasterMain (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1458
#13 0x00007f0d4ffcfc19 in main (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/main/main.c:202

Greetings,

Andres Freund

#21 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#19)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

There've not been any new reports in the last 18 hours, but that's probably
just due to lower commit activity triggering fewer runs.

myna just showed the same symptom:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-02-18%2003%3A00%3A17

# Failed test 'have walsender pid 13681
# 13670'
# at t/019_replslot_limit.pl line 335.
# '13681
# 13670'
# doesn't match '(?^:^[0-9]+$)'
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 18.
[03:13:09] t/019_replslot_limit.pl ..............

I think that's the first one on something outside your menagerie.

regards, tom lane

#22 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#20)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-17 19:41:40 -0800, Andres Freund wrote:

At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

I wonder if we should try to send, but do it in a nonblocking way.

Greetings,

Andres Freund

#23 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#22)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-17 21:55:21 -0800, Andres Freund wrote:

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

This is particularly likely to be a problem for walsenders, because they often
have a large output buffer filled, because walsender uses
pq_putmessage_noblock() to send WAL data. Which obviously can be large.

In the stacktrace upthread you can see:
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:29

which certainly is more than in most other cases of error messages being
sent. And it obviously might not be the first to have gone out.

I wonder if we should try to send, but do it in a nonblocking way.

I think we should probably do so at least during FATAL error processing. But
also consider doing so for ERROR, because not releasing resources after
getting cancelled / terminated is pretty nasty imo.

Greetings,

Andres Freund

#24 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#23)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-18 14:42:48 -0800, Andres Freund wrote:

On 2022-02-17 21:55:21 -0800, Andres Freund wrote:

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

This is particularly likely to be a problem for walsenders, because they often
have a large output buffer filled, because walsender uses
pq_putmessage_noblock() to send WAL data. Which obviously can be large.

In the stacktrace upthread you can see:
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:29

which certainly is more than in most other cases of error messages being
sent. And it obviously might not be the first to have gone out.

I wonder if we should try to send, but do it in a nonblocking way.

I think we should probably do so at least during FATAL error processing. But
also consider doing so for ERROR, because not releasing resources after
getting cancelled / terminated is pretty nasty imo.

Is it possible that what we're seeing is a deadlock, with both walsender and
the pg_basebackup child trying to send data, but neither receiving?

But that'd require that somehow the basebackup child process didn't exit with
its parent. And I don't really see how that'd happen.

I'm running out of ideas for how to try to reproduce this. I think we might
need some additional debugging information to get more information from the
buildfarm.

I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
to pg_basebackup in $node_primary3->backup(...).

It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
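
The test-side piece of that could look roughly like this (sketch; passing --verbose to pg_basebackup depends on how backup() is invoked and isn't shown here):

# Raise the log level on primary3 so the slot cleanup path shows up in its log.
$node_primary3->append_conf('postgresql.conf', 'log_min_messages = debug2');
$node_primary3->reload;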

Greetings,

Andres Freund

#25 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#23)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

On 2022-02-17 21:55:21 -0800, Andres Freund wrote:

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

It's more or less intentional, per elog.c:

/*
* This flush is normally not necessary, since postgres.c will flush out
* waiting data when control returns to the main loop. But it seems best
* to leave it here, so that the client has some clue what happened if the
* backend dies before getting back to the main loop ... error/notice
* messages should not be a performance-critical path anyway, so an extra
* flush won't hurt much ...
*/
pq_flush();

Perhaps it'd be sensible to do this only in debugging (ie Assert)
builds?

regards, tom lane

#26 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#25)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-18 18:15:21 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2022-02-17 21:55:21 -0800, Andres Freund wrote:

Isn't it pretty bonkers that we allow error processing to get stuck behind
network traffic, *before* we have released resources (locks etc)?

It's more or less intentional, per elog.c:

/*
* This flush is normally not necessary, since postgres.c will flush out
* waiting data when control returns to the main loop. But it seems best
* to leave it here, so that the client has some clue what happened if the
* backend dies before getting back to the main loop ... error/notice
* messages should not be a performance-critical path anyway, so an extra
* flush won't hurt much ...
*/
pq_flush();

Perhaps it'd be sensible to do this only in debugging (ie Assert)
builds?

That seems not great, because it pretty clearly can lead to hangs, which is
problematic in tests too. What about using pq_flush_if_writable()? In nearly
all situations that'd still push the failure to the client.

We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
obviously tries to send (as in the backtrace upthread) if the output buffer is
large enough, which it often will be in walsender.

I guess we could try to flush in a blocking manner sometime later in the
shutdown sequence, after we've released resources? But I'm doubtful it's a
good idea, we don't really want to block waiting to exit when e.g. the network
connection is dead without the TCP stack knowing.

Hm. There already is code trying to short-circuit sending errors to the client
if a backend gets terminated. Introduced in

commit 2ddb9149d14de9a2e7ac9ec6accf3ad442702b24
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2018-10-19 21:39:21 -0400

Server-side fix for delayed NOTIFY and SIGTERM processing.

and predecessors.

If ProcessClientWriteInterrupt() sees ProcDiePending, we'll stop trying to
send stuff to the client if writes block. However, this doesn't work here,
because we've already unset ProcDiePending:

#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834

Before ProcessInterrupts() FATALs due to a SIGTERM, it resets ProcDiePending.

This seems not optimal.

We can't just leave ProcDiePending set, otherwise we'll probably end up
throwing more errors during the shutdown sequence. But it seems we need
something similar to proc_exit_inprogress, except set earlier? And then take
that into account in ProcessClientWriteInterrupt()?

Greetings,

Andres Freund

#27 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#26)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

On 2022-02-18 18:15:21 -0500, Tom Lane wrote:

Perhaps it'd be sensible to do this only in debugging (ie Assert)
builds?

That seems not great, because it pretty clearly can lead to hangs, which is
problematic in tests too. What about using pq_flush_if_writable()? In nearly
all situations that'd still push the failure to the client.

That'd be okay by me.

We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
obviously tries to send (as in the backtrace upthread) if the output buffer is
large enough, which it often will be in walsender.

I don't see that as "obvious". If we're there, we do not have an
error situation.

I guess we could try to flush in a blocking manner sometime later in the
shutdown sequence, after we've released resources? But I'm doubtful it's a
good idea, we don't really want to block waiting to exit when e.g. the network
connection is dead without the TCP stack knowing.

I think you are trying to move in the direction of possibly exiting
without ever sending at all, which does NOT seem like an improvement.

regards, tom lane

#28 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#27)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-18 18:49:14 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
obviously tries to send (as in the backtrace upthread) if the output buffer is
large enough, which it often will be in walsender.

I don't see that as "obvious". If we're there, we do not have an
error situation.

The problem is that, due to walsender using pq_putmessage_noblock(), the output
buffer is often going to be too full for a plain pq_endmessage() to not send
out data. Because walsender will have an output buffer > PQ_SEND_BUFFER_SIZE
a lot of the time, errors will commonly be thrown with the output buffer
full.

That leads the pq_endmessage() in send_message_to_frontend() to block sending
the error message and the preceding WAL data. Leading to backtraces like:

#0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404

I guess we could try to flush in a blocking manner sometime later in the
shutdown sequence, after we've released resources? But I'm doubtful it's a
good idea, we don't really want to block waiting to exit when e.g. the network
connection is dead without the TCP stack knowing.

I think you are trying to move in the direction of possibly exiting
without ever sending at all, which does NOT seem like an improvement.

I was just talking about doing another blocking pq_flush(), in addition to the
pq_flush_if_writable() earlier. That'd be trying harder to send out data, not
less hard...

Greetings,

Andres Freund

#29 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#24)
1 attachment(s)
Re: Race conditions in 019_replslot_limit.pl

Hi,

I think I did find a bug related to the test, but afaics not the cause of the
test failures we're seeing. See
/messages/by-id/20220223014855.4lsddr464i7mymk2@alap3.anarazel.de

I don't think it's related to the problem of this thread, because the logs of
primary3 don't have a single mention of

ereport(LOG,
(errmsg("terminating process %d to release replication slot \"%s\"",
active_pid, NameStr(slotname))));

On 2022-02-18 15:14:15 -0800, Andres Freund wrote:

I'm running out of ideas for how to try to reproduce this. I think we might
need some additional debugging information to get more information from the
buildfarm.

I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
to pg_basebackup in $node_primary3->backup(...).

It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().

Planning to commit something like the attached.

Greetings,

Andres Freund

Attachments:

v1-0001-Add-temporary-debug-info-to-help-debug-019_replsl.patch (text/x-diff; charset=us-ascii)
#30 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#29)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:

On 2022-02-18 15:14:15 -0800, Andres Freund wrote:

I'm running out of ideas for how to try to reproduce this. I think we might
need some additional debugging information to get more information from the
buildfarm.

I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
to pg_basebackup in $node_primary3->backup(...).

It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().

Planning to commit something like the attached.

This did provide us a bit more detail.

Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown. Just checking for
temporary replication slots that need to be dropped requires
ReplicationSlotControlLock. Actually dropping also requires
ReplicationSlotAllocationLock

1997084 does have a temporary replication slot to clean up.

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup. It does not have a temporary
replication slot. Making it look like somehow ReplicationSlotAllocationLock
hasn't been released.

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 019_replslot_limit.pl DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG: received immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG: server process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time.

We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
	my ($stdout, $stderr);
	$node_primary3->psql('postgres',
		"\\a\\t\nSELECT * FROM pg_stat_activity",
		stdout => \$stdout, stderr => \$stderr);
	diag $stdout, $stderr;
	$node_primary3->stop('fast');
	$node_standby3->stop('fast');
	die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund

#31 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#30)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown.

Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
sequence it's hanging up.

We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

Does that risk an indefinite hangup of the buildfarm run?

regards, tom lane

#32Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#31)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown.

Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
sequence it's hanging up.

Yea, was thinking that as well.

I'm also wondering whether it's worth adding an assert, or at least a WARNING,
about no lwlocks held to the tail end of ShutdownPostgres? I don't want to
add an LWLockReleaseAll() yet, before I understand what's actually happening.

We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

Does that risk an indefinite hangup of the buildfarm run?

I think not. The pg_ctl stop -m fast should time out after PGCTLTIMEOUT,
$self->_update_pid(-1); should notice it's not dead. The END{} block should
then shut it down in immediate mode.

Greetings,

Andres Freund

#33Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#32)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-02-25 12:15:58 -0800, Andres Freund wrote:

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:
Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
sequence it's hanging up.

Yea, was thinking that as well.

I'm also wondering whether it's worth adding an assert, or at least a WARNING,
about no lwlocks held to the tail end of ShutdownPostgres? I don't want to
add an LWLockReleaseAll() yet, before I understand what's actually happening.

Did those things. It sure looks like there's some interaction with the
checkpointer. There's a similar sequence in the two failures since the
additional debugging went in.

2022-02-26 06:35:52.453 CET [6219bc37.8717f:20] DEBUG: replication slot drop: pg_basebackup_553343: removed on-disk
2022-02-26 06:35:52.453 CET [6219bc37.87168:17] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/700060 oldest xid 720 latest complete 719 next xid 720)
2022-02-26 06:35:52.453 CET [6219bc37.87168:18] DEBUG: begin invalidating obsolete replication slots older than 0/700000
2022-02-26 06:35:52.453 CET [6219bc37.87168:19] DEBUG: done invalidating obsolete replication slots
2022-02-26 06:35:52.453 CET [6219bc37.87168:20] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000006
2022-02-26 06:35:52.453 CET [6219bc37.87168:21] DEBUG: removing write-ahead log file "000000010000000000000006"
2022-02-26 06:35:52.453 CET [6219bc37.87168:22] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2022-02-26 06:35:52.453 CET [6219bc37.87168:23] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.351 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
[...]
2022-02-26 06:35:52.970 CET [6219bc37.87139:71] LOG: received fast shutdown request
2022-02-26 06:35:52.970 CET [6219bc37.87139:72] LOG: aborting any active transactions
[...]
2022-02-26 06:35:52.971 CET [6219bc37.87168:24] LOG: shutting down
2022-02-26 06:35:52.976 CET [6219bc37.8717f:21] DEBUG: replication slot drop: pg_basebackup_553343: done
2022-02-26 06:35:52.976 CET [6219bc37.8717f:22] DEBUG: temporary replication slot cleanup: begin
2022-02-26 06:35:52.976 CET [6219bc37.8717f:23] DEBUG: temporary replication slot cleanup: 0 in use, active_pid: 553385
2022-02-26 06:35:52.976 CET [6219bc37.8717f:24] DEBUG: temporary replication slot cleanup: done
2022-02-26 06:35:52.977 CET [6219bc37.8717f:25] DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:26] DEBUG: proc_exit(0): 3 callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:27] LOG: disconnection: session time: 0:00:00.996 user=bf database= host=[local]
2022-02-26 06:35:52.977 CET [6219bc37.8717f:28] DEBUG: exit(0)
2022-02-26 06:35:52.977 CET [6219bc37.8717f:29] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:30] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:31] DEBUG: proc_exit(-1): 0 callbacks to make

So the backend (8717f/553343) is in the middle of ReplicationSlotDrop(), after
removing on-disk data. And then for 500ms nothing happens until checkpointer
wakes up again. As soon as it does, the slot drop continues.

Just before calling ReplicationSlotDrop() we were able to acquire
ReplicationSlotControlLock in share mode. Just after the log message after
which there's the delay, ReplicationSlotControlLock is locked in exclusive
mode.

Too tired now..

Greetings,

Andres Freund

#34Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#33)
Re: Race conditions in 019_replslot_limit.pl

I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-03-23%2011%3A00%3A05

This is just two of the 33 such failures in the past ten days,
so maybe it's not surprising that we didn't see it already.
(I got bored before looking back further than that.)

What this suggests to me is that maybe the extra walsender is
indeed not blocked on anything, but is just taking its time
about exiting. In these two runs, as well as in all the
non-failing runs, it had enough time to do so.

I suggest that we add a couple-of-seconds sleep in front of
the query that collects walsender PIDs, and maybe a couple more
seconds before the pg_stat_activity probe in the failure path,
and see if the behavior changes at all. That should be enough
to confirm or disprove this idea pretty quickly. If it is
right, a permanent fix could be to wait for the basebackup's
walsender to disappear from node_primary3's pg_stat_activity
before we start the one for node_standby3.
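
For illustration, a minimal sketch of what that wait could look like in the
TAP test (assuming the test's existing $node_primary3 handle, the standard
poll_query_until() helper, and that the basebackup walsender reports
application_name 'pg_basebackup' in pg_stat_activity):

	# Hypothetical guard, not actual test code: wait until the
	# pg_basebackup walsender is gone before setting up the walsender
	# for node_standby3.
	$node_primary3->poll_query_until('postgres',
		"SELECT count(*) = 0 FROM pg_stat_activity "
		  . "WHERE backend_type = 'walsender' "
		  . "AND application_name LIKE 'pg_basebackup%'")
	  or die "timed out waiting for the basebackup walsender to exit";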

regards, tom lane

#35Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#34)
1 attachment(s)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:

I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:

Oh. That is interesting. Thanks for catching that.

What this suggests to me is that maybe the extra walsender is
indeed not blocked on anything, but is just taking its time
about exiting. In these two runs, as well as in all the
non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking pattern
in [1] still seems suspicious. But the above does seem to suggest those might
just have been red herrings.

I had previously tried to make process exit slow by inserting delays at
various points, and only succeeded in making tests fail due to too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a kind of
similar failure by sticking
elog(LOG, "received disconnect");
pg_usleep(220000);
into the 'X' case in PostgresMain().

However, I always see three pids or the expected one pid, not two like the BF
cases. But it's not too surprising that such an artificial, uniform, slowdown
would have different symptoms than reality.

I suggest that we add a couple-of-seconds sleep in front of
the query that collects walsender PIDs, and maybe a couple more
seconds before the pg_stat_activity probe in the failure path,
and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd be useful to sleep for a bit
before the ->stop('fast'); calls, so that we can see whether it's just the
shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.
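
Roughly, the shape of that retry loop (a sketch of the idea only, not the
attached patch itself; it assumes Time::HiRes, Test::More's like()/diag(), and
the test's existing $node_primary3 handle):

	use Time::HiRes qw(usleep);

	# Sketch only: poll for up to 15s (150 iterations of 100ms), dumping
	# pg_stat_activity on each iteration that still sees more than one
	# walsender pid.
	my $senderpid;
	foreach my $i (1 .. 150)
	{
		$senderpid = $node_primary3->safe_psql('postgres',
			"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
		last if $senderpid =~ qr/^[0-9]+$/;
		diag "multiple walsenders active in iteration $i:\n"
		  . $node_primary3->safe_psql('postgres', 'SELECT * FROM pg_stat_activity');
		usleep(100_000);
	}
	like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");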

That should be enough to confirm or disprove this idea pretty quickly. If
it is right, a permanent fix could be to wait for the basebackup's walsender
to disappear from node_primary3's pg_stat_activity before we start the one
for node_standby3.

For some tests a "confirmed disconnect" mode would be useful, i.e. the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1]: /messages/by-id/20220226072325.wtvkwvvga2wc3nkn@alap3.anarazel.de

Attachments:

019_replslot_limit_debug.diff (text/x-diff; charset=us-ascii)
#36Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.

OK by me.

regards, tom lane

#37Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.

We now have four instances of this failure with this version of the test,
and in every case the second iteration succeeded. Is that enough
evidence yet?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18

I'd like to silence this noise so that we can start tracking
lower-probability failure modes, like say these:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

regards, tom lane

#38Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#37)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.

We have now four instances of failures with this version of the test,
and in every case the second iteration succeeded. Is that enough
evidence yet?

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: removed on-disk
...
2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG: connection authorized: user=andrew database=postgres application_name=019_replslot_limit.pl
2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG: server process (PID 750759) exited with exit code 0
2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: begin
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: removed on-disk
...
2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG: server process (PID 4022841) exited with exit code 0
2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: removed on-disk
...
2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG: server process (PID 3730461) exited with exit code 0
2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18
2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: removed on-disk
...
2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG: server process (PID 2674463) exited with exit code 0
...
2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: done

We are able to start / finish several new connections between the two debug
elog()s in ReplicationSlotDropPtr()?

I wonder if there's something odd going on with ConditionVariableBroadcast().

Might be useful to add another debug message before/after
ConditionVariableBroadcast() and rmtree(). If the delay is due to rmtree()
being slow under concurrent tests I'd feel less concerned (although that
machine has dual NVMe drives...).

I really wish I could reproduce the failure. I went through a few hundred
cycles of that test in a separate checkout on that machine.

I'd like to silence this noise so that we can start tracking
lower-probability failure modes, like say these:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag. I see that failure reproducibly locally
when using asan with clang, unless I use -fsanitize-address-use-after-return=never. gcc has a different default
for the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack. Sometimes our
stack depth functions get called with the "proper" stack, and sometimes with
a "shadow" stack, which breaks our stack depth checking.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet. I assume it's just a race in a new test...

Greetings,

Andres Freund

#39Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#38)
Re: Race conditions in 019_replslot_limit.pl

Andres Freund <andres@anarazel.de> writes:

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

WFM.

The way the temporary slot removal hangs for a while seems just off:

Perhaps, but right now we're causing noise in the buildfarm and
learning nothing.

regards, tom lane

#40Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#39)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-03-27 17:36:14 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

WFM.

I've done so now.

Greetings,

Andres Freund

#41Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#40)
Re: Race conditions in 019_replslot_limit.pl

Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:

On 2022-03-27 17:36:14 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

WFM.

I've done so now.

I did look over the test results a couple times since then and once more
today. There were a few cases with pretty significant numbers of iterations:

The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:
# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, where the load on my buildfarm animal boxes was higher than normal.

In comparison to earlier approaches, with the current in-tree approach, we
don't do anything when hitting the "problem", other than wait. Which does give
us additional information - afaics there's nothing at all indicating that some
other backend existed allowing the replication slot drop to finish.

It just looks like, for reasons I still do not understand, removing a directory
and 2 files or so takes multiple seconds (at least ~36 new connections, 18
pg_usleep(100_100)), while there are no other indications of problems.

I also still don't have a theory why this suddenly started to happen.

Unless somebody has another idea, I'm planning to remove all the debugging
code added, but keep the retry based approach in 019_replslot_limit.pl, so we
don't again get all the spurious failures.

Greetings,

Andres Freund

#42Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#41)
Re: Race conditions in 019_replslot_limit.pl

At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:

On 2022-03-27 17:36:14 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

WFM.

I've done so now.

I did look over the test results a couple times since then and once more
today. There were a few cases with pretty significant numbers of iterations:

The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:
# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.

In comparison to earlier approaches, with the current in-tree approach, we
don't do anything when hitting the "problem", other than wait. Which does give
us additional information - afaics there's nothing at all indicating that some
other backend existed allowing the replication slot drop to finish.

preventing? The checkpointer and a client backend that ran "SELECT * FROM
pg_stat_activity" are the only processes running during the blocked state.

It just looks like, for reasons I still do not understand, removing a
directory and 2 files or so takes multiple seconds (at least ~36 new
connections, 18 pg_usleep(100_100)), while there are no other indications of
problems.

That fact supports the idea that CheckPointReplicationSlots took a long time.

I also still don't have a theory why this suddenly started to happen.

Maybe we need to see the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools have put significant load on the disks?

Unless somebody has another idea, I'm planning to remove all the debugging
code added, but keep the retry based approach in 019_replslot_limit.pl, so we
don't again get all the spurious failures.

+1.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center