Race conditions in 019_replslot_limit.pl
While looking at recent failures in the new 028_pitr_timelines.pl
recovery test, I noticed that there have been a few failures in the
buildfarm in the recoveryCheck phase even before that, in the
019_replslot_limit.pl test.
For example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00
usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
The latter looks like there are two walsenders active, which confuses
the test. Not sure what's happening in the first case, but looks like
some kind of a race condition at a quick glance.
Has anyone looked into these yet?
- Heikki
Hi,
On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr
0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
The relevant bit from the log is:
2022-02-14 07:42:27.817 CET [6209f9d3.68bab:3] LOG: database system is ready to accept read-only connections
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:1] LOG: started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001B has already been removed
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot-based conflict. I think
that should be fixable by reserving the slot at creation time?
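A minimal sketch of that fix (hypothetical; the slot name 'rep3' is taken from the failing test, and the second argument of pg_create_physical_replication_slot, immediately_reserve, is what makes the slot hold onto WAL from creation time):

```sql
-- hedged sketch: create the slot with WAL reserved immediately, so the
-- primary cannot recycle segments before the standby's first connection
SELECT pg_create_physical_replication_slot('rep3', true);
```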
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
The latter looks like there are two walsenders active, which confuses the
test.
Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...
%c apparently is hex(process startup time).hex(pid), so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
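As a quick sanity check (a throwaway sketch, not part of any test; the session id 620b8a1c.382f58 is copied from the log lines above), the %c value decodes like this:

```python
# Decode a log_line_prefix %c session id: hex(backend start time).hex(pid).
# The example value is taken from the serinus log above.
from datetime import datetime, timezone

session_id = "620b8a1c.382f58"
start_hex, pid_hex = session_id.split(".")

start = datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc)
pid = int(pid_hex, 16)

print(start.isoformat())  # 2022-02-15T11:10:20+00:00, i.e. 12:10:20 CET
print(pid)                # 3682136, matching pg_basebackup_3682136
```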
The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Has anyone looked into these yet?
Hadn't yet...
Greetings,
Andres Freund
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?
Agreed. Doing this at all slot creations seems fine.
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
...
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
It seems to me so, too.
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Has anyone looked into these yet?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);
Because the slot is created without reserving WAL, the primary is allowed to
remove the WAL segments without ever creating a slot-based conflict. I think
that should be fixable by reserving the slot at creation time?
Agreed. Doing this at all slot creations seems fine.
...
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Or it's possible that the process took a time to clean up the
temporary replication slot?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Agreed. Doing this at all slot creations seems fine.
Done in the attached. The first slot is deliberately created
unreserved so I changed the code to re-create the slot with "reserved"
before taking backup.
Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
It seems to me so, too.
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
s/ since//;
Whatever is happening at that time, I can make sure that the walsender is
gone before making a new replication connection, even though it
doesn't "fix" any of the observed issues.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
make_019_replslot_limit_more_robust.txt (text/plain; charset=us-ascii), +15 -2
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?
Yes, but since postmaster seems thinking that process is gone.
Or it's possible that the process took a time to clean up the
temporary replication slot?
Ugh, it's an immediate shutdown. So postmaster kills the walsender no
matter what state the walsender is in, and leaves no log about the
end of the walsender.
Sorry for the confusion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
Or a missing CV broadcast? Anyway, I haven't found a scenario of
interference between checkpointer and walsender.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote:
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
Or it's possible that the process took a time to clean up the
temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
A deadlock requires some form of incorrect lock (or lock-like) nesting. Do
you have an idea what that could be?
That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding a condition of application_name to
the query.
I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.
It'd be good try to reproduce this locally...
- Andres
Andres Freund <andres@anarazel.de> writes:
I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding what it is.
I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331. The two PIDs reported in the failure
correspond to this postmaster log trace:
2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG: received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
and this one:
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG: replication connection authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG: received replication command: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that? But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?
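A sketch of that more selective query (hypothetical; application_name=standby_3 is the value seen in the standby's connection log above):

```sql
-- hedged sketch: restrict the lookup to the standby's walsender so a
-- lingering pg_basebackup walsender doesn't produce a second pid
SELECT pid
FROM pg_stat_activity
WHERE backend_type = 'walsender'
  AND application_name = 'standby_3';
```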
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04
Hi,
On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
The thing is, we actually *do* log disconnection for walsenders:
2022-02-16 17:53:55.780 PST [2459806][walsender][:0] LOG: disconnection: session time: 0:00:01.321 user=andres database= host=[local]
The only way I see for the disconnection not to be logged is an immediate
shutdown.
It'd be one thing if the missing walsender disconnect wasn't logged because we
shut down just after. But unless I misunderstand something, between the
basebackup and the failing test, we actually start standby_3:
# Running: pg_basebackup -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_primary3_data/backup/my_backup -h /tmp/hyrtnyd2RU -p 53774 --checkpoint fast --no-sync
# Backup finished
...
# Running: pg_ctl -w -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_standby_3_data/pgdata -l /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/log/019_replslot_limit_standby_3.log -o --cluster-name=standby_3 start
and also wait till replication has caught up:
Waiting for replication conn standby_3's replay_lsn to pass 0/700000 on primary3
done
and only then we have
not ok 17 - have walsender pid 3682154
# 3682136
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
The thing is, we actually *do* log disconnection for walsenders:
Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.
Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.
Fair point. Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started. I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is
sysname | branch | snapshot | stage | l
--------------+--------+---------------------+---------------+---------------------------------------------
desmoxytes | HEAD | 2022-02-15 04:42:05 | recoveryCheck | # Failed test 'have walsender pid 1685516
idiacanthus | HEAD | 2022-02-15 07:24:05 | recoveryCheck | # Failed test 'have walsender pid 2758549
serinus | HEAD | 2022-02-15 11:00:08 | recoveryCheck | # Failed test 'have walsender pid 3682154
desmoxytes | HEAD | 2022-02-15 11:04:05 | recoveryCheck | # Failed test 'have walsender pid 3775359
flaviventris | HEAD | 2022-02-15 18:03:48 | recoveryCheck | # Failed test 'have walsender pid 1517077
idiacanthus | HEAD | 2022-02-15 22:48:05 | recoveryCheck | # Failed test 'have walsender pid 2494972
desmoxytes | HEAD | 2022-02-15 23:48:04 | recoveryCheck | # Failed test 'have walsender pid 3055399
desmoxytes | HEAD | 2022-02-16 10:48:05 | recoveryCheck | # Failed test 'have walsender pid 1593461
komodoensis | HEAD | 2022-02-16 21:16:04 | recoveryCheck | # Failed test 'have walsender pid 3726703
serinus | HEAD | 2022-02-17 01:18:17 | recoveryCheck | # Failed test 'have walsender pid 208363
So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
regards, tom lane
I wrote:
So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
Eyeballing the commit log for potential causes in that time frame,
I can't help noticing
2 days ago Andres Freund Move replication slot release to before_shmem_exit().
regards, tom lane
Hi,
On 2022-02-16 22:11:30 -0500, Tom Lane wrote:
So (a) it broke around 48 hours ago, which is already a useful
bit of info
Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit
to blame.
We document before_shmem_exit hooks as
/*
* Call before_shmem_exit callbacks.
*
* These should be things that need most of the system to still be up and
* working, such as cleanup of temp relations, which requires catalog
* access; or things that need to be completed because later cleanup steps
* depend on them, such as releasing lwlocks.
*/
and several before_shmem_exit callbacks use lwlocks.
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
Flaviventris, serinus use the newest snapshot of gcc available in debian, one
with -O0, the other with O3.
desmoxytes, idiacanthus, komodoensis all have JIT forced for every query.
They all run on the same host - looking at stats it doesn't look crazily
overcommitted or such. But it might have more scheduler variance than most?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks. I wonder
though if an exiting walsender would always take that path.
Maybe it'd be good if we released all LWLocks even if we
don't think we're in a transaction?
More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?
regards, tom lane
Hi,
On 2022-02-16 19:46:26 -0800, Andres Freund wrote:
But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?
Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to
take care of that.
I checked and walsenders do have ShutdownPostgres() and I don't see anything
else that's missing it either.
I'll run the test in a loop, perhaps I can reproduce...
Greetings,
Andres Freund
Hi,
On 2022-02-16 22:58:19 -0500, Tom Lane wrote:
I wonder though if an exiting walsender would always take that path.
You're right - the CFI() in PostgresMain() and several replication commands are
not at all guaranteed to be in a transaction when throwing a FATAL.
I don't quite see where we could be throwing an error with a relevant lwlock
held yet though.
Maybe it'd be good if we released all LWLocks even if we don't think we're
in a transaction?
Yea, probably a good idea. I think there are a few hooks that could
conceivably be in trouble otherwise.
More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?
Yea, I agree, it probably could be done later. It's not super obvious when
though. If e.g. somebody wrote a non-database connected bgworker to
synchronize slot state with another node (for failover, there's even patches
for it), it'd be just after BaseInit(). Additionally, it was previously in
ProcKill(), so moving it to something registered early-ish seemed sensible.
Greetings,
Andres Freund
Hi,
On 2022-02-16 20:03:00 -0800, Andres Freund wrote:
I'll run the test in a loop, perhaps I can reproduce...
I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.
There've not been any new reports in the last 18 hours, but that's probably
just due to lower commit activity triggering fewer runs.
Greetings,
Andres Freund
Hi,
On 2022-02-17 16:34:34 -0800, Andres Freund wrote:
I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.
Playing around with this further I did get into one interesting state:
I started psql with replication=1, created a slot
SELECT pg_drop_replication_slot('t');
SELECT pg_create_physical_replication_slot('t', true);
which was at 0/200002C8 at this point
START_REPLICATION SLOT "t" 0/200002C8 TIMELINE 1
then did
SELECT pg_switch_wal();
checkpoint;
in another session, which tried to invalidate that slot because of
max_slot_wal_keep_size=1MB
At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.
At this point checkpointer is stuck, because it is waiting for the connection
to end.
Quitting the "psql walsender" or terminating it again resolves the issue.
Not sure how this could be related, but it seems interesting.
"psql walsender":
(gdb) bt
#0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834
#14 0x00007faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 "START_REPLICATION SLOT \"t\" 0/1B000000 TIMELINE 1\n;")
at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771
#15 0x00007faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", username=0x7faf4c22fcc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494
#16 0x00007faf4b830e7e in BackendRun (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593
#17 0x00007faf4b83075f in BackendStartup (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321
#18 0x00007faf4b82c55b in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801
#19 0x00007faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473
checkpointer:
(gdb) bt
#0 0x00007f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472
#4 0x00007f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:163
#5 0x00007f0d5018fd6c in ConditionVariableSleep (cv=0x7f0d4c342268, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:100
#6 0x00007f0d50132525 in InvalidatePossiblyObsoleteSlot (s=0x7f0d4c342188, oldestLSN=553648128, invalidated=0x7fffe7c5f577)
at /home/andres/src/postgresql/src/backend/replication/slot.c:1290
#7 0x00007f0d501326af in InvalidateObsoleteReplicationSlots (oldestSegno=33) at /home/andres/src/postgresql/src/backend/replication/slot.c:1359
#8 0x00007f0d4fd84072 in CreateCheckPoint (flags=108) at /home/andres/src/postgresql/src/backend/access/transam/xlog.c:6620
#9 0x00007f0d500cd278 in CheckpointerMain () at /home/andres/src/postgresql/src/backend/postmaster/checkpointer.c:445
#10 0x00007f0d500cad7a in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/auxprocess.c:153
#11 0x00007f0d500e034c in StartChildProcess (type=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5531
#12 0x00007f0d500dacf1 in PostmasterMain (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1458
#13 0x00007f0d4ffcfc19 in main (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/main/main.c:202
Greetings,
Andres Freund