Timeout failure in 019_replslot_limit.pl

Started by Michael Paquier over 4 years ago, 20 messages
#1 Michael Paquier
michael@paquier.xyz

Hi all,

Running the recovery tests in a parallel run, enough to bloat a
machine in resources, sometimes leads me to the following failure:
ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

This corresponds to the following part of the test, where a WAL sender
is SIGSTOP'd and SIGCONT'd:
$node_primary3->poll_query_until('postgres',
"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
"lost")
or die "timed out waiting for slot to be lost";

There is already a default timeout of 180s applied as of the default
of PostgresNode::poll_query_until(), so it seems to me that there
could be a different issue hiding here.
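
For context, what that polling boils down to is roughly the following
(a simplified sketch, not the actual PostgresNode::poll_query_until()
implementation):
my $ok = 0;
foreach my $attempt (1 .. 180)
{
	my $out = $node_primary3->safe_psql('postgres',
		"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'");
	if ($out eq 'lost')
	{
		$ok = 1;
		last;
	}
	# Not "lost" yet, retry roughly once a second.
	sleep 1;
}
die "timed out waiting for slot to be lost" unless $ok;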

Thanks,
--
Michael

#2 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#1)
Re: Timeout failure in 019_replslot_limit.pl

Hello

On 2021-Sep-06, Michael Paquier wrote:

Running the recovery tests in a parallel run, enough to bloat a
machine in resources, sometimes leads me to the following failure:
ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

Hmm, I've never seen that, and I do run tests in parallel quite often.
Would you please attach the log files for that test in a failed run?

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL. Most of the surprises
are of the "oh wow! That's cool" Not the "oh shit!" kind. :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Timeout failure in 019_replslot_limit.pl

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2021-Sep-06, Michael Paquier wrote:

# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

Hmm, I've never seen that, and I do run tests in parallel quite often.

I scraped the buildfarm logs looking for similar failures, and didn't
find any. (019_replslot_limit.pl hasn't failed at all in the farm
since the last fix it received, in late July.) I wonder if Michael's
setup had any unusual settings.

regards, tom lane

#4 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#3)
1 attachment(s)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:

I scraped the buildfarm logs looking for similar failures, and didn't
find any. (019_replslot_limit.pl hasn't failed at all in the farm
since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_primary3.log. In a
failed run, I can see that we immediately get a process termination,
as follows:
2021-09-07 07:52:53.402 JST [22890] LOG: terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In a successful run, the pattern is different:
2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 09:27:39.832 JST [57092] LOG: invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoint complete: wrote
19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoints are occurring too frequently (0 seconds apart)
2021-09-07 09:27:39.833 JST [57092] HINT: Consider increasing the configuration parameter "max_wal_size".
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

The slot invalidation is forgotten because we don't complete a
checkpoint that does the work it should do, no? There is a completed
checkpoint before we query pg_replication_slots, and the buildfarm
shows the same thing.

I wonder if Michael's setup had any unusual settings.

The way I use configure and build options has caught bugs with code
ordering in the past, but this one looks like just a timing issue with
the test itself. I can only see that with Big Sur 11.5.2, and I just
got fresh logs this morning with a new failure, as of the attached.
--
Michael

Attachments:

replslot_inval.tar.gz (application/x-tar-gz)
#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#4)
1 attachment(s)
Re: Timeout failure in 019_replslot_limit.pl

At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:

I scraped the buildfarm logs looking for similar failures, and didn't
find any. (019_replslot_limit.pl hasn't failed at all in the farm
since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_primary3.log. In a
failed run, I can see that we immediately get a process termination,
as follows:
2021-09-07 07:52:53.402 JST [22890] LOG: terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In a successful run, the pattern is different:
2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 09:27:39.832 JST [57092] LOG: invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoint complete: wrote
19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoints are occurring too frequently (0 seconds apart)
2021-09-07 09:27:39.833 JST [57092] HINT: Consider increasing the configuration parameter "max_wal_size".
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

The slot invalidation is forgotten because we don't complete a
checkpoint that does the work it should do, no? There is a completed
checkpoint before we query pg_replication_slots, and the buildfarm
shows the same thing.

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL. The attached uses the 'ps' command to
check for that, since I didn't come up with a way to do the same in Perl.

I'm still not sure it works as expected, though. (Imagining the case
where the state changes before the process actually stops..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

make_019_to_wait_to_stop.diff (text/x-patch; charset=us-ascii)
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..7388900008 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+wait_to_stop($senderpid, $receiverpid);
+
 advance_wal($node_primary3, 2);
 
 my $max_attempts = 180;
@@ -426,3 +428,30 @@ sub find_in_log
 
 	return $log =~ m/$pat/;
 }
+
+sub wait_to_stop
+{
+	my (@pids) = @_;
+	my $max_attempts = 180;
+
+	# Haven't found the means to do the same on Windows
+	return if $TestLib::windows_os;
+
+	while ($max_attempts-- >= 0)
+	{
+		my ($all_stopped) = 1;
+
+		foreach my $pid (@pids)
+		{
+			if (`ps -p $pid -o pid,state | tail -1` !~ /^ *\d+ +T/)
+			{
+				$all_stopped = 0;
+				last;
+			}
+		}
+
+		last if ($all_stopped);
+
+		sleep 1;
+	}
+}
#6 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Kyotaro Horiguchi (#5)
Re: Timeout failure in 019_replslot_limit.pl

On 2021-Sep-07, Kyotaro Horiguchi wrote:

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL. The attached uses the 'ps' command to
check for that, since I didn't come up with a way to do the same in Perl.

Ah! so we tell the kernel to send the signal, but there's no guarantee
about the timing for the reaction from the other process. Makes sense.

Your proposal is to examine the other process' state until we see that
it gets the T flag. I wonder how portable this is; I suspect not very.
`ps` is pretty annoying, meaning not consistently implemented -- GNU's
manpage says there are "UNIX options", "BSD options" and "GNU long
options", so it seems hard to believe that there is one set of options
that will work everywhere.

I found a Perl module (Proc::ProcessTable) that can be used to get the
list of processes and their metadata, but it isn't in core Perl and it
doesn't look very well maintained either, so that one's out.

Another option might be to wait on the kernel -- do something that would
involve the kernel taking action on the other process, acting like a
barrier of sorts. I don't know if this actually works, but we could
try. Something like sending SIGSTOP first, then "kill 0" -- or just
send SIGSTOP twice:

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..e8f323066a 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+kill 'STOP', $senderpid, $receiverpid;
+
 advance_wal($node_primary3, 2);

my $max_attempts = 180;

+	# Haven't found the means to do the same on Windows
+	return if $TestLib::windows_os;

I suppose if it came down to something like your patch, we could do
something simple here like "if Windows, sleep 2s and hope for the best".
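
A minimal sketch of that fallback, assuming it would sit at the top of
the wait_to_stop() helper from the proposed patch:
if ($TestLib::windows_os)
{
	# No way found to observe the stopped state on Windows, so just
	# give signal delivery a couple of seconds and hope for the best.
	sleep 2;
	return;
}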

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

#7 Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#6)
Re: Timeout failure in 019_replslot_limit.pl

On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:

On 2021-Sep-07, Kyotaro Horiguchi wrote:

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL. The attached uses the 'ps' command to
check for that, since I didn't come up with a way to do the same in Perl.

Ah! so we tell the kernel to send the signal, but there's no guarantee
about the timing for the reaction from the other process. Makes sense.

Agreed.

Your proposal is to examine the other process' state until we see that
it gets the T flag. I wonder how portable this is; I suspect not very.
`ps` is pretty annoying, meaning not consistently implemented -- GNU's
manpage says there are "UNIX options", "BSD options" and "GNU long
options", so it seems hard to believe that there is one set of options
that will work everywhere.

I like this, and it's the most robust way. I agree there's no portable way,
so I'd modify it to be fail-open. Run a "ps" command that works on the OP's
system. If the output shows the process in a state matching [DRS], we can
confidently sleep a bit for signal delivery to finish. If the command fails
or prints something else (including state T, which we need to check explicitly),
assume SIGSTOP delivery is complete. If some other platform shows this race
in the future, we can add an additional "ps" command.
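
A rough sketch of that fail-open loop, assuming a BSD-style
"ps -p PID -o state=" works on the machine at hand (the helper name is
made up for illustration):
sub wait_for_sigstop_delivery
{
	my (@pids) = @_;

	foreach my $pid (@pids)
	{
		foreach my $attempt (1 .. 180)
		{
			my $state = `ps -p $pid -o state= 2>/dev/null`;
			$state = '' unless defined $state;
			$state =~ s/^\s+//;

			# Keep waiting only while ps clearly reports a runnable or
			# sleeping state (D, R or S); on state T, empty output, or a
			# ps failure, assume SIGSTOP delivery is complete.
			last if $state !~ /^[DRS]/;

			sleep 1;
		}
	}
	return;
}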

If we ever get the "stop events" system
(/messages/by-id/flat/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com),
it would be useful for crafting this kind of test without the problem seen here.

I found a Perl module (Proc::ProcessTable) that can be used to get the
list of processes and their metadata, but it isn't in core Perl and it
doesn't look very well maintained either, so that one's out.

Agreed, that one's out.

Another option might be to wait on the kernel -- do something that would
involve the kernel taking action on the other process, acting like a
barrier of sorts. I don't know if this actually works, but we could
try. Something like sending SIGSTOP first, then "kill 0" -- or just
send SIGSTOP twice:

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..e8f323066a 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
+kill 'STOP', $senderpid, $receiverpid;
+
advance_wal($node_primary3, 2);

my $max_attempts = 180;

If this fixes things for the OP, I'd like it slightly better than the "ps"
approach. It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a proxy
Perl script. Then, make that proxy able to accept an instruction to pause
passing data until further notice. However, I like two of your options better
than this one.

#8 Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#7)
Re: Timeout failure in 019_replslot_limit.pl

On Fri, Sep 17, 2021 at 08:41:00PM -0700, Noah Misch wrote:

If this fixes things for the OP, I'd like it slightly better than the "ps"
approach. It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a proxy
Perl script. Then, make that proxy able to accept an instruction to pause
passing data until further notice. However, I like two of your options better
than this one.

Could it be possible to rely on a combination of wait events set in WAL
senders and pg_stat_replication to assume that a WAL sender is in a
stopped state? Off the top of my head, I would think about something
like the following (perhaps this would need two WAL senders, one stopped
and one still running); a rough sketch of the first two steps follows
the list:
1) SIGSTOP WAL sender 1.
2) Check that WAL sender 1 is in WalSenderMain. If not, retry 1) after a
SIGCONT.
3) Generate some WAL, and look at pg_stat_replication to see whether the
WAL sender from 1) has made any progress once 2) is done.
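
A rough sketch of steps 1) and 2), assuming the wait event is read
through pg_stat_activity and that a SIGSTOP'd backend keeps reporting
the wait event it was frozen in:
foreach my $attempt (1 .. 180)
{
	kill 'STOP', $senderpid;
	my $wait_event = $node_primary3->safe_psql('postgres',
		"SELECT wait_event FROM pg_stat_activity WHERE pid = $senderpid");
	last if $wait_event eq 'WalSenderMain';

	# Frozen somewhere other than the main loop: resume it and retry.
	kill 'CONT', $senderpid;
	sleep 1;
}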
--
Michael

#9 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#8)
Re: Timeout failure in 019_replslot_limit.pl

On 2021-Sep-18, Michael Paquier wrote:

Could it be possible to rely on a combination of wait events set in WAL
senders and pg_stat_replication to assume that a WAL sender is in a
stopped state?

Hmm, that sounds like a possibly useful idea to explore, but I would only do so if
the other ideas prove fruitless, because it sounds like it'd have more
moving parts. Can you please first test if the idea of sending the signal
twice is enough? If that doesn't work, let's try Horiguchi-san's idea
of using some `ps` flags to find the process.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".

#10 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#9)
Re: Timeout failure in 019_replslot_limit.pl

On Sat, Sep 18, 2021 at 05:19:04PM -0300, Alvaro Herrera wrote:

Hmm, that sounds like a possibly useful idea to explore, but I would only do so if
the other ideas prove fruitless, because it sounds like it'd have more
moving parts. Can you please first test if the idea of sending the signal
twice is enough?

This idea does not work. I got one failure after 5 tries.

If that doesn't work, let's try Horiguchi-san's idea
of using some `ps` flags to find the process.

Tried this one as well, to see the same failure. I was just looking
at the state of the test while it was querying pg_replication_slots
and that was the expected state after the WAL sender received SIGCONT:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
toto 12663 0.0 0.0 5014468 3384 ?? Ss 8:30PM 0:00.00 postgres: primary3: walsender toto [local] streaming 0/720000
toto 12662 0.0 0.0 4753092 3936 ?? Ts 8:30PM 0:00.01 postgres: standby_3: walreceiver streaming 0/7000D8

The test gets the right PIDs, as the logs showed:
ok 17 - have walsender pid 12663
ok 18 - have walreceiver pid 12662

So it does not seem that this is an issue with the signals.
Perhaps we'd better wait for a checkpoint to complete, for example by
scanning the logs, before running the query on pg_replication_slots, to
make sure that the slot has been invalidated?
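
A minimal sketch of that idea, reusing the test's find_in_log() helper
and the invalidation message visible in the successful logs above:
my $invalidated = 0;
foreach my $attempt (1 .. 180)
{
	if (find_in_log($node_primary3,
			'invalidating slot "rep3" because its restart_lsn', $logstart))
	{
		$invalidated = 1;
		last;
	}
	sleep 1;
}
$invalidated or die "timed out waiting for slot invalidation to be logged";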
--
Michael

#11 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#10)
Re: Timeout failure in 019_replslot_limit.pl

On 2021-Sep-20, Michael Paquier wrote:

Can you please first test if the idea of sending the signal twice is
enough?

This idea does not work. I got one failure after 5 tries.

OK, thanks for taking the time to test it.

If that doesn't work, let's try Horiguchi-san's idea of using some
`ps` flags to find the process.

Tried this one as well, to see the same failure.

Hmm, do you mean that you used Horiguchi-san's patch in [1] and the
failure still occurred?
[1]: /messages/by-id/20210907.120106.1483239898065111540.horikyota.ntt@gmail.com

I was just looking at the state of the test while it was querying
pg_replication_slots and that was the expected state after the WAL
sender received SIGCONT:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
toto 12663 0.0 0.0 5014468 3384 ?? Ss 8:30PM 0:00.00 postgres: primary3: walsender toto [local] streaming 0/720000
toto 12662 0.0 0.0 4753092 3936 ?? Ts 8:30PM 0:00.01 postgres: standby_3: walreceiver streaming 0/7000D8

The test gets the right PIDs, as the logs showed:
ok 17 - have walsender pid 12663
ok 18 - have walreceiver pid 12662

As I understood, Horiguchi-san's point isn't that the PIDs might be
wrong -- the point is to make sure that the process is in state T before
moving on to the next step in the test.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#12 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#11)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:

On 2021-Sep-20, Michael Paquier wrote:

If that doesn't work, let's try Horiguchi-san's idea of using some
`ps` flags to find the process.

Tried this one as well, to see the same failure.

Hmm, do you mean that you used Horiguchi-san's patch in [1] and the
failure still occurred?
[1] /messages/by-id/20210907.120106.1483239898065111540.horikyota.ntt@gmail.com

Yes, that's what I mean.
--
Michael

#13 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#11)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:

On 2021-Sep-20, Michael Paquier wrote:

The test gets the right PIDs, as the logs showed:
ok 17 - have walsender pid 12663
ok 18 - have walreceiver pid 12662

As I understood, Horiguchi-san's point isn't that the PIDs might be
wrong -- the point is to make sure that the process is in state T before
moving on to the next step in the test.

I have spent more time on this issue, as it looks like I am the only
one with an environment able to reproduce it (Big Sur 11.6).

As far as I can see, the states of the WAL sender and receiver are
fine: after adding some extra debugging with ps called from the test
itself, I have checked that they are correctly SIGSTOP'd and SIGCONT'd
when a failure shows up.

In a run that passes, we have the following sequence:
- Start checkpoint.
- SIGSTOP sent to WAL sender and receiver.
- Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal)
- Check that WAL sender is stopped
- SIGCONT on WAL sender.
- Invalidate the slot, checkpoint completes.
- Check state of pg_replication_slots.
- Check that slot invalidation happened in the logs.

A failed run starts a checkpoint, but never completes it by the time
we reach the check on pg_replication_slots, so the slot is never
invalidated and the test remains stuck. I have played a bit with the
test, moving the location of the "slot invalidation logged" check that
watches the logs, and the test then fails to find the slot
invalidation, as a result of the checkpoint not finishing.

To keep the instance around for debugging, I have just launched an
extra checkpoint after the SIGCONT sent to the WAL sender. It remains
stuck as an effect of the first one:
kill 'CONT', $senderpid;
+$node_primary3->safe_psql('postgres', 'checkpoint;');

With that, I am able to grab the checkpointer of primary3 to see where
it is waiting:
* frame #0: 0x00007fff204f8c4a libsystem_kernel.dylib`kevent + 10
frame #1: 0x0000000105a81a43 postgres`WaitEventSetWaitBlock(set=0x00007fb87f008748, cur_timeout=-1, occurred_events=0x00007ffeea765400, nevents=1) at latch.c:1601:7
frame #2: 0x0000000105a80fd0 postgres`WaitEventSetWait(set=0x00007fb87f008748, timeout=-1, occurred_events=0x00007ffeea765400, nevents=1, wait_event_info=134217769) at latch.c:1396:8
frame #3: 0x0000000105a80b46 postgres`WaitLatch(latch=0x00000001069ae7a4, wakeEvents=33, timeout=-1, wait_event_info=134217769) at latch.c:473:6
frame #4: 0x0000000105a97011 postgres`ConditionVariableTimedSleep(cv=0x00000001069d8860, timeout=-1, wait_event_info=134217769) at condition_variable.c:163:10
frame #5: 0x0000000105a96f32 postgres`ConditionVariableSleep(cv=0x00000001069d8860, wait_event_info=134217769) at condition_variable.c:100:9
frame #6: 0x0000000105a299cf postgres`InvalidatePossiblyObsoleteSlot(s=0x00000001069d8780, oldestLSN=8388608, invalidated=0x00007ffeea76559f) at slot.c:1264:4
frame #7: 0x0000000105a296bd postgres`InvalidateObsoleteReplicationSlots(oldestSegno=8) at slot.c:1333:7
frame #8: 0x00000001055edbe6 postgres`CreateCheckPoint(flags=192) at xlog.c:9275:6
frame #9: 0x00000001059b753d postgres`CheckpointerMain at checkpointer.c:448:5
frame #10: 0x00000001059b470d postgres`AuxiliaryProcessMain(auxtype=CheckpointerProcess) at auxprocess.c:153:4
frame #11: 0x00000001059c8912 postgres`StartChildProcess(type=CheckpointerProcess) at postmaster.c:5498:3
frame #12: 0x00000001059c68fe postgres`PostmasterMain(argc=4, argv=0x00007fb87e505400) at postmaster.c:1458:21
frame #13: 0x000000010589e1bf postgres`main(argc=4, argv=0x00007fb87e505400) at main.c:198:3
frame #14: 0x00007fff20544f3d libdyld.dylib`start + 1

So there is really something fishy here IMO, something other than just
a test mis-design; it looks like a race condition, perhaps around
InvalidateObsoleteReplicationSlots().

Thoughts?
--
Michael

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#13)
Re: Timeout failure in 019_replslot_limit.pl

On Wed, Sep 22, 2021 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:

On 2021-Sep-20, Michael Paquier wrote:

The test gets the right PIDs, as the logs showed:
ok 17 - have walsender pid 12663
ok 18 - have walreceiver pid 12662

As I understood, Horiguchi-san's point isn't that the PIDs might be
wrong -- the point is to make sure that the process is in state T before
moving on to the next step in the test.

I have spent more time on this issue, as it looks that I am the only
one with an environment able to reproduce it (Big Sur 11.6).

As far as I can see, the states of the WAL sender and receiver are
fine, after adding some extra debugging with ps called from the test
itself, and I have checked that they are SIGSTOP'd or SIGCONT'd when a
failure shows up.

In a sequence that passes, we have the following sequence:
- Start checkpoint.
- SIGSTOP sent to WAL sender and receiver.
- Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal)
- Check that WAL sender is stopped
- SIGCONT on WAL sender.

Am I understanding correctly that after sending SIGCONT to the WAL
sender, the checkpoint's SIGTERM signal for the WAL sender is received
and it releases the slot and terminates itself?

- Invalidate the slot, checkpoint completes.

After which checkpoint invalidates the slot and completes.

Now, in the failed run, it appears that for some reason the WAL sender
has not released the slot. Is it possible to see if the WAL sender is
still alive when a checkpoint is stuck at ConditionVariableSleep? And
if it is active, what is its call stack?

--
With Regards,
Amit Kapila.

#15 Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#14)
Re: Timeout failure in 019_replslot_limit.pl

On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:

Now, in the failed run, it appears that due to some reason WAL sender
has not released the slot. Is it possible to see if the WAL sender is
still alive when a checkpoint is stuck at ConditionVariableSleep? And
if it is active, what is its call stack?

I got another failure today, so I used the occasion to check that
when the checkpoint gets stuck, the WAL sender process that received
SIGCONT is still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
* frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

It logs its FATAL "terminating connection due to administrator
command" coming from ProcessInterrupts(), and then it sits idle on
ClientWrite.
--
Michael

#16 Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#15)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:

Now, in the failed run, it appears that due to some reason WAL sender
has not released the slot. Is it possible to see if the WAL sender is
still alive when a checkpoint is stuck at ConditionVariableSleep? And
if it is active, what is its call stack?

I got another failure today, so I used the occasion to check that
when the checkpoint gets stuck, the WAL sender process that received
SIGCONT is still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3

..
..

frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

It logs its FATAL "terminating connection due to administrator
command" coming from ProcessInterrupts(), and then it sits idle on
ClientWrite.

So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

If so, this indicates write failure which seems odd to me and probably
something machine-specific or maybe some different settings in your
build or machine. BTW, if SSL or GSS is enabled that might have caused
it in some way. I think the best way is to debug the secure_write
during this occurrence.

--
With Regards,
Amit Kapila.

#17 Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#16)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:

So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

Yep.

If so, this indicates write failure which seems odd to me and probably
something machine-specific or maybe some different settings in your
build or machine. BTW, if SSL or GSS is enabled that might have caused
it in some way. I think the best way is to debug the secure_write
during this occurrence.

Yeah, but we don't use any of them in the context of this test, so
this is something on a simple send(), no? Hmm. That would not be the
first issue we see with macOS these days with interrupted syscalls...
And actually in this stack I can see that errno gets set to EINTR.
--
Michael

#18 Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#17)
Re: Timeout failure in 019_replslot_limit.pl

On Mon, Sep 27, 2021 at 12:13 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:

So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

Yep.

If so, this indicates write failure which seems odd to me and probably
something machine-specific or maybe some different settings in your
build or machine. BTW, if SSL or GSS is enabled that might have caused
it in some way. I think the best way is to debug the secure_write
during this occurrence.

Yeah, but we don't use any of them in the context of this test, so
this is something on a simple send(), no? Hmm. That would not be the
first issue we see with macOS these days with interrupted syscalls...
And actually in this stack I can see that errno gets set to EINTR.

If errno is EINTR, then how would the code pass the above if check as
it has a condition ((errno == EWOULDBLOCK || errno == EAGAIN))?

--
With Regards,
Amit Kapila.

#19 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#15)
Re: Timeout failure in 019_replslot_limit.pl

On 2021-Sep-27, Michael Paquier wrote:

I got again a failure today, so I have used this occasion to check that
when the checkpoint gets stuck the WAL sender process getting SIGCONT
is still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
* frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

Ah, so the problem here is that the walsender is not exiting. That also
causes the checkpointer to hang waiting for it. I wonder if this is
related to the problem reported in
/messages/by-id/adce2c09-3bfc-4666-997a-c21991cb1eb1.mengjuan.cmj@alibaba-inc.com
A patch was proposed on that thread on September 22nd; can you try with
that and see if this problem still reproduces?

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).

#20 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#19)
Re: Timeout failure in 019_replslot_limit.pl

On Sat, Oct 02, 2021 at 07:00:01PM -0300, Alvaro Herrera wrote:

A patch was proposed on that thread on September 22nd; can you try with
that and see if this problem still reproduces?

Yes, the failure still shows up, even with a timeout set at 30s, which
is the default of the patch.
--
Michael