Crash in new pgstats code
mylodon just showed a new-to-me failure mode [1]:
Core was generated by `postgres: cascade: startup recovering 000000010000000000000002 '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1 0x00007f8b8db2d546 in __GI_abort () at abort.c:79
#2 0x000000000098a4dc in ExceptionalCondition (conditionName=<optimized out>, errorType=0x9e8061 "FailedAssertion", fileName=0xaf811f "/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c", lineNumber=lineNumber@entry=744) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/error/assert.c:69
#3 0x00000000006dbe65 in dshash_delete_current (status=status@entry=0x7fffec732dc8) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c:744
#4 0x000000000085f911 in pgstat_free_entry (shent=0x7f8b8b0fc320, hstat=0x7fffec732dc8) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:741
#5 pgstat_drop_entry_internal (shent=0x7f8b8b0fc320, hstat=hstat@entry=0x7fffec732dc8) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:773
#6 0x000000000085fa2e in pgstat_drop_all_entries () at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:887
#7 0x0000000000859301 in pgstat_reset_after_failure () at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:1631
#8 pgstat_discard_stats () at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:435
#9 0x0000000000555ae0 in StartupXLOG () at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/access/transam/xlog.c:5127
#10 0x00000000007a8ece in StartupProcessMain () at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/startup.c:267
#11 0x000000000079f44e in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/auxprocess.c:141
#12 0x00000000007a5891 in StartChildProcess (type=type@entry=StartupProcess) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5417
#13 0x00000000007a3ea0 in PostmasterMain (argc=argc@entry=4, argv=<optimized out>, argv@entry=0x1d1cc90) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1457
#14 0x00000000006f1b91 in main (argc=4, argv=0x1d1cc90) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:202
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {675836, 1001, 0 <repeats 26 times>}, _kill = {si_pid = 675836, si_uid = 1001}, _timer = {si_tid = 675836, si_overrun = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 675836, si_uid = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 675836, si_uid = 1001, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e9000a4ffc, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 4299262939132, si_fd = 0}}}
Probably deserves investigation.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2022-04-15%2011%3A51%3A35
I wrote:
mylodon just showed a new-to-me failure mode [1]:
Another occurrence here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-15%2022%3A42%3A07
I've added an open item.
regards, tom lane
Hi,
On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
mylodon just showed a new-to-me failure mode [1]:
Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
level), with the obvious fix.
This failed to fail in other tests because they all end up resetting
only when there's no stats. It's not too hard to write a test for that,
which is how I reproduced the issue.
I'm planning to make it a bit easier to test by verifying that 'E' in
pgstat_read_statsfile() actually is just before EOF. That seems like a
good check anyway.
What confuses me so far is what already had generated stats before
reaching pgstat_reset_after_failure() (so that the bug could even be hit
in t/025_stuck_on_old_timeline.pl).
Greetings,
Andres Freund
Hi,
On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
What confuses me so far is what already had generated stats before
reaching pgstat_reset_after_failure() (so that the bug could even be hit
in t/025_stuck_on_old_timeline.pl).
I see part of a problem - in archiver stats. Even in 14 (and presumably
before), we do work that can generate archiver stats
(e.g. ReadCheckpointRecord()) before pgstat_reset_all(). It's not the
end of the world, but doesn't seem great.
But since archiver stats are fixed-numbered stats (and thus not in the
hash table), they'd not trigger the backtrace we saw here.
One thing that's interesting is that the failing tests have:
2022-04-15 12:07:48.828 UTC [675922][walreceiver][:0] FATAL: could not link file "pg_wal/xlogtemp.675922" to "pg_wal/00000002.history": File exists
which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.
Greetings,
Andres Freund
Hi,
On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
mylodon just showed a new-to-me failure mode [1]:
Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
level), with the obvious fix.

This failed to fail in other tests because they all end up resetting
only when there's no stats. It's not too hard to write a test for that,
which is how I reproduced the issue.

I'm planning to make it a bit easier to test by verifying that 'E' in
pgstat_read_statsfile() actually is just before EOF. That seems like a
good check anyway.
I've pushed that fix.
What confuses me so far is what already had generated stats before
reaching pgstat_reset_after_failure() (so that the bug could even be hit
in t/025_stuck_on_old_timeline.pl).
But there's still things I don't understand about that aspect.
Greetings,
Andres Freund
On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.
Yeah, that should be exclusively one or the other, never both.
WaitForWALToBecomeAvailable() would be a hot spot when it comes to
deciding when a WAL receiver should be spawned by the startup process.
Apart from the recent refactoring of xlog.c and the WAL prefetch work,
there have not been many changes in this area lately.
--
Michael
On Mon, Apr 18, 2022 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.

Yeah, that should be exclusively one or the other, never both.
WaitForWALToBecomeAvailable() would be a hot spot when it comes to
deciding when a WAL receiver should be spawned by the startup process.
Apart from the recent refactoring of xlog.c and the WAL prefetch work,
there have not been many changes in this area lately.
Hmm, well I'm not sure what is happening here and will try to dig
tomorrow, but one observation from some log scraping is that kestrel
logged similar output with "could not link file" several times before
the main prefetching commit (5dc0418). I looked back 3 months on
kestrel/HEAD and found these:
commit | log
---------+-------------------------------------------------------------------------------------------------------------------
411b913 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-27%2010:57:20&stg=recovery-check
3d067c5 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-29%2017:52:32&stg=recovery-check
cd7ea75 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2015:25:03&stg=recovery-check
8e053dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2020:27:44&stg=recovery-check
4e34747 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-04%2020:32:24&stg=recovery-check
01effb1 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-06%2007:32:40&stg=recovery-check
fbfe691 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2005:10:05&stg=recovery-check
5dc0418 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2007:51:00&stg=recovery-check
bd037dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-11%2022:00:58&stg=recovery-check
a4b5754 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-12%2004:40:44&stg=recovery-check
7129a97 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-15%2022:42:07&stg=recovery-check
9f4f0a0 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-16%2020:05:34&stg=recovery-check
Hi,
On 2022-04-18 22:45:07 +1200, Thomas Munro wrote:
On Mon, Apr 18, 2022 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.

Yeah, that should be exclusively one or the other, never both.
WaitForWALToBecomeAvailable() would be a hot spot when it comes to
deciding when a WAL receiver should be spawned by the startup process.
Apart from the recent refactoring of xlog.c and the WAL prefetch work,
there have not been many changes in this area lately.

Hmm, well I'm not sure what is happening here and will try to dig
tomorrow, but one observation from some log scraping is that kestrel
logged similar output with "could not link file" several times before
the main prefetching commit (5dc0418). I looked back 3 months on
kestrel/HEAD and found these:
Kestrel won't go that far back even - I set it up 23 days ago...
I'm formally on vacation till Thursday, I'll try to look at earlier
instances then. Unless it's already figured out :). I failed at
reproducing it locally, despite a fair bit of effort.
The BF really should break out individual tests into their own stage
logs. The recovery-check stage is 13MB and 150k lines by now.
Greetings,
Andres Freund
On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
Kestrel won't go that far back even - I set it up 23 days ago...
Here's a ~6 month old example from mylodon (I can't see much further
back than that with HTTP requests... I guess BF records are purged?):
On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
Kestrel won't go that far back even - I set it up 23 days ago...
Here's a ~6 month old example from mylodon (I can't see much further
back than that with HTTP requests... I guess BF records are purged?):
Do we have anything remaining on this thread in light of the upcoming
beta1? One fix has been pushed upthread, but it does not seem we are
completely in the clear either.
--
Michael
Hi,
On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
Kestrel won't go that far back even - I set it up 23 days ago...
Here's a ~6 month old example from mylodon (I can't see much further
back than that with HTTP requests... I guess BF records are purged?):

Do we have anything remaining on this thread in light of the upcoming
beta1? One fix has been pushed upthread, but it does not seem we are
completely in the clear either.
I don't know what else there is to do, tbh.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
Do we have anything remaining on this thread in light of the upcoming
beta1? One fix has been pushed upthread, but it does not seem we are
completely in the clear either.
I don't know what else there is to do, tbh.
Well, it was mostly you expressing misgivings upthread ;-). But we
have not seen any pgstat crashes lately, so I'm content to mark the
open item as resolved.
regards, tom lane
Hi,
On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
Do we have anything remaining on this thread in light of the upcoming
beta1? One fix has been pushed upthread, but it does not seem we are
completely in the clear either.

I don't know what else there is to do, tbh.
Well, it was mostly you expressing misgivings upthread ;-).
Those mostly were about stuff in 14 as well... I guess it'd be good to figure
out precisely how the problem was triggered, but without further information
I don't quite see how to figure it out...
But we have not seen any pgstat crashes lately, so I'm content to mark the
open item as resolved.
Cool.
Greetings,
Andres Freund
On Thu, May 12, 2022 at 09:33:05AM -0700, Andres Freund wrote:
On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
But we have not seen any pgstat crashes lately, so I'm content to mark the
open item as resolved.

Cool.
Okay, thanks for the feedback. I have marked the item as resolved for
the time being. Let's revisit it later if necessary.
--
Michael