Weird failure in explain.out with OpenBSD
Hi all,
Buildfarm member gombessa just had an interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gombessa&dt=2020-12-29%2000%3A16%3A49
Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual
time=N.N..N.N rows=N loops=N)
Planning Time: N.N ms
- Execution Time: N.N ms
+ Execution Time: -N.N ms
(3 rows)
Not sure what to think about that, as this implies the calculation of
a negative execution time.
Regards,
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
Buildfarm member gombessa just had an interesting failure:
- Execution Time: N.N ms
+ Execution Time: -N.N ms
Not sure what to think about that, as this implies the calculation of
a negative execution time.
Yeah, I saw that. I notice that gombessa uses
'HEAD' => [
'force_parallel_mode = regress'
]
so my guess is that something messed up in transmitting or combining a
parallel worker's execution time. No idea how to narrow it down more
than that.
regards, tom lane
I wrote:
... so my guess is that something messed up in transmitting or combining a
parallel worker's execution time.
Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]https://man.openbsd.org/OpenBSD-6.8/clock_gettime
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.
Tis puzzling, but it's hard to avoid the suspicion of a kernel bug.
regards, tom lane
On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:
Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.
I was looking at that, and I agree that this looks like a monotonic
clock going backwards. Or could it be possible that it gave 0.0 as
result, still a minus sign was appended? That would mean an execution
that took less than 1us per the system clock.
--
Michael
On Wed, Dec 30, 2020 at 8:17 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:
Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.I was looking at that, and I agree that this looks like a monotonic
clock going backwards. Or could it be possible that it gave 0.0 as
result, still a minus sign was appended? That would mean an execution
that took less than 1us per the system clock.
If I'm reading this right, it might be further evidence of
CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
(the regular expression doesn't tolerate a leading minus sign, so the
test failed):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32
# Log entry not matching: 1 22 -124769 0 1636578000 152817
On Thu, Nov 11, 2021 at 10:23:06AM +1300, Thomas Munro wrote:
If I'm reading this right, it might be further evidence of
CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
(the regular expression doesn't tolerate a leading minus sign, so the
test failed):https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32
Yeah, I have noticed this failure yesterday.
morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.
--
Michael
On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:
morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.
I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:
https://marc.info/?l=openbsd-tech&m=161657532610882&w=2
Some starter questions for Mikael would be: could you please check
which clock source it's using?, is it under a hypervisor, and if so
which?, what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)? Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?
I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-µs difference reported
earlier...
Hi,
On 2021-11-11 14:15:33 +1300, Thomas Munro wrote:
Some starter questions for Mikael would be: could you please check
which clock source it's using?, is it under a hypervisor, and if so
which?, what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)? Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?
I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-�s difference reported
earlier...
One quite conceivable way could be a VM migration. Which quite a few hosters
can do behind ones back, to reduce downtimes in case of hardware maintenance
etc. If openbsd is using the TSC, and something in the stack isn't quite
dealing correctly with the necessary tsc offset / speed correction values
you'd expect to see something like this.
Greetings,
Andres Freund
On 2021-11-11 01:15, Michael Paquier wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32
Yeah, I have noticed this failure yesterday.
morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.
Just FYI. morepork was replaced with 6.9. It was earlier running 6.5.
It wasn't upgraded though it was completely installed from scratched.
6.9 was the latest version available when I installed it a month ago.
7.0 was released Oct 14 2021.
/Mikael
On 2021-11-11 02:15, Thomas Munro wrote:
On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:
morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:https://marc.info/?l=openbsd-tech&m=161657532610882&w=2
Hm, interesting.
Some starter questions for Mikael would be: could you please check
which clock source it's using?,
It's disabling TSC:
cpu1: disabling user TSC (skew=-4465)
so it seems to be using the acpitimer?
acpitimer0 at acpi0: 3579545 Hz, 24 bits
is it under a hypervisor, and if so
which?,
VMWare ESXI 6.7 latest updates installed.
what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)?
AMD Ryzen 5 3600X
Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?
Nope, not that I can see.
I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-µs difference reported
earlier...
It's running ntpd to keep time in sync if that matters?
/Mikael
On 2021-11-11 04:35, Andres Freund wrote:
One quite conceivable way could be a VM migration. Which quite a few hosters
can do behind ones back, to reduce downtimes in case of hardware maintenance
etc. If openbsd is using the TSC, and something in the stack isn't quite
dealing correctly with the necessary tsc offset / speed correction values
you'd expect to see something like this.
No VM migration as this is a stand alone ESXI host.
It's not using TSC as clock source. It's using acpitimer0 at acpi0:
3579545 Hz, 24 bits.
/Mikael
On 2021-11-11 02:15, Thomas Munro wrote:
I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:https://marc.info/?l=openbsd-tech&m=161657532610882&w=2
Just for fun I compiled and ran the test program and this was the result:
$ ./monotime
589047 Starting
284971 Starting
542819 Starting
315557 Starting
589047 Stopped
542819 Back 2728606.093473837 => 2728605.963205128
542819 Stopped
284971 Stopped
315557 Stopped
above was on 6.9
then I tried the same on my older 5.9 animal and this was the result:
./monotime
1003853 Starting
1004437 Starting
1032556 Starting
1001887 Starting
1004437 Stopped
1003853 Stopped
1032556 Stopped
1001887 Stopped
it's also running under the same VMWARE 6.7 instance / machine.
dmesg on that machine also indicates that it's using:
acpitimer0 at acpi0: 3579545 Hz, 24 bits
so looks like a kernel bug / regression in recent OpenBSD kernels then?
/Mikael