Weird failure in explain.out with OpenBSD

Started by Michael Paquierabout 5 years ago12 messages

michael@paquier.xyz

about 5 years ago

Hi all,

Buildfarm member gombessa just had an interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gombessa&dt=2020-12-29%2000%3A16%3A49
Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual
time=N.N..N.N rows=N loops=N)
Planning Time: N.N ms
- Execution Time: N.N ms
+ Execution Time: -N.N ms
(3 rows)

Not sure what to think about that, as this implies the calculation of
a negative execution time.
Regards,
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

about 5 years ago

In reply to: Michael Paquier (#1)

Re: Weird failure in explain.out with OpenBSD

Michael Paquier <michael@paquier.xyz> writes:

Buildfarm member gombessa just had an interesting failure:
- Execution Time: N.N ms
+ Execution Time: -N.N ms
Not sure what to think about that, as this implies the calculation of
a negative execution time.

Yeah, I saw that. I notice that gombessa uses

'HEAD' => [
'force_parallel_mode = regress'
]

so my guess is that something messed up in transmitting or combining a
parallel worker's execution time. No idea how to narrow it down more
than that.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

about 5 years ago

In reply to: Tom Lane (#2)

Re: Weird failure in explain.out with OpenBSD

I wrote:

... so my guess is that something messed up in transmitting or combining a
parallel worker's execution time.

Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]https://man.openbsd.org/OpenBSD-6.8/clock_gettime
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.

Tis puzzling, but it's hard to avoid the suspicion of a kernel bug.

regards, tom lane

[1]: https://man.openbsd.org/OpenBSD-6.8/clock_gettime

Michael Paquier

michael@paquier.xyz

about 5 years ago

In reply to: Tom Lane (#3)

Re: Weird failure in explain.out with OpenBSD

On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:

Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.

I was looking at that, and I agree that this looks like a monotonic
clock going backwards. Or could it be possible that it gave 0.0 as
result, still a minus sign was appended? That would mean an execution
that took less than 1us per the system clock.
--
Michael

Thomas Munro

thomas.munro@gmail.com

about 4 years ago

In reply to: Michael Paquier (#4)

Re: Weird failure in explain.out with OpenBSD

On Wed, Dec 30, 2020 at 8:17 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:

Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process. It's difficult to conclude anything except
that the clock went backwards. Which is weird, because according to [1]
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.

I was looking at that, and I agree that this looks like a monotonic
clock going backwards. Or could it be possible that it gave 0.0 as
result, still a minus sign was appended? That would mean an execution
that took less than 1us per the system clock.

If I'm reading this right, it might be further evidence of
CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
(the regular expression doesn't tolerate a leading minus sign, so the
test failed):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32

# Log entry not matching: 1 22 -124769 0 1636578000 152817

Michael Paquier

michael@paquier.xyz

about 4 years ago

In reply to: Thomas Munro (#5)

Re: Weird failure in explain.out with OpenBSD

On Thu, Nov 11, 2021 at 10:23:06AM +1300, Thomas Munro wrote:

If I'm reading this right, it might be further evidence of
CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
(the regular expression doesn't tolerate a leading minus sign, so the
test failed):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32

Yeah, I have noticed this failure yesterday.

morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.
--
Michael

Thomas Munro

thomas.munro@gmail.com

about 4 years ago

In reply to: Michael Paquier (#6)

Re: Weird failure in explain.out with OpenBSD

On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:

morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.

I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:

https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Some starter questions for Mikael would be: could you please check
which clock source it's using?, is it under a hypervisor, and if so
which?, what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)? Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?

I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-µs difference reported
earlier...

Andres Freund

andres@anarazel.de

about 4 years ago

In reply to: Thomas Munro (#7)

Re: Weird failure in explain.out with OpenBSD

Hi,

On 2021-11-11 14:15:33 +1300, Thomas Munro wrote:

Some starter questions for Mikael would be: could you please check
which clock source it's using?, is it under a hypervisor, and if so
which?, what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)? Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?

I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-ï¿½s difference reported
earlier...

One quite conceivable way could be a VM migration. Which quite a few hosters
can do behind ones back, to reduce downtimes in case of hardware maintenance
etc. If openbsd is using the TSC, and something in the stack isn't quite
dealing correctly with the necessary tsc offset / speed correction values
you'd expect to see something like this.

Greetings,

Andres Freund

Mikael Kjellström

mikael.kjellstrom@mksoft.nu

about 4 years ago

In reply to: Michael Paquier (#6)

Re: Weird failure in explain.out with OpenBSD

On 2021-11-11 01:15, Michael Paquier wrote:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32

Yeah, I have noticed this failure yesterday.

morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.

Just FYI. morepork was replaced with 6.9. It was earlier running 6.5.
It wasn't upgraded though it was completely installed from scratched.

6.9 was the latest version available when I installed it a month ago.
7.0 was released Oct 14 2021.

/Mikael

#10

Mikael Kjellström

mikael.kjellstrom@mksoft.nu

about 4 years ago

In reply to: Thomas Munro (#7)

Re: Weird failure in explain.out with OpenBSD

On 2021-11-11 02:15, Thomas Munro wrote:

On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:

morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right? I am wondering if 7.0 may help
regarding this issue. Postgres is not wrong here.

I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:

https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Hm, interesting.

Some starter questions for Mikael would be: could you please check
which clock source it's using?,

It's disabling TSC:

cpu1: disabling user TSC (skew=-4465)

so it seems to be using the acpitimer?

acpitimer0 at acpi0: 3579545 Hz, 24 bits

is it under a hypervisor, and if so
which?,

VMWare ESXI 6.7 latest updates installed.

what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)?

AMD Ryzen 5 3600X

Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?

Nope, not that I can see.

I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-µs difference reported
earlier...

It's running ntpd to keep time in sync if that matters?

/Mikael

#11

Mikael Kjellström

mikael.kjellstrom@mksoft.nu

about 4 years ago

In reply to: Andres Freund (#8)

Re: Weird failure in explain.out with OpenBSD

On 2021-11-11 04:35, Andres Freund wrote:

One quite conceivable way could be a VM migration. Which quite a few hosters
can do behind ones back, to reduce downtimes in case of hardware maintenance
etc. If openbsd is using the TSC, and something in the stack isn't quite
dealing correctly with the necessary tsc offset / speed correction values
you'd expect to see something like this.

No VM migration as this is a stand alone ESXI host.

It's not using TSC as clock source. It's using acpitimer0 at acpi0:
3579545 Hz, 24 bits.

/Mikael

#12

Mikael Kjellström

mikael.kjellstrom@mksoft.nu

about 4 years ago

In reply to: Thomas Munro (#7)

Re: Weird failure in explain.out with OpenBSD

On 2021-11-11 02:15, Thomas Munro wrote:

I dunno. Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics. Some discussion, may or may not
be relevant:

https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Just for fun I compiled and ran the test program and this was the result:

$ ./monotime
589047 Starting
284971 Starting
542819 Starting
315557 Starting
589047 Stopped
542819 Back 2728606.093473837 => 2728605.963205128
542819 Stopped
284971 Stopped
315557 Stopped

above was on 6.9

then I tried the same on my older 5.9 animal and this was the result:

./monotime
1003853 Starting
1004437 Starting
1032556 Starting
1001887 Starting
1004437 Stopped
1003853 Stopped
1032556 Stopped
1001887 Stopped

it's also running under the same VMWARE 6.7 instance / machine.

dmesg on that machine also indicates that it's using:

acpitimer0 at acpi0: 3579545 Hz, 24 bits

so looks like a kernel bug / regression in recent OpenBSD kernels then?

/Mikael