"ERROR: latch already owned" on gharial

Started by Thomas Munro, almost 4 years ago, 24 messages, hackers
#1Thomas Munro
thomas.munro@gmail.com

Hi,

A couple of recent isolation test failures reported $SUBJECT.

It could be a bug in recent-ish latch refactoring work, though I don't
know why it would show up twice just recently.

Just BTW, that animal has shown signs of a flaky toolchain before[1].
I know we have quite a lot of museum exhibits in the 'farm, in terms
of hardware, OS, and tool chain. In some cases, they're probably just
forgotten/not on anyone's upgrade radar. If they've shown signs of
misbehaving, maybe it's time to figure out if they can be upgraded?
For example, it'd be nice to be able to rule out problems in GCC 4.6.0
(that's like running PostgreSQL 9.1.0, in terms of vintage,
unsupported status, and long list of missing bugfixes from the time
when it was supported).

[1]: /messages/by-id/CA+hUKGJK5R0S1LL_W4vEzKxNQGY_xGAQ1XknR-WN9jqQeQtB_w@mail.gmail.com

#2Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#1)
Re: "ERROR: latch already owned" on gharial

Hi,

On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:

A couple of recent isolation test failures reported $SUBJECT.

Was that just on gharial?

It could be a bug in recent-ish latch refactoring work, though I don't
know why it would show up twice just recently.

Yea, that's weird.

Just BTW, that animal has shown signs of a flaky toolchain before[1].
I know we have quite a lot of museum exhibits in the 'farm, in terms
of hardware, OS, and tool chain. In some cases, they're probably just
forgotten/not on anyone's upgrade radar. If they've shown signs of
misbehaving, maybe it's time to figure out if they can be upgraded?
For example, it'd be nice to be able to rule out problems in GCC 4.6.0
(that's like running PostgreSQL 9.1.0, in terms of vintage,
unsupported status, and long list of missing bugfixes from the time
when it was supported).

Yea. gcc 4.6.0 is pretty ridiculous - the only thing we gain by testing with a
.0 compiler of that vintage is pain. Could it be upgraded?

TBH, I think we should just desupport HPUX. It's makework to support it at
this point. 11.31 v3 is about to be old enough to drink in quite a few
countries...

Greetings,

Andres Freund

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: "ERROR: latch already owned" on gharial

Andres Freund <andres@anarazel.de> writes:

On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:

I know we have quite a lot of museum exhibits in the 'farm, in terms
of hardware, OS, and tool chain. In some cases, they're probably just
forgotten/not on anyone's upgrade radar. If they've shown signs of
misbehaving, maybe it's time to figure out if they can be upgraded?

TBH, I think we should just desupport HPUX.

I think there's going to be a significant die-off of old BF animals
when (if?) we convert over to the meson build system; it's just not
going to be worth the trouble to upgrade those platforms to be able
to run meson and ninja. I'm inclined to wait until that's over and
see what's still standing before we make decisions about officially
desupporting things.

regards, tom lane

#4Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#2)
Re: "ERROR: latch already owned" on gharial

On Tue, May 24, 2022 at 06:24:39PM -0700, Andres Freund wrote:

On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:

Just BTW, that animal has shown signs of a flaky toolchain before[1].
I know we have quite a lot of museum exhibits in the 'farm, in terms
of hardware, OS, and tool chain. In some cases, they're probably just
forgotten/not on anyone's upgrade radar. If they've shown signs of
misbehaving, maybe it's time to figure out if they can be upgraded?
For example, it'd be nice to be able to rule out problems in GCC 4.6.0
(that's like running PostgreSQL 9.1.0, in terms of vintage,
unsupported status, and long list of missing bugfixes from the time
when it was supported).

Yea. gcc 4.6.0 is pretty ridiculous - the only thing we gain by testing with a
.0 compiler of that vintage is pain. Could it be upgraded?

+1, this is at least the third non-obvious miscompilation from gharial.
Installing the latest GCC that builds easily (perhaps GCC 10.3) would make
this a good buildfarm member again. If that won't happen, at least add a note
to the animal like described in
/messages/by-id/20211109144021.GD940092@rfd.leadboat.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#4)
Re: "ERROR: latch already owned" on gharial

Noah Misch <noah@leadboat.com> writes:

+1, this is at least the third non-obvious miscompilation from gharial.

Is there any evidence that this is a compiler-sourced problem?
Maybe it is, but it's sure not obvious to me (he says, eyeing his
buildfarm animals with even older gcc versions).

regards, tom lane

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#5)
Re: "ERROR: latch already owned" on gharial

On Thu, May 26, 2022 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

+1, this is at least the third non-obvious miscompilation from gharial.

Is there any evidence that this is a compiler-sourced problem?
Maybe it is, but it's sure not obvious to me (he says, eyeing his
buildfarm animals with even older gcc versions).

Sorry for the ambiguity -- I have no evidence of miscompilation. My
"just BTW" paragraph was a reaction to the memory of the last couple
of times Noah and I wasted hours chasing red herrings on this system,
which is pretty demotivating when looking into an unexplained failure.

On a more practical note, I don't have access to the BF database right
now. Would you mind checking if "latch already owned" has occurred on
any other animals?

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#6)
Re: "ERROR: latch already owned" on gharial

Thomas Munro <thomas.munro@gmail.com> writes:

Sorry for the ambiguity -- I have no evidence of miscompilation. My
"just BTW" paragraph was a reaction to the memory of the last couple
of times Noah and I wasted hours chasing red herrings on this system,
which is pretty demotivating when looking into an unexplained failure.

I can't deny that those HPUX animals have produced more than their
fair share of problems.

On a more practical note, I don't have access to the BF database right
now. Would you mind checking if "latch already owned" has occurred on
any other animals?

Looking back 6 months, these are the only occurrences of that string
in failed tests:

sysname | branch | snapshot | stage | l
---------+--------+---------------------+----------------+-------------------------------------------------------------------
gharial | HEAD | 2022-04-28 23:37:51 | Check | 2022-04-28 18:36:26.981 MDT [22642:1] ERROR: latch already owned
gharial | HEAD | 2022-05-06 11:33:11 | IsolationCheck | 2022-05-06 10:10:52.727 MDT [7366:1] ERROR: latch already owned
gharial | HEAD | 2022-05-24 06:31:31 | IsolationCheck | 2022-05-24 02:44:51.850 MDT [13089:1] ERROR: latch already owned
(3 rows)

regards, tom lane

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#7)
Re: "ERROR: latch already owned" on gharial

On Thu, May 26, 2022 at 2:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On a more practical note, I don't have access to the BF database right
now. Would you mind checking if "latch already owned" has occurred on
any other animals?

Looking back 6 months, these are the only occurrences of that string
in failed tests:

sysname | branch | snapshot | stage | l
---------+--------+---------------------+----------------+-------------------------------------------------------------------
gharial | HEAD | 2022-04-28 23:37:51 | Check | 2022-04-28 18:36:26.981 MDT [22642:1] ERROR: latch already owned
gharial | HEAD | 2022-05-06 11:33:11 | IsolationCheck | 2022-05-06 10:10:52.727 MDT [7366:1] ERROR: latch already owned
gharial | HEAD | 2022-05-24 06:31:31 | IsolationCheck | 2022-05-24 02:44:51.850 MDT [13089:1] ERROR: latch already owned
(3 rows)

Thanks. Hmm. So far it's always a parallel worker. The best idea I
have is to include the ID of the mystery PID in the error message and
see if that provides a clue next time.
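
The idea above, reporting which PID already holds the latch, can be sketched in C roughly as follows. This is a minimal illustration, not the attached patch and not PostgreSQL source: `DemoLatch`, `demo_own_latch`, and the error-buffer interface are hypothetical names invented here. The only detail taken from the thread is that the message should carry the current owner's PID.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a shared latch; not PostgreSQL's real struct. */
typedef struct DemoLatch
{
	int			is_shared;
	int			owner_pid;		/* 0 means "not owned" */
} DemoLatch;

/*
 * Try to take ownership of the latch on behalf of my_pid.
 * Returns 0 on success. If the latch is already owned, writes a
 * diagnostic naming the current owner into errbuf and returns -1.
 */
int
demo_own_latch(DemoLatch *latch, int my_pid, char *errbuf, size_t errlen)
{
	int			owner_pid;

	assert(latch->is_shared);

	/* Read the owner once, so the value reported is the value tested. */
	owner_pid = latch->owner_pid;
	if (owner_pid != 0)
	{
		snprintf(errbuf, errlen, "latch already owned by PID %d", owner_pid);
		return -1;
	}

	latch->owner_pid = my_pid;
	return 0;
}
```

Copying `owner_pid` into a local before the test means the PID in the message is the one that actually failed the check, which matters if another process could change the field in between.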

Attachments:

log-pid.patch (text/x-patch; charset=US-ASCII; +7 -2)

#9Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#8)
Re: "ERROR: latch already owned" on gharial

On Fri, May 27, 2022 at 7:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks. Hmm. So far it's always a parallel worker. The best idea I
have is to include the ID of the mystery PID in the error message and
see if that provides a clue next time.

What I'm inclined to do is get gharial and anole removed from the
buildfarm. anole was set up by Heikki in 2011. I don't know when
gharial was set up, or by whom. I don't think anyone at EDB cares
about these machines any more, or has any interest in maintaining
them. I think the only reason they're still running is that, just by
good fortune, they haven't fallen over and died yet. The hardest part
of getting them taken out of the buildfarm is likely to be finding
someone who has a working username and password to log into them and
take the jobs out of the crontab.

If someone really cares about figuring out what's going on here, it's
probably possible to get someone who is an EDB employee access to the
box to chase it down. But I'm having a hard time understanding what
value we get out of that given that the machines are running an
11-year-old compiler version on discontinued hardware on a
discontinued operating system. Even if we find a bug in PostgreSQL,
it's likely to be a bug that only matters on systems nobody cares
about.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: "ERROR: latch already owned" on gharial

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 27, 2022 at 7:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks. Hmm. So far it's always a parallel worker. The best idea I
have is to include the ID of the mystery PID in the error message and
see if that provides a clue next time.

... Even if we find a bug in PostgreSQL,
it's likely to be a bug that only matters on systems nobody cares
about.

That's possible, certainly. It's also possible that it's a real bug
that so far has only manifested there for (say) timing reasons.
The buildfarm is not so large that we can write off single-machine
failures as being unlikely to hit in the real world.

What I'd suggest is to promote that failure to elog(PANIC), which
would at least give us the PID and if we're lucky a stack trace.
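
To illustrate why promoting the report to PANIC can yield more evidence, here is a simplified sketch of the two severity levels. It assumes the general behaviour of these levels in PostgreSQL: ERROR unwinds to a recovery point and the process carries on, while PANIC aborts the process, so the OS can write a core file if core dumps are enabled. The names `demo_report`, `demo_try`, and `demo_recovery_point` are hypothetical, and none of this is PostgreSQL's actual elog code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

enum demo_level { DEMO_ERROR, DEMO_PANIC };

/* Hypothetical recovery point, standing in for an exception stack. */
jmp_buf		demo_recovery_point;

/* Log the message, then either unwind (ERROR) or kill the process (PANIC). */
void
demo_report(enum demo_level level, const char *msg, int pid)
{
	assert(msg != NULL);

	fprintf(stderr, "%s: %s (PID %d)\n",
			level == DEMO_PANIC ? "PANIC" : "ERROR", msg, pid);

	if (level == DEMO_PANIC)
		abort();				/* raises SIGABRT; a core dump may follow */

	/* ERROR: jump back to the recovery point; the process survives. */
	longjmp(demo_recovery_point, 1);
}

/* Returns 1 if the report was recoverable; does not return for PANIC. */
int
demo_try(enum demo_level level)
{
	if (setjmp(demo_recovery_point) == 0)
	{
		demo_report(level, "latch already owned", 1234);
		return 0;				/* not reached */
	}
	return 1;
}
```

With ERROR the failing process keeps running and the state that caused the failure is lost. With PANIC the process stops at the point of failure, which is what makes a stack trace possible.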

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: "ERROR: latch already owned" on gharial

On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's possible, certainly. It's also possible that it's a real bug
that so far has only manifested there for (say) timing reasons.
The buildfarm is not so large that we can write off single-machine
failures as being unlikely to hit in the real world.

What I'd suggest is to promote that failure to elog(PANIC), which
would at least give us the PID and if we're lucky a stack trace.

That proposed change is fine with me.

As to the question of whether it's a real bug, nobody can prove
anything unless we actually run it down. It's just a question of what
you think the odds are. Noah's PGCon talk a few years back on the long
tail of buildfarm failures convinced me (perhaps unintentionally) that
low-probability failures that occur only on obscure systems or
configurations are likely not worth running down, because while they
COULD be real bugs, a lot of them aren't, and the time it would take
to figure it out could be spent on other things - for instance, fixing
things that we know for certain are bugs. Spending 40 hours of
person-time on something with a 10% chance of being a bug in the
PostgreSQL code doesn't necessarily make sense to me, because while
you are correct that the buildfarm isn't that large, neither is the
developer community.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#11)
Re: "ERROR: latch already owned" on gharial

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I'd suggest is to promote that failure to elog(PANIC), which
would at least give us the PID and if we're lucky a stack trace.

That proposed change is fine with me.

As to the question of whether it's a real bug, nobody can prove
anything unless we actually run it down.

Agreed, and I'll even grant your point that if it is an HPUX-specific
or IA64-specific bug, it is not worth spending huge amounts of time
to isolate. The problem is that we don't know that. What we do know
so far is that if it can occur elsewhere, it's rare --- so we'd better
be prepared to glean as much info as possible if we do get such a
failure. Hence my thought of s/ERROR/PANIC/. And I'd be in favor of
any other low-effort change we can make to instrument the case better.

regards, tom lane

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#12)
Re: "ERROR: latch already owned" on gharial

On Sat, May 28, 2022 at 8:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I'd suggest is to promote that failure to elog(PANIC), which
would at least give us the PID and if we're lucky a stack trace.

That proposed change is fine with me.

As to the question of whether it's a real bug, nobody can prove
anything unless we actually run it down.

Agreed, and I'll even grant your point that if it is an HPUX-specific
or IA64-specific bug, it is not worth spending huge amounts of time
to isolate. The problem is that we don't know that. What we do know
so far is that if it can occur elsewhere, it's rare --- so we'd better
be prepared to glean as much info as possible if we do get such a
failure. Hence my thought of s/ERROR/PANIC/. And I'd be in favor of
any other low-effort change we can make to instrument the case better.

OK, pushed (except I realised that all the PIDs involved were int, not
pid_t). Let's see...

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#9)
Re: "ERROR: latch already owned" on gharial

On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:

What I'm inclined to do is get gharial and anole removed from the
buildfarm. anole was set up by Heikki in 2011. I don't know when
gharial was set up, or by whom. I don't think anyone at EDB cares
about these machines any more, or has any interest in maintaining
them. I think the only reason they're still running is that, just by
good fortune, they haven't fallen over and died yet. The hardest part
of getting them taken out of the buildfarm is likely to be finding
someone who has a working username and password to log into them and
take the jobs out of the crontab.

FWIW, in a previous investigation, Semab and Sandeep had access:

/messages/by-id/CABimMB4mRs9N3eivR-=qF9M8oWc5E6OX7GywsWF0DXN4P5gNEA@mail.gmail.com

#15Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#14)
Re: "ERROR: latch already owned" on gharial

On Mon, May 30, 2022 at 8:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:

What I'm inclined to do is get gharial and anole removed from the
buildfarm. anole was set up by Heikki in 2011. I don't know when
gharial was set up, or by whom. I don't think anyone at EDB cares
about these machines any more, or has any interest in maintaining
them. I think the only reason they're still running is that, just by
good fortune, they haven't fallen over and died yet. The hardest part
of getting them taken out of the buildfarm is likely to be finding
someone who has a working username and password to log into them and
take the jobs out of the crontab.

FWIW, in a previous investigation, Semab and Sandeep had access:

/messages/by-id/CABimMB4mRs9N3eivR-=qF9M8oWc5E6OX7GywsWF0DXN4P5gNEA@mail.gmail.com

Yeah, I'm in touch with Sandeep but not able to get in yet for some
reason. Will try to sort it out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#15)
Re: "ERROR: latch already owned" on gharial

On Tue, May 31, 2022 at 8:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, May 30, 2022 at 8:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:

What I'm inclined to do is get gharial and anole removed from the
buildfarm. anole was set up by Heikki in 2011. I don't know when
gharial was set up, or by whom. I don't think anyone at EDB cares
about these machines any more, or has any interest in maintaining
them. I think the only reason they're still running is that, just by
good fortune, they haven't fallen over and died yet. The hardest part
of getting them taken out of the buildfarm is likely to be finding
someone who has a working username and password to log into them and
take the jobs out of the crontab.

FWIW, in a previous investigation, Semab and Sandeep had access:

/messages/by-id/CABimMB4mRs9N3eivR-=qF9M8oWc5E6OX7GywsWF0DXN4P5gNEA@mail.gmail.com

Yeah, I'm in touch with Sandeep but not able to get in yet for some
reason. Will try to sort it out.

OK, I have access to the box now. I guess I might as well leave the
crontab jobs enabled until the next time this happens, since Thomas
just took steps to improve the logging, but I do think these BF
members are overdue to be killed off, and would like to do that as
soon as it seems like a reasonable step to take.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#16)
Re: "ERROR: latch already owned" on gharial

On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:

OK, I have access to the box now. I guess I might as well leave the
crontab jobs enabled until the next time this happens, since Thomas
just took steps to improve the logging, but I do think these BF
members are overdue to be killed off, and would like to do that as
soon as it seems like a reasonable step to take.

A couple of months later, there has been no repeat of that error. I'd
happily forget about that and move on, if you want to decommission
these.

#18Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#17)
Re: "ERROR: latch already owned" on gharial

On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:

OK, I have access to the box now. I guess I might as well leave the
crontab jobs enabled until the next time this happens, since Thomas
just took steps to improve the logging, but I do think these BF
members are overdue to be killed off, and would like to do that as
soon as it seems like a reasonable step to take.

A couple of months later, there has been no repeat of that error. I'd
happily forget about that and move on, if you want to decommission
these.

I have commented out the BF stuff in crontab on that machine.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Sandeep Thakkar
sandeep.thakkar@enterprisedb.com
In reply to: Robert Haas (#18)
Re: "ERROR: latch already owned" on gharial

Thanks Robert.

We are receiving the alerts from buildfarm-admins for anole and gharial not
reporting. Who can help to stop these? Thanks

On Wed, Jul 6, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:

OK, I have access to the box now. I guess I might as well leave the
crontab jobs enabled until the next time this happens, since Thomas
just took steps to improve the logging, but I do think these BF
members are overdue to be killed off, and would like to do that as
soon as it seems like a reasonable step to take.

A couple of months later, there has been no repeat of that error. I'd
happily forget about that and move on, if you want to decommission
these.

I have commented out the BF stuff in crontab on that machine.

--
Robert Haas
EDB: http://www.enterprisedb.com

--
Sandeep Thakkar

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Sandeep Thakkar (#19)
Re: "ERROR: latch already owned" on gharial

On 2022-Jul-13, Sandeep Thakkar wrote:

Thanks Robert.

We are receiving the alerts from buildfarm-admins for anole and gharial not
reporting. Who can help to stop these? Thanks

Probably Andrew knows how to set buildsystems.no_alerts for these
animals.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)

#21Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Alvaro Herrera (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Soumyadeep Chakraborty (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#22)
#24Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Andres Freund (#23)