dblink crash on PPC

Started by Andrew Dunstan over 14 years ago, 8 messages

andrew@dunslane.net

over 14 years ago

Something odd is happening on buildfarm member wombat, a PPC970MP box
running Gentoo. We're getting dblink test failures. On the one I looked
at more closely I saw this:

[4ddf2c59.7aec:153] LOG: disconnection: session time: 0:00:00.444 user=markwkm database=contrib_regression host=[local]

and then:

[4ddf2c4e.79d4:2] LOG: server process (PID 31468) was terminated by signal 11: Segmentation fault
[4ddf2c4e.79d4:3] LOG: terminating any other active server processes

which makes it look like something is failing badly in the backend cleanup code. (7aec = hex(31468))
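
A minimal sketch of the hex-to-PID check above. It assumes the bracketed prefix comes from a log_line_prefix of the form [%c:%l], where %c is the session ID (process start time in hex, a dot, then the PID in hex). The thread does not show wombat's configuration, and the helper name parse_session_id is hypothetical.

```python
from datetime import datetime, timezone


def parse_session_id(session_id: str):
    """Split a '%c'-style session ID into (backend start time in UTC, PID).

    Assumption: the ID is '<hex epoch seconds>.<hex pid>'.
    """
    start_hex, pid_hex = session_id.split(".")
    started = datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc)
    return started, int(pid_hex, 16)


# Session ID taken from the disconnection line quoted above.
started, pid = parse_session_id("4ddf2c59.7aec")
print(started.isoformat(), pid)
```

The PID part of 4ddf2c59.7aec decodes to 31468, the same PID named in the "terminated by signal 11" line. The 79d4 prefix on the other two lines belongs to a different process, presumably the postmaster that reported the crash.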

We don't seem to have a backtrace, which is sad.

This seems to be happening on the 9.0 branch too.

I wonder what it could be?

cheers

andrew

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Andrew Dunstan (#1)

Re: dblink crash on PPC

On Fri, May 27, 2011 at 8:44 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

Something odd is happening on buildfarm member wombat, a PPC970MP box
running Gentoo. We're getting dblink test failures. On the one I looked at
more closely I saw this:

[4ddf2c59.7aec:153] LOG: disconnection: session time: 0:00:00.444
user=markwkm database=contrib_regression host=[local]

and then:

[4ddf2c4e.79d4:2] LOG: server process (PID 31468) was terminated by signal
11: Segmentation fault
[4ddf2c4e.79d4:3] LOG: terminating any other active server processes

which makes it look like something is failing badly in the backend cleanup
code. (7aec = hex(31468))

We don't seem to have a backtrace, which is sad.

This seems to be happening on the 9.0 branch too.

I wonder what it could be?

Around when did it start failing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 14 years ago

In reply to: Robert Haas (#2)

Re: dblink crash on PPC

Robert Haas <robertmhaas@gmail.com> wrote:

Andrew Dunstan <andrew@dunslane.net> wrote:

Something odd is happening on buildfarm member wombat, a PPC970MP
box running Gentoo. We're getting dblink test failures. On the
one I looked at more closely I saw this:

[4ddf2c59.7aec:153] LOG: disconnection: session time:
0:00:00.444
user=markwkm database=contrib_regression host=[local]

and then:

[4ddf2c4e.79d4:2] LOG: server process (PID 31468) was terminated
by signal 11: Segmentation fault
[4ddf2c4e.79d4:3] LOG: terminating any other active server
processes

which makes it look like something is failing badly in the
backend cleanup code. (7aec = hex(31468))

We don't seem to have a backtrace, which is sad.

This seems to be happening on the 9.0 branch too.

I wonder what it could be?

Around when did it start failing?

According to the buildfarm logs the first failure was roughly 1 day
10 hours 40 minutes before this post.

Keep in mind that PPC is a platform with weak memory ordering....

-Kevin

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Kevin Grittner (#3)

Re: dblink crash on PPC

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Robert Haas <robertmhaas@gmail.com> wrote:

Around when did it start failing?

According to the buildfarm logs the first failure was roughly 1 day
10 hours 40 minutes before this post.

See
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=wombat&br=HEAD

The problem here is that wombat had been offline for about a month
before that, so it could have broken anytime in the past month.
It's also not unlikely that the hiatus signals a change in the
underlying hardware or software, which might have been the real
cause. (Mark?)

Keep in mind that PPC is a platform with weak memory ordering....

grebe, which is also a PPC64 machine, isn't showing the bug. And I just
failed to reproduce the problem on a RHEL6 PPC64 box. About to go try
it on RHEL5, which has a gcc version much closer to what wombat says
it's using, but I'm not very hopeful about that. I think the more
likely thing to be keeping in mind is that Gentoo is a platform with
poor quality control.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Tom Lane (#4)

Re: dblink crash on PPC

I wrote:

grebe, which is also a PPC64 machine, isn't showing the bug. And I just
failed to reproduce the problem on a RHEL6 PPC64 box. About to go try
it on RHEL5, which has a gcc version much closer to what wombat says
it's using, but I'm not very hopeful about that.

Nope, no luck there either. It's going to be hard to make any progress
on this without investigation on wombat itself.

regards, tom lane

Steve Singer

ssinger@ca.afilias.info

over 14 years ago

In reply to: Tom Lane (#4)

Re: dblink crash on PPC

On 11-05-27 12:35 PM, Tom Lane wrote:

grebe, which is also a PPC64 machine, isn't showing the bug. And I just
failed to reproduce the problem on a RHEL6 PPC64 box. About to go try
it on RHEL5, which has a gcc version much closer to what wombat says
it's using, but I'm not very hopeful about that. I think the more
likely thing to be keeping in mind is that Gentoo is a platform with
poor quality control.

regards, tom lane

As another data point, the dblink regression tests work fine for me on a
PPC32 Debian-based system (squeeze, gcc 4.4.5).

Greg Stark

gsstark@mit.edu

over 14 years ago

In reply to: Steve Singer (#6)

Re: dblink crash on PPC

On Fri, May 27, 2011 at 10:06 AM, Steve Singer <ssinger@ca.afilias.info> wrote:

As another data point, the dblink regression tests work fine for me on a
PPC32 Debian-based system (squeeze, gcc 4.4.5).

Given that it's dblink my guess is that it's picking up the wrong
version of libpq somehow.

--
greg

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Greg Stark (#7)

Re: dblink crash on PPC

Greg Stark <gsstark@mit.edu> writes:

On Fri, May 27, 2011 at 10:06 AM, Steve Singer <ssinger@ca.afilias.info> wrote:

As another data point, the dblink regression tests work fine for me on a
PPC32 Debian-based system (squeeze, gcc 4.4.5).

Given that it's dblink my guess is that it's picking up the wrong
version of libpq somehow.

Maybe, but then why does the test only crash during backend exit, and
not while it's exercising dblink?

regards, tom lane