gcc versus division-by-zero traps

Started by Tom Laneover 16 years ago4 messages
#1Tom Lane
tgl@sss.pgh.pa.us

We have seen several previous reports of regression test failures
due to division by zero causing SIGFPE, even though the code
should never reach the division command:

http://archives.postgresql.org/pgsql-bugs/2006-11/msg00180.php
http://archives.postgresql.org/pgsql-bugs/2007-11/msg00032.php
http://archives.postgresql.org/pgsql-bugs/2008-05/msg00148.php
http://archives.postgresql.org/pgsql-general/2009-05/msg00774.php

It's always been on non-mainstream architectures so it was hard
to investigate. But I have finally been able to reproduce this:
https://bugzilla.redhat.com/show_bug.cgi?id=520916

While s390x is still not quite mainstream, at least I can get
access to one ;-). What turns out to be the case is that
"simple" test cases like
if (y == 0)
single_function_call(...);
z = x / y;
do not show the problem; you need something pretty complex in the
if-command. Like, say, an ereport() construct. So that's why the gcc
boys haven't already had visits from mobs of villagers about this.

I hope that the bug will get fixed in due course, but even if they
respond pretty quickly it will be years before the problem disappears
from every copy of gcc in the field. So I'm thinking that it would
behoove us to install a workaround, now that we've characterized the
problem sufficiently. What I am thinking is that in the three
functions known to exhibit the bug (int24div, int28div, int48div)
we should do something like this:

	if (arg2 == 0)
+	{
		ereport(ERROR,
				(errcode(ERRCODE_DIVISION_BY_ZERO),
				 errmsg("division by zero")));
+		/* ensure compiler realizes we don't reach the division */
+		PG_RETURN_NULL();
+	}
	/* No overflow is possible */
	PG_RETURN_INT64((int64) arg1 / arg2);

Thoughts?

regards, tom lane

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: gcc versus division-by-zero traps

David Fetter <david@fetter.org> writes:

On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:

While s390x is still not quite mainstream, at least I can get
access to one ;-).

Do you also have access to z/OS with Unix System Services?

No, Red Hat's machines run RHEL ;-)

What I am thinking is that in the three
functions known to exhibit the bug (int24div, int28div, int48div)
we should do something like this:

How big would this change be? How would people know to use that
construct everywhere it's appropriate?

I'm talking about patching exactly those three functions. We don't
have any reports of trouble elsewhere. The long-term fix is in the
compiler anyway, this is just a workaround for currently-known issues.

Part of my motivation for this is to get rid of an existing hack in
the Red Hat RPMs:

# use -O1 on sparc64 and alpha
%ifarch sparc64 alpha
CFLAGS=`echo $CFLAGS| sed -e "s|-O2|-O1|g" `
%endif

I believe that that is only there to prevent exactly this problem.
Anyway I'd like to try removing it after patching as proposed, and
then we'll find out if there are other trouble spots ...

regards, tom lane

#3David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
Re: gcc versus division-by-zero traps

On Thu, Sep 03, 2009 at 01:26:52PM -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:

While s390x is still not quite mainstream, at least I can get
access to one ;-).

Do you also have access to z/OS with Unix System Services?

No, Red Hat's machines run RHEL ;-)

I'm given to understand it's possible to run both on the same
hardware.

Cheers,
David
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#4Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#1)
Re: gcc versus division-by-zero traps

Tom Lane wrote:

I hope that the bug will get fixed in due course, but even if they
respond pretty quickly it will be years before the problem disappears
from every copy of gcc in the field. So I'm thinking that it would
behoove us to install a workaround, now that we've characterized the
problem sufficiently. What I am thinking is that in the three
functions known to exhibit the bug (int24div, int28div, int48div)
we should do something like this:

if (arg2 == 0)
+	{
ereport(ERROR,
(errcode(ERRCODE_DIVISION_BY_ZERO),
errmsg("division by zero")));
+		/* ensure compiler realizes we don't reach the division */
+		PG_RETURN_NULL();
+	}

Could we document this a little better, perhaps refer to a file
somewhere or a central comment on its purpose?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +