Buildfarm failures for hash indexes: buffer leaks
Hi all,
Looking at the buildfarm, serinus and moonjelly have just complained
about the failure of $subject:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=moonjelly&dt=2018-10-20%2015%3A17%3A02
The first failure is unrelated to the involved commits, as they touched
completely different areas of the code:
INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
+ WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
And versions older than HEAD do not complain.
Any thoughts?
--
Michael
Hello Michaël,
The first failure is unrelated to the involved commits, as they touched
completely different areas of the code:
INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
+ WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
And versions older than HEAD do not complain.
Any thoughts?
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
--
Fabien.
On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Michaël,
The first failure is unrelated to the involved commits, as they touched
completely different areas of the code:
INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
+ WARNING: buffer refcount leak: [6481] (rel=base/16384/32349, blockNum=156, flags=0x93800000, refcount=1 1)
And versions older than HEAD do not complain.
Any thoughts?
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
It is possible, but what could be the possible theory? The warning
indicates that somewhere we forgot to call ReleaseBuffer. Today, I
reviewed the hash index code related to the failing test case but didn't
find any obvious problem. What should be our next step? Do we want to
change the gcc version and see?
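To make the failure mode concrete: every pin taken with ReadBuffer()
must be dropped with ReleaseBuffer() before end of transaction, and
assert-enabled builds (--enable-cassert) scan the remaining private
refcounts at transaction end, emitting exactly this "buffer refcount
leak" warning for each pin still held. A minimal sketch (the function
below is hypothetical, not the failing hash index path):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    scan_one_block(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);   /* takes a pin */

        LockBuffer(buf, BUFFER_LOCK_SHARE);
        /* ... examine the page contents here ... */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);

        /*
         * Omitting this call leaks the pin, and the leak check at
         * transaction end reports it with a warning like the one above.
         */
        ReleaseBuffer(buf);
    }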
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Mon, Oct 22, 2018 at 1:42 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
It is possible, but what could be the possible theory?
It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition. Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.
regards, tom lane
Hello Tom & Amit,
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
It is possible, but what could be the possible theory?
It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition. Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.
Yep, that is my thought as well.
The problem is that this kind of issue is not simple to wrap up as a gcc
bug report, unlike other earlier instances that I forwarded to clang & gcc
dev teams.
I'm in favor of waiting before trying to report it, to check whether the
probable underlying gcc problem is detected, reported by someone else, and
fixed in gcc head. If it persists, then we'll see.
--
Fabien.
On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
Hello Tom & Amit,
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
It is possible, but what could be the possible theory?
It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition. Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.
Yep, that is my thought as well.
FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the same problem:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
So it seems much more likely to be (1).
The problem is that this kind of issue is not simple to wrap up as a gcc bug
report, unlike other earlier instances that I forwarded to clang & gcc dev
teams.
I'm in favor of waiting before trying to report it, to check whether the
probable underlying gcc problem is detected, reported by someone else, and
fixed in gcc head. If it persists, then we'll see.
I suspect the easiest thing to narrow it down would be to bisect the
problem in gcc :(
Greetings,
Andres Freund
On Tue, Oct 23, 2018 at 07:50:57AM -0700, Andres Freund wrote:
FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the same problem:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
So it seems much more likely to be (1).
Thanks all for looking at it. Indeed it looks like a compiler issue
here.
--
Michael
On Tue, Oct 23, 2018 at 10:51 AM Andres Freund <andres@anarazel.de> wrote:
On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote:
Hello Tom & Amit,
Both animals use gcc experimental versions, which may point to a new
bug in gcc head rather than an existing issue in pg. Or not.
It is possible, but what could be the possible theory?
It seems like the two feasible theories are (1) gcc bug, or (2) buffer
leak that only occurs in very narrow circumstances, perhaps from a race
condition. Given that the hash index code hasn't changed meaningfully
in several months, I thought (1) seemed more probable.
Yep, that is my thought as well.
FWIW, my animal 'serinus', which runs debian's gcc-snapshot, shows the same
problem:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02
So it seems much more likely to be (1).
The problem is that this kind of issue is not simple to wrap up as a gcc
bug report, unlike other earlier instances that I forwarded to clang & gcc
dev teams.
I'm in favor of waiting before trying to report it, to check whether the
probable underlying gcc problem is detected, reported by someone else, and
fixed in gcc head. If it persists, then we'll see.
I suspect the easiest thing to narrow it down would be to bisect the
problem in gcc :(
Their commit r265241 is what broke the PostgreSQL build. It also broke the
compiler: at that commit, gcc could no longer build itself.
I had to --disable-bootstrap in order to get an r265241 compiler to test
PostgreSQL on.
Their commit r265375 fixed the compiler's ability to build itself, but
PostgreSQL binaries built with it remain broken there and thereafter.
I ran this on an AWS c5d.4xlarge,
ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912
(ami-0f65671a86f061fcd).
Configuring PostgreSQL with ./configure --enable-cassert is necessary in
order for the subsequent "make check" to experience the failure.
I'm using PostgreSQL v12dev commit 5953c99697621174f, but the problem is
not sensitive to that and reproduces back to v10.0.
I haven't the slightest idea what to do with this information, and GCC's
cryptic SVN commit messages don't offer much insight to me.
Cheers,
Jeff
Hello Jeff,
I suspect the easiest thing to narrow it down would be to bisect the
problem in gcc :(
Their commit r265241 is what broke the PostgreSQL build. It also broke the
compiler: at that commit, gcc could no longer build itself.
I had to --disable-bootstrap in order to get an r265241 compiler to test
PostgreSQL on.
It seems they made an API change around some kind of "range" analysis,
which must have been incomplete.
Their commit r265375 fixed the compiler's ability to build itself, but
PostgreSQL binaries built with it remain broken there and thereafter. [...]
Thanks a lot for this investigation! I can file a gcc bug report. It
would be enormous work to narrow it down to a small test case, and it is
unclear how they could act on it, but at least they would know.
--
Fabien.
On 2018-10-27 08:18:12 +0200, Fabien COELHO wrote:
Hello Jeff,
I suspect the easiest thing to narrow it down would be to bisect the
problem in gcc :(
Their commit r265241 is what broke the PostgreSQL build. It also broke the
compiler: at that commit, gcc could no longer build itself.
I had to --disable-bootstrap in order to get an r265241 compiler to test
PostgreSQL on.
It seems they made an API change around some kind of "range" analysis,
which must have been incomplete.
Their commit r265375 fixed the compiler's ability to build itself, but
PostgreSQL binaries built with it remain broken there and thereafter. [...]
Thanks a lot for this investigation! I can file a gcc bug report. It
would be enormous work to narrow it down to a small test case, and it is
unclear how they could act on it, but at least they would know.
Have you done so? If so, what's the bug number?
Greetings,
Andres Freund
Their commit r265375 fixed the compiler's ability to build itself, but
PostgreSQL binaries built with it remain broken there and thereafter. [...]
Thanks a lot for this investigation! I can file a gcc bug report. It
would be enormous work to narrow it down to a small test case, and it is
unclear how they could act on it, but at least they would know.
Have you done so? If so, what's the bug number?
Not yet. There was no answer to my suggestion, so it did not feel that
urgent. I'll try to do it next week.
--
Fabien.
Hi,
On 2018-11-01 22:52:19 +0100, Fabien COELHO wrote:
Their commit r265375 fixed the compiler's ability to build itself, but
PostgreSQL binaries built with it remain broken there and thereafter. [...]
Thanks a lot for this investigation! I can file a gcc bug report. It
would be enormous work to narrow it down to a small test case, and it is
unclear how they could act on it, but at least they would know.
Have you done so? If so, what's the bug number?
Not yet. There was no answer to my suggestion, so it did not feel that
urgent. I'll try to do it next week.
FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
Yeah, I saw that moonjelly had gone green. Are you going to update
serinus' compiler soon?
(And while I'm bugging you, would you fix your buildfarm menagerie
to build REL_11_STABLE?)
regards, tom lane
Hi,
On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
Yeah, I saw that moonjelly had gone green. Are you going to update
serinus' compiler soon?
I'm using debian's gcc-snapshot package, which updates a bit less
frequently than what Fabien does. It'll auto-update as soon as that has
happened.
What disconcerts me a bit is that afaict there wasn't an intentional fix
in the relevant area (disabling value range propagation fixes the
miscompilation), so I'm not quite sure.
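For context, value range propagation lets the compiler derive bounds for
variables and delete branches it can prove dead. A schematic example,
deliberately unrelated to the actual miscompiled PostgreSQL code:

    int
    f(int x)
    {
        if (x >= 0 && x < 10)
        {
            /*
             * VRP knows x is in [0, 9] here, so it proves this test
             * false and deletes the branch. Buggy range bookkeeping
             * can just as easily delete code that is reachable.
             */
            if (x == 42)
                return -1;
            return x + 1;
        }
        return 0;
    }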
(And while I'm bugging you, would you fix your buildfarm menagerie
to build REL_11_STABLE?)
Oh, hum. I thought I had. But I did it only for the LLVM-enabled
animals. Results should be coming up soon.
Greetings,
Andres Freund
On 2018-11-19 14:38:04 -0800, Andres Freund wrote:
Hi,
On 2018-11-19 17:32:37 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
FWIW, it seems that gcc's trunk works again. But I'm not sure this isn't
just an accident and the optimizations introduced in the above revision
aren't still broken.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=moonjelly&br=HEAD
Yeah, I saw that moonjelly had gone green. Are you going to update
serinus' compiler soon?
I'm using debian's gcc-snapshot package, which updates a bit less
frequently than what Fabien does. It'll auto-update as soon as that has
happened.
What disconcerts me a bit is that afaict there wasn't an intentional fix
in the relevant area (disabling value range propagation fixes the
miscompilation), so I'm not quite sure.
Ooops, somehow deleted too much when editing the sentence. I was trying
to say that I'm not sure whether this is actually fixed, or a bug was
just hidden.
The issue is "fixed" in:
https://github.com/gcc-mirror/gcc/commit/48625f587d786089ce1e245f192bcf706f8fc1fc
and had been "caused" (or surfaced) in:
https://github.com/gcc-mirror/gcc/commit/be44111ed7ea7611d324d4d570e2f2d380f784b4
Greetings,
Andres Freund