BUG #15378: SP-GIST memory context screwup?
The following bug has been logged on the website:
Bug reference: 15378
Logged by: Andrew Gierth
Email address: andrew@tao11.riddles.org.uk
PostgreSQL version: 11beta3
Operating system: Debian
Description:
I found this while analyzing a report from IRC that initially looked like a
PostGIS bug, but which I now think is breakage in spgist:
spgrescan starts out by doing
MemoryContextReset(so->traversalCxt);
then later it calls resetSpGistScanOpaque(so);
which calls freeScanStack(so)
which calls freeScanStackEntry(so)
which does:
if (stackEntry->traversalValue)
pfree(stackEntry->traversalValue);
But stackEntry->traversalValue, if not NULL, is supposed to have been
allocated in so->traversalCxt, and so it's already gone.
The specific case we were looking at involved a query using a LATERAL
subselect with a LIMIT 1, with conditions that would have used the spgist
index (on a PostGIS geometry column). The PostGIS code seems to be correctly
allocating the traversalValues in the correct context, but we were getting
this crash (on pg11 beta3):
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000561e38427e1c in freeScanStackEntry (stackEntry=0x561e396ef718,
so=<optimized out>) at ./build/../src/backend/access/spgist/spgscan.c:47
#2 0x0000561e38428a6e in freeScanStack (so=0x561e396ed508) at
./build/../src/backend/access/spgist/spgscan.c:60
#3 resetSpGistScanOpaque (so=0x561e396ed508) at
./build/../src/backend/access/spgist/spgscan.c:75
#4 spgrescan (scan=<optimized out>, scankey=0x561e3969eab8,
nscankeys=<optimized out>, orderbys=<optimized out>, norderbys=<optimized
out>)
at ./build/../src/backend/access/spgist/spgscan.c:229
which is consistent with a bad pfree(), I think.
The original reporter's data set is proprietary, but this is the query
that died:
SELECT b.idhu, layer, max(dba)
FROM (SELECT b."IDHU" AS idhu, first(geom) AS geom
FROM do_buildings b JOIN do_data d ON b."IDHU"::text =
d."IDHU"::text
GROUP BY 1 LIMIT 100) b
CROSS JOIN unnest('{str_isof_den}'::text[]) AS lyr
CROSS JOIN LATERAL (
SELECT idhu, layer, dba
FROM do_laerm l
WHERE layer =lyr AND (b.geom && l.geom AND ST_Intersects(l.geom,
b.geom))
LIMIT 1
) AS target
GROUP by 1, 2;
Note that do_laerm(geom) has an SP-GIST index, and the LATERAL placement
means that this is going to be rescanned at a fairly arbitrary point in the
scan (due to the LIMIT). Unfortunately I don't think this can be
demonstrated with the built-in spgist opclasses, which don't allocate
traversalValues.
I /think/ the offending pfree could just be removed...
[CCing Teodor as committer of ccd6eb49a]
"PG" == PG Bug reporting form <noreply@postgresql.org> writes:
PG> I found this while analyzing a report from IRC that initially
PG> looked like a PostGIS bug, but which I now think is breakage in
PG> spgist:
PG> spgrescan starts out by doing
PG> MemoryContextReset(so->traversalCxt);
PG> then later it calls resetSpGistScanOpaque(so);
PG> which calls freeScanStack(so)
PG> which calls freeScanStackEntry(so)
PG> which does:
PG> if (stackEntry->traversalValue)
PG> pfree(stackEntry->traversalValue);
PG> But stackEntry->traversalValue, if not NULL, is supposed to have
PG> been allocated in so->traversalCxt, and so it's already gone.
[...]
PG> Unfortunately I don't think this can be demonstrated with the
PG> built-in spgist opclasses, which don't allocate traversalValues.
Turns out I was looking in the wrong place, and in fact we can reproduce
this easily (at least on an assert build):
create table boxes (b box);
insert into boxes
select box(point(i,j),point(i+s,j+s))
from generate_series(1,100,5) i,
generate_series(1,100,5) j,
generate_series(1,10) s;
create index on boxes using spgist (b);
select *
from (values (box(point(5,5),point(8,8))),(box(point(2,2),point(12,12)))) v(b)
cross join lateral (select * from boxes where boxes.b && v.b limit 1) pt;
So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix the
issue; any objections?
--
Andrew (irc:RhodiumToad)
On Tue, Sep 11, 2018 at 6:13 AM Andrew Gierth <andrew@tao11.riddles.org.uk>
wrote:
[CCing Teodor as committer of ccd6eb49a]
"PG" == PG Bug reporting form <noreply@postgresql.org> writes:
PG> I found this while analyzing a report from IRC that initially
PG> looked like a PostGIS bug, but which I now think is breakage in
PG> spgist:PG> spgrescan starts out by doing
PG> MemoryContextReset(so->traversalCxt);PG> then later it calls resetSpGistScanOpaque(so);
PG> which calls freeScanStack(so)
PG> which calls freeScanStackEntry(so)
PG> which does:PG> if (stackEntry->traversalValue)
PG> pfree(stackEntry->traversalValue);PG> But stackEntry->traversalValue, if not NULL, is supposed to have
PG> been allocated in so->traversalCxt, and so it's already gone.
[...]
PG> Unfortunately I don't think this can be demonstrated with the
PG> built-in spgist opclasses, which don't allocate traversalValues.Turns out I was looking in the wrong place, and in fact we can reproduce
this easily (at least on an assert build):create table boxes (b box);
insert into boxes
select box(point(i,j),point(i+s,j+s))
from generate_series(1,100,5) i,
generate_series(1,100,5) j,
generate_series(1,10) s;
create index on boxes using spgist (b);
select *
from (values
(box(point(5,5),point(8,8))),(box(point(2,2),point(12,12)))) v(b)
cross join lateral (select * from boxes where boxes.b && v.b limit
1) pt;So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix the
issue; any objections?
No objections from me. What about
moving MemoryContextReset(so->traversalCxt) into resetSpGistScanOpaque()?
For me it seems that resetting of traversal memory context is part of
opaque reset.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
"Alexander" == Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix
the issue; any objections?
Alexander> No objections from me. What about moving
Alexander> MemoryContextReset(so->traversalCxt) into
Alexander> resetSpGistScanOpaque()? For me it seems that resetting of
Alexander> traversal memory context is part of opaque reset.
Looking at all those retail pfrees makes me itch... is there some reason
why reconstructedValue couldn't go in the traversal context too, and
then the list nodes could go there as well, and the whole thing freed
with a context reset?
--
Andrew (irc:RhodiumToad)
"Alexander" == Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix
the issue; any objections?
Alexander> No objections from me.
But it turns out that removing the pfree will cause transient leakage
within the scan, since ScanStackEntry objects are also freed retail
during a walk. ugh.
So the simplest fix would be to move the memory context reset to just
after the freeScanStack in resetSpGistScanOpaque. And with retail
freeing going on during the scan, it makes less sense to try and avoid
it during rescan, double ugh.
--
Andrew (irc:RhodiumToad)
"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Andrew> So the simplest fix would be to move the memory context reset
Andrew> to just after the freeScanStack in resetSpGistScanOpaque. And
Andrew> with retail freeing going on during the scan, it makes less
Andrew> sense to try and avoid it during rescan, double ugh.
Proposed patch attached.
--
Andrew (irc:RhodiumToad)
Attachments:
spgmem.patchtext/x-patchDownload+40-3
"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Andrew> Proposed patch attached.
Pushed this.
--
Andrew (irc:RhodiumToad)
On Tue, Sep 11, 2018 at 6:47 PM Andrew Gierth <andrew@tao11.riddles.org.uk>
wrote:
But it turns out that removing the pfree will cause transient leakage
within the scan, since ScanStackEntry objects are also freed retail
during a walk. ugh.
Right, I've missed code path freeScanStackEntry() call from spgWalk().
On Tue, Sep 11, 2018 at 9:59 PM Andrew Gierth <andrew@tao11.riddles.org.uk>
wrote:
Andrew> Proposed patch attached.
Pushed this.
Pushed version looks correct to me. Thank you!
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company