BRIN index creation on geometry column causes crash

Started by Tobias Wendorffover 1 year ago6 messagesbugs
Jump to latest
#1Tobias Wendorff
tobias.wendorff@tu-dortmund.de

I'm reporting a server crash that occurs when creating a BRIN index on a
geometry column in PostgreSQL 17.2.

### Steps to reproduce
1. Set up PostgreSQL 17.2 with PostGIS 3.5.1 extension
2. Execute the following SQL commands:

```sql
DROP TABLE IF EXISTS random_points;
CREATE TABLE random_points AS
SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561);
CREATE INDEX ON random_points USING brin(geom);
```

### Expected behavior
The BRIN index should be created successfully without server termination.

### Actual behavior
The server terminates abruptly during index creation with the following
error:
```
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
```

### SELECT version();
```
PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu,
compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
```

### SELECT PostGIS_Full_Version();
```
POSTGIS="3.5.1 48ab069" [EXTENSION] PGSQL="170"
GEOS="3.11.1-CAPI-1.17.1" SFCGAL="SFCGAL 1.4.1, CGAL 5.5.1, BOOST
1.74.0" PROJ="9.1.1 NETWORK_ENABLED=OFF
URL_ENDPOINT=https://cdn.proj.org USER_WRITABLE_DIRECTORY=/tmp/proj
DATABASE_PATH=/usr/share/proj/proj.db" (compiled against PROJ 9.1.1)
LIBXML="2.9.14" LIBJSON="0.16" LIBPROTOBUF="1.4.1" WAGYU="0.5.0
(Internal)" TOPOLOGY
```

### Trace
IRC user "GrayShade" was so nice to reproduce the problem and to create
a trace (see attachment). He also told me that this information might be
important:
```
fcinfo is {flinfo = 0x0, context = 0x0, resultinfo = 0x0, fncollation =
0, isnull = false, nargs = 2, args = 0x7ffeec862140}
```

### Additional notes
- The crash is consistently reproducible with the provided test case.
- 130_561 seems to be the sweet spot. Try with 1E6 or higher, not lower.
- Default PostgreSQL configuration settings are being used.
- The server log is not available in the current session due to the crash.

Please let me know if you need any additional information or server logs
to investigate this issue.

Attachments:

trace.txttext/plain; charset=UTF-8; name=trace.txtDownload
#2David Rowley
dgrowleyml@gmail.com
In reply to: Tobias Wendorff (#1)
Re: BRIN index creation on geometry column causes crash

On Fri, 3 Jan 2025 at 11:02, Tobias Wendorff
<tobias.wendorff@tu-dortmund.de> wrote:

```sql
DROP TABLE IF EXISTS random_points;
CREATE TABLE random_points AS
SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561);
CREATE INDEX ON random_points USING brin(geom);
```

### Actual behavior
The server terminates abruptly during index creation with the following
error:
```
server closed the connection unexpectedly

Thanks for the report. That's certainly a bug.

What seems to be going on is that the following code in
brin_inclusion_union() looks for the function for the PROCNUM_MERGE
strategy but cannot find it, and not finding it shouldn't result in a
crash. The code does:

finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
Assert(finfo != NULL);

I built with Asserts enabled and that Assert triggers for me.

The other strategy types seem to check for NULLs with:

finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGEABLE);
if (finfo != NULL &&

It seems, going by the following comment that PROCNUM_MERGE isn't
optional, which probably explains about the lack of NULL check for
that strategy, however, I don't see any code anywhere which checks to
make sure it exists before we get to this point.

#define PROCNUM_MERGE 11 /* required */

I see PostGIS sets this up with the following:

CREATE OPERATOR CLASS brin_geography_inclusion_ops
DEFAULT FOR TYPE geography
USING brin AS
FUNCTION 1 brin_inclusion_opcinfo(internal),
FUNCTION 2 geog_brin_inclusion_add_value(internal,
internal, internal, internal),
FUNCTION 3 brin_inclusion_consistent(internal,
internal, internal),
FUNCTION 4 brin_inclusion_union(internal, internal, internal),
OPERATOR 3 &&(geography, geography),
OPERATOR 3 &&(geography, gidx),
OPERATOR 3 &&(gidx, geography),
OPERATOR 3 &&(gidx, gidx),
STORAGE gidx;

I've copied in Álvaro, as he knows this code much better than I do.

David

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tobias Wendorff (#1)
Re: BRIN index creation on geometry column causes crash

On 1/2/25 23:02, Tobias Wendorff wrote:

I'm reporting a server crash that occurs when creating a BRIN index on a
geometry column in PostgreSQL 17.2.

### Steps to reproduce
1. Set up PostgreSQL 17.2 with PostGIS 3.5.1 extension
2. Execute the following SQL commands:

```sql
DROP TABLE IF EXISTS random_points;
CREATE TABLE random_points AS
SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561);
CREATE INDEX ON random_points USING brin(geom);
```

Reproduced, but I belive this is actually a long-standing PostGIS bug.
The exact place where it crashes is here:

/* Finally, merge B to A. */
finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
Assert(finfo != NULL);
result = FunctionCall2Coll(finfo, colloid, ...);

With asserts, it fails on the assert, i.e. finfo is NULL. This means the
opfamily is missing the "MERGE" support procedure, which is however
required (from the very beginning of BRIN in 9.5).

This is supported by amvalidate():

test=# select * from pg_opclass where opcname =
'brin_geometry_inclusion_ops_2d';
oid | opcmethod | opcname | opcnamespace |
opcowner | opcfamily | opcintype | opcdefault | opckeytype
-------+-----------+--------------------------------+--------------+----------+-----------+-----------+------------+------------
17327 | 3580 | brin_geometry_inclusion_ops_2d | 2200 |
10 | 17326 | 16395 | t | 16430
(1 row)

test=# select amvalidate(17327);
INFO: operator family "brin_geometry_inclusion_ops_2d" of access method
brin is missing support function(s) for types box2df and box2df
amvalidate
------------
f
(1 row)

The reason why serial builds work is that the opclass does not add
values through the MERGE procedure, because it defines the procedures
like this:

test=# select amprocnum, amproc from pg_amproc where amprocfamily = 17326;
amprocnum | amproc
-----------+---------------------------------
1 | brin_inclusion_opcinfo
2 | geom2d_brin_inclusion_add_value
3 | brin_inclusion_consistent
4 | brin_inclusion_union
(4 rows)

So it uses a custom geom2d_brin_inclusion_add_value, instead of the
usual brin_inclusion_add_value, simply does the "merge" ad-hoc. Which is
a bit unorthodox (the "traditional way" would be to keep the usual
add_value proc, and add the type-specific logic in the "merge").

The problem however is that it keeps the brin_inclusion_union, which
also uses the MERGE procedure. And so it crashes.

This means even without the parallel builds this opclass is likely
broken, because as soon as it invokes the _union, it crashes in exactly
the same way. And it's not that difficult to hit, really. All it takes
is roughly this with two sessions:

S1: desummarize range 0

SELECT brin_desummarize_range('random_points_geom_idx', 0);

S1: summarize range 0, but break on brin_can_do_samepage_update (in
summarize_range)

SELECT brin_summarize_range('random_points_geom_idx', 0);

S2: insert a tuple into range 0

INSERT INTO random_points SELECT ST_MakePoint(1000, 1000);

S1: continue execution

KABOOOM!

So I think this is a bug in the opclass, which pretends to be an
inclusion opclass, but also tries not to be - and it ends up with an
inconsistent set of procedures. It needs to make it's mind and either
override (at least) the _union too, or add the MERGE required by
inclusion ops.

FWIW this is not the only opclass with this issue - I get a whole bunch
of similar failures (attached).

I belive this needs to be reported to PostGIS. It's a bit too late for
me, so I'll do that tomorrow, unless someone beats me to it.

regards

--
Tomas Vondra

Attachments:

amvalidate-failures.txttext/plain; charset=UTF-8; name=amvalidate-failures.txtDownload
#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: BRIN index creation on geometry column causes crash

On 2025-Jan-03, Tomas Vondra wrote:

Reproduced, but I belive this is actually a long-standing PostGIS bug.
The exact place where it crashes is here:

/* Finally, merge B to A. */
finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
Assert(finfo != NULL);
result = FunctionCall2Coll(finfo, colloid, ...);

With asserts, it fails on the assert, i.e. finfo is NULL. This means the
opfamily is missing the "MERGE" support procedure, which is however
required (from the very beginning of BRIN in 9.5).

Hmm, I completely agree that this is a PostGIS bug, but at the same time
I think it's wrong for Postgres not to verify that the support function
exists, so that we can throw an error instead of crashing. I think the
most expedient way to implement this is to add a "missing_ok" argument
to the various "foo_get_procinfo" functions, and make to pass it true
only in cases where the usage of the return value already admit that it
could be a null pointer (meaning it's really an optional support proc).

This should avoid further crashes with invalidly defined opclasses; see
attached POC patch. The user is still in a bad place, because for
instance if it's autovacuum that tries to do the range merging, then
vacuuming will fail but people may not notice for months unless they're
paying careful attention to the server logs. Crashing is good (tm)
because now a bunch of people are aware that the opclass is broken.

My next question is why didn't brinvalidate detect this problem. Is it
just that nobody ran 'amvalidate' on the opclass, or is it that
brinvalidate does not know that it needs to require the merge proc? But
I'll leave that for later.

Now, in writing this patch I noticed that both brin_bloom.c and
brin_minmax_multi.c seem to have cargo-culted the idea that support
procs can be optional; both of them have a single support proc
(PROCNUM_HASH and PROCNUM_DISTANCE, respectively) which is not optional;
so the "extra_proc_missing" stuff seems unnecessary for them. And
therefore the new missing_ok support I introduce in this patch would
also be unnecessary. As far as I can tell, there are no ABI concerns
about removing opaque->extra_proc_missing[] from both these opclass
scaffolds (and reintroducing it later, if we determine that we do need
optional support procs in those scaffolds after all.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)

Attachments:

v1-0001-BRIN-be-stricter-about-required-support-procs.patchtext/x-diff; charset=utf-8Download+51-20
#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#4)
Re: BRIN index creation on geometry column causes crash

On 2/12/25 11:55, Álvaro Herrera wrote:

On 2025-Jan-03, Tomas Vondra wrote:

Reproduced, but I belive this is actually a long-standing PostGIS bug.
The exact place where it crashes is here:

/* Finally, merge B to A. */
finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
Assert(finfo != NULL);
result = FunctionCall2Coll(finfo, colloid, ...);

With asserts, it fails on the assert, i.e. finfo is NULL. This means the
opfamily is missing the "MERGE" support procedure, which is however
required (from the very beginning of BRIN in 9.5).

Hmm, I completely agree that this is a PostGIS bug, but at the same time
I think it's wrong for Postgres not to verify that the support function
exists, so that we can throw an error instead of crashing. I think the
most expedient way to implement this is to add a "missing_ok" argument
to the various "foo_get_procinfo" functions, and make to pass it true
only in cases where the usage of the return value already admit that it
could be a null pointer (meaning it's really an optional support proc).

This should avoid further crashes with invalidly defined opclasses; see
attached POC patch. The user is still in a bad place, because for
instance if it's autovacuum that tries to do the range merging, then
vacuuming will fail but people may not notice for months unless they're
paying careful attention to the server logs. Crashing is good (tm)
because now a bunch of people are aware that the opclass is broken.

Yeah, I'm not opposed to doing this. I don't know what's the right level
of paranoia when dealing with user-defined stuff.

But if I get this right, this is merely a mitigation turning a crash
into ERROR, it would not help with detecting the opclass brokenness any
earlier. Until the parallel builds it was very rare to need the MERGE.

My next question is why didn't brinvalidate detect this problem. Is it
just that nobody ran 'amvalidate' on the opclass, or is it that
brinvalidate does not know that it needs to require the merge proc? But
I'll leave that for later.

I don't know. Either no one tried running it, or maybe it wasn't quite
clear which of the reported issues are serious and which can be ignored.
I certainly wasn't quite clear to me when I tried running it, because
all the messages are INFO, it doesn't say which functions are missing,
and at least some of it seemed to be due to cross-type operators.

See this for details:

https://lists.osgeo.org/pipermail/postgis-devel/2025-January/030457.html

Now, in writing this patch I noticed that both brin_bloom.c and
brin_minmax_multi.c seem to have cargo-culted the idea that support
procs can be optional; both of them have a single support proc
(PROCNUM_HASH and PROCNUM_DISTANCE, respectively) which is not optional;
so the "extra_proc_missing" stuff seems unnecessary for them. And
therefore the new missing_ok support I introduce in this patch would
also be unnecessary. As far as I can tell, there are no ABI concerns
about removing opaque->extra_proc_missing[] from both these opclass
scaffolds (and reintroducing it later, if we determine that we do need
optional support procs in those scaffolds after all.)

Yeah, copy-paste is my basic coding tool, so it's quite possible the
bloom and minmax_multi could do stuff in a simpler way.

I can't think of a reason why would removing extra_proc_missing[] break
ABI, but I don't quite see the need to change this in backbranches. And
if we want to stop doing unnecessary stuff, maybe it'd be enough to just
remove the code, but leave extra_proc_missing in the struct.

regards

--
Tomas Vondra

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#5)
Re: BRIN index creation on geometry column causes crash

On 2025-Feb-12, Tomas Vondra wrote:

On 2/12/25 11:55, Álvaro Herrera wrote:

Hmm, I completely agree that this is a PostGIS bug, but at the same time
I think it's wrong for Postgres not to verify that the support function
exists, so that we can throw an error instead of crashing.

Yeah, I'm not opposed to doing this. I don't know what's the right level
of paranoia when dealing with user-defined stuff.

Yeah, I think "make sure we don't crash" is better. I've pushed this to
all live branches now.

But if I get this right, this is merely a mitigation turning a crash
into ERROR, it would not help with detecting the opclass brokenness any
earlier. Until the parallel builds it was very rare to need the MERGE.

True. The final fix belongs in the PostGIS opclass definition. With
this patch, you can still have an index defined with a bogus opclass,
but we won't crash anymore.

My next question is why didn't brinvalidate detect this problem.

I don't know. Either no one tried running it, or maybe it wasn't quite
clear which of the reported issues are serious and which can be ignored.
I certainly wasn't quite clear to me when I tried running it, because
all the messages are INFO, it doesn't say which functions are missing,
and at least some of it seemed to be due to cross-type operators.

See this for details:

https://lists.osgeo.org/pipermail/postgis-devel/2025-January/030457.html

Ugh. Yeah, this report does not appear to be good enough -- it's far
too noisy and at the same time not detailed enough. And maybe we're
missing some detailed guidance on what to do about each of those errors.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/