impact join syntax ?? and gist index ??

Started by Marc Millasover 3 years ago8 messagesgeneral

marc.millas@mokadb.com

over 3 years ago

Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry
column y.
gist index on the geometry column.
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows.
no index,
the geometry do contains points.
all geometries are on 4326 srid.

If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by
A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.

if I do create a gist index on geometry column of the big table,
both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

marc.millas@mokadb.com

over 3 years ago

In reply to: Marc Millas (#1)

Re: impact join syntax ?? and gist index ??

Yes, I know the 2 syntax provide a different result: one provides the 6
meaningful lines, the left join do add 5 lines with a count of 0...
...

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 8:46 PM Marc Millas <marc.millas@mokadb.com> wrote:

Show quoted text

Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry
column y.
gist index on the geometry column.
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows.
no index,
the geometry do contains points.
all geometries are on 4326 srid.

If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by
A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.

if I do create a gist index on geometry column of the big table,
both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

marc.millas@mokadb.com

over 3 years ago

In reply to: Marc Millas (#1)

Re: impact join syntax ?? and gist index ??

on postgres 15 and postgis 3.3, with the very same dataset,
without gist index on the 420k rows table, the syntax with the left join
takes 25 seconds, and without 770 ms.
so to get 5 empty lines its 30 times slower.
if I add the gist index, both syntaxes takes 770 ms...

at least, this close the discussion about the versions my project will use
:-)

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 8:46 PM Marc Millas <marc.millas@mokadb.com> wrote:

Show quoted text

Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry
column y.
gist index on the geometry column.
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows.
no index,
the geometry do contains points.
all geometries are on 4326 srid.

If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by
A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.

if I do create a gist index on geometry column of the big table,
both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

ewie@ewie.name

over 3 years ago

In reply to: Marc Millas (#1)

Re: impact join syntax ?? and gist index ??

On 07/01/2023 20:46 CET Marc Millas <marc.millas@mokadb.com> wrote:

Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry column y.
gist index on the geometry column.
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows.
no index,
the geometry do contains points.
all geometries are on 4326 srid.

If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.

if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

Please provide the executions plans for both queries with and without the index on B.y.

--
Erik

marc.millas@mokadb.com

over 3 years ago

In reply to: Erik Wienhold (#4)

Re: impact join syntax ?? and gist index ??

here they are: (I replace the column and table names) also I post 2 more
remarks, one on left join, and one on the test I did on postgres 15 postgis
3.3...

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B
on st_within(B.geom, A.geom) group by A.x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=212638398.98..212701792.16 rows=20 width=16) (actual
time=86717.857..86757.820 rows=11 loops=1)
Group Key: A.x
Buffers: shared hit=4243867
-> Sort (cost=212638398.98..212659529.97 rows=8452398 width=16)
(actual time=86717.851..86727.334 rows=421307 loops=1)
Sort Key: A.x
Sort Method: quicksort Memory: 37963kB
Buffers: shared hit=4243867
-> Nested Loop Left Join (cost=0.00..211521459.31 rows=8452398
width=16) (actual time=17.473..86642.332 rows=421307 loops=1)
Join Filter: st_within(B.geom, A.geom)
Rows Removed by Join Filter: 4229377
Buffers: shared hit=4243867
-> Seq Scan on A (cost=0.00..9.20 rows=20 width=17752)
(actual time=0.009..0.043 rows=11 loops=1)
Buffers: shared hit=9
-> Materialize (cost=0.00..22309.83 rows=422789 width=40)
(actual time=0.001..23.392 rows=422789 loops=11)
Buffers: shared hit=15968
-> Seq Scan on B (cost=0.00..20195.89 rows=422789
width=40) (actual time=0.006..57.651 rows=422789 loops=1)
Buffers: shared hit=15968
Planning Time: 0.693 ms
Execution Time: 86763.087 ms
(19 lignes)

2023=# explain (analyze, buffers) select A.x, count(B.x) from A, B where
st_within(B.geom, A.geom) group by A.x;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=6301606.00..6301608.60 rows=20 width=16)
(actual time=11857.363..11863.212 rows=6 loops=1)
Group Key: A.x
Buffers: shared hit=2128836
-> Gather Merge (cost=6301606.00..6301608.30 rows=20 width=16) (actual
time=11857.359..11863.207 rows=12 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=2128836
-> Sort (cost=6300605.99..6300606.04 rows=20 width=16) (actual
time=11840.355..11840.356 rows=6 loops=2)
Sort Key: A.x
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Buffers: shared hit=2128836
-> Partial HashAggregate (cost=6300605.36..6300605.56
rows=20 width=16) (actual time=11840.331..11840.332 rows=6 loops=2)
Group Key: A.x
Buffers: shared hit=2128825
-> Nested Loop (cost=0.13..6275745.36 rows=4971999
width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
Buffers: shared hit=2128825
-> Parallel Seq Scan on B (cost=0.00..18454.99
rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
Buffers: shared hit=15968
-> Index Scan using A_geom_idx on A
(cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1
loops=422789)
Index Cond: (geom ~ B.geom)
Filter: st_within(B.geom, geom)
Rows Removed by Filter: 0
Buffers: shared hit=2112857
Planning Time: 0.252 ms
Execution Time: 11863.357 ms
(26 lignes)

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 9:40 PM Erik Wienhold <ewie@ewie.name> wrote:

Show quoted text

On 07/01/2023 20:46 CET Marc Millas <marc.millas@mokadb.com> wrote:

Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry

column y.

gist index on the geometry column.
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows.
no index,
the geometry do contains points.
all geometries are on 4326 srid.

If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group

by A.x;

same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very

different.

if I do create a gist index on geometry column of the big table,

both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

Please provide the executions plans for both queries with and without the
index on B.y.

--
Erik

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Marc Millas (#5)

Re: impact join syntax ?? and gist index ??

Marc Millas <marc.millas@mokadb.com> writes:

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B
on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

-> Nested Loop (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
-> Parallel Seq Scan on B (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
-> Index Scan using A_geom_idx on A (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
Index Cond: (geom ~ B.geom)
Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.) There's no
chance for merge or hash join because those require simple equality
join conditions. The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does. But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

regards, tom lane

marc.millas@mokadb.com

over 3 years ago

In reply to: Tom Lane (#6)

Re: impact join syntax ?? and gist index ??

I read your answer, Tom, but I cannot connect it to my measurements: why
adding the index did slow the request twice ??

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 10:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Marc Millas <marc.millas@mokadb.com> writes:

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left

join B

on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

-> Nested Loop (cost=0.13..6275745.36

rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)

-> Parallel Seq Scan on B

(cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859
rows=211395 loops=2)

-> Index Scan using A_geom_idx on A

(cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1
loops=422789)

Index Cond: (geom ~ B.geom)
Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.) There's no
chance for merge or hash join because those require simple equality
join conditions. The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does. But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

regards, tom lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Marc Millas (#7)

Re: impact join syntax ?? and gist index ??

Marc Millas <marc.millas@mokadb.com> writes:

I read your answer, Tom, but I cannot connect it to my measurements: why
adding the index did slow the request twice ??

Are you referring to

if I do create a gist index on geometry column of the big table,
both syntax takes 21 seconds.

? That result is pretty much what I'd expect. If the planner has
to form a nestloop-with-inner-indexscan between a small table and
a big one, it's pretty much always going to prefer to put the small
table on the outside of the loop if it can. The cost of such a
loop is going to be more or less number of outer rows times the
log of the number of inner rows (assuming at great risk of
oversimplification that an index probe into a table of size N
requires about O(log N) work), and it's not hard to see that
S * log(B) is less than B * log(S). You seem to have lucked into a
case where the other way comes out faster despite that, which perhaps
can be explained by buffering effects, but it's not something to bet
on across-the-board.

(If you see this holding consistently, maybe it'd be advisable to
reduce the planner's effective_cache_size parameter to something
closer to shared_buffers, as it seems to indicate that fetches
from kernel space are pretty expensive on your platform.)

regards, tom lane