BUG #16076: JIT causes huge delays in a complex query. jit=off solves it.

Started by PG Bug reporting formover 6 years ago5 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16076
Logged by: Yuri Astrakhan
Email address: yuriastrakhan@gmail.com
PostgreSQL version: 12.0
Operating system: Ubuntu 19.10
Description:

(huge thanks to @Komzpa - Darafei Praliaskouski who figured out the
culprit)

PostgreSQL 12.0 (Ubuntu 12.0-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 9.2.1-9ubuntu1) 9.2.1 20191008, 64-bit
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 "
PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.13.1"
LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)"

I have a full OpenStreetMap loaded using OpenMapTiles configuration into
multiple machine configurations. A single query (wrapped as a UDF) generates
the whole MVT tile blob, and uses many layers, sub-queries, and UDFs. When
JIT is enabled, the query takes 5-10 seconds to run. Disabling JIT makes the
query run in milliseconds. Same queries on PostgreSQL 10 + Postgis
2.5.2/GEOS 3.6 have no issues (as there was no JIT).

Details:
The query is a concatenation of multiple map layers (binary blobs in MVT
format from ST_AsMVT()). All layers are "UNION ALL-ed" from subqueries, one
query per layer. The "explain analyze" shows huge cost of the first layer
query, regardless of the order of those layers. Every layer added about
200-300ms to the execution (not planning!) time, totalling about 5-7
seconds. Yet, I noticed that when I was testing a portion of layers,
removing layer_waterway made query run in milliseconds (I suspect there are
more than one layer that causes this bug).

I will try to reproduce this error using OpenMapTiles's quickstart setup,
but that will involve some heavy customization. Is this a known bug, or
should I try to get an easy to set up repo? Thanks!

* "perf top" results -
https://gist.github.com/nyurik/fa0cb0dd935d50e796cd5d58d8d41067
* A few explore analyze runs with different layer orderings:
* https://explain.depesz.com/s/fecC
* https://explain.depesz.com/s/MpMo
* https://explain.depesz.com/s/d7RU
* main query -
https://gist.github.com/nyurik/ed80c5e6d20f67fd14d37ea2503a9cbd
* helpers - https://gist.github.com/nyurik/00c4c9d69b7d4fa67c86159b9ce83027

#2Andres Freund
andres@anarazel.de
In reply to: PG Bug reporting form (#1)
Re: BUG #16076: JIT causes huge delays in a complex query. jit=off solves it.

Hi,

On 2019-10-24 15:12:16 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16076
Logged by: Yuri Astrakhan
Email address: yuriastrakhan@gmail.com
PostgreSQL version: 12.0
Operating system: Ubuntu 19.10
Description:

(huge thanks to @Komzpa - Darafei Praliaskouski who figured out the
culprit)

PostgreSQL 12.0 (Ubuntu 12.0-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 9.2.1-9ubuntu1) 9.2.1 20191008, 64-bit
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 "
PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.13.1"
LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)"

I have a full OpenStreetMap loaded using OpenMapTiles configuration into
multiple machine configurations. A single query (wrapped as a UDF) generates
the whole MVT tile blob, and uses many layers, sub-queries, and UDFs. When
JIT is enabled, the query takes 5-10 seconds to run. Disabling JIT makes the
query run in milliseconds. Same queries on PostgreSQL 10 + Postgis
2.5.2/GEOS 3.6 have no issues (as there was no JIT).

Ugh.

Details:
The query is a concatenation of multiple map layers (binary blobs in MVT
format from ST_AsMVT()). All layers are "UNION ALL-ed" from subqueries, one
query per layer. The "explain analyze" shows huge cost of the first layer
query, regardless of the order of those layers. Every layer added about
200-300ms to the execution (not planning!) time, totalling about 5-7
seconds. Yet, I noticed that when I was testing a portion of layers,
removing layer_waterway made query run in milliseconds (I suspect there are
more than one layer that causes this bug).

I will try to reproduce this error using OpenMapTiles's quickstart setup,
but that will involve some heavy customization. Is this a known bug, or
should I try to get an easy to set up repo? Thanks!

A repro would be very helpful.

* "perf top" results -
https://gist.github.com/nyurik/fa0cb0dd935d50e796cd5d58d8d41067

Hm. That's a flat looking profile, making it hard to glean information
about what's going wrong. It might just be the size of the emitted
code...

If it's too hard to generate a quick repro, it might already help if you
sent me the bitcode for such a slow query. You can do so by setting
'jit_dump_bitcode' and looking for the *.bc files in the data
directory.

* A few explore analyze runs with different layer orderings:
* https://explain.depesz.com/s/fecC
* https://explain.depesz.com/s/MpMo
* https://explain.depesz.com/s/d7RU
* main query -
https://gist.github.com/nyurik/ed80c5e6d20f67fd14d37ea2503a9cbd
* helpers - https://gist.github.com/nyurik/00c4c9d69b7d4fa67c86159b9ce83027

This looks like it creates pretty darn huge queries. There's a *lot* of
table references here, and each of them has their own targetlist etc.

I want to figure out why it's *this* slow, but this many expressions are
going to have overhead to JIT. There's 330 generated functions in that
last explain. A lot of them never actually executed, because there's
basically no content. Presumably there's a lot of redundancy, with the
expressions being basically the same - but we're unable to take
advantage of that in 12, due to the pointer constants making the
expressions not actually be the same, c.f. patch at[1]/messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de (which also is
why there's no caching).

Unfortunately I don't think there's going to be full blown fix for this
in 12, although we might be able to improve things a bit. So you'll
probably end up having to just disable jit. But if you can try to get me
enough data to work with, the situation in 13 might be better.

Greetings,

Andres Freund

[1]: /messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de

#3Yuri Astrakhan
yuriastrakhan@gmail.com
In reply to: Andres Freund (#2)
Re: BUG #16076: JIT causes huge delays in a complex query. jit=off solves it.

Thanks Andres, I sent you the 2 BC files (optimized and non-optimized)
directly. Let me know if you can figure out the problem (wasn't sure if its
ok to spam the list with 2MBs of it :) ). Thanks!

On Thu, Oct 24, 2019 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:

Show quoted text

Hi,

On 2019-10-24 15:12:16 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16076
Logged by: Yuri Astrakhan
Email address: yuriastrakhan@gmail.com
PostgreSQL version: 12.0
Operating system: Ubuntu 19.10
Description:

(huge thanks to @Komzpa - Darafei Praliaskouski who figured out the
culprit)

PostgreSQL 12.0 (Ubuntu 12.0-2.pgdg19.10+1) on x86_64-pc-linux-gnu,

compiled

by gcc (Ubuntu 9.2.1-9ubuntu1) 9.2.1 20191008, 64-bit
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 "
PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.13.1"
LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)"

I have a full OpenStreetMap loaded using OpenMapTiles configuration into
multiple machine configurations. A single query (wrapped as a UDF)

generates

the whole MVT tile blob, and uses many layers, sub-queries, and UDFs.

When

JIT is enabled, the query takes 5-10 seconds to run. Disabling JIT makes

the

query run in milliseconds. Same queries on PostgreSQL 10 + Postgis
2.5.2/GEOS 3.6 have no issues (as there was no JIT).

Ugh.

Details:
The query is a concatenation of multiple map layers (binary blobs in MVT
format from ST_AsMVT()). All layers are "UNION ALL-ed" from subqueries,

one

query per layer. The "explain analyze" shows huge cost of the first layer
query, regardless of the order of those layers. Every layer added about
200-300ms to the execution (not planning!) time, totalling about 5-7
seconds. Yet, I noticed that when I was testing a portion of layers,
removing layer_waterway made query run in milliseconds (I suspect there

are

more than one layer that causes this bug).

I will try to reproduce this error using OpenMapTiles's quickstart setup,
but that will involve some heavy customization. Is this a known bug, or
should I try to get an easy to set up repo? Thanks!

A repro would be very helpful.

* "perf top" results -
https://gist.github.com/nyurik/fa0cb0dd935d50e796cd5d58d8d41067

Hm. That's a flat looking profile, making it hard to glean information
about what's going wrong. It might just be the size of the emitted
code...

If it's too hard to generate a quick repro, it might already help if you
sent me the bitcode for such a slow query. You can do so by setting
'jit_dump_bitcode' and looking for the *.bc files in the data
directory.

* A few explore analyze runs with different layer orderings:
* https://explain.depesz.com/s/fecC
* https://explain.depesz.com/s/MpMo
* https://explain.depesz.com/s/d7RU
* main query -
https://gist.github.com/nyurik/ed80c5e6d20f67fd14d37ea2503a9cbd
* helpers -

https://gist.github.com/nyurik/00c4c9d69b7d4fa67c86159b9ce83027

This looks like it creates pretty darn huge queries. There's a *lot* of
table references here, and each of them has their own targetlist etc.

I want to figure out why it's *this* slow, but this many expressions are
going to have overhead to JIT. There's 330 generated functions in that
last explain. A lot of them never actually executed, because there's
basically no content. Presumably there's a lot of redundancy, with the
expressions being basically the same - but we're unable to take
advantage of that in 12, due to the pointer constants making the
expressions not actually be the same, c.f. patch at[1] (which also is
why there's no caching).

Unfortunately I don't think there's going to be full blown fix for this
in 12, although we might be able to improve things a bit. So you'll
probably end up having to just disable jit. But if you can try to get me
enough data to work with, the situation in 13 might be better.

Greetings,

Andres Freund

[1]
/messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de

#4Yuri Astrakhan
yuriastrakhan@gmail.com
In reply to: Andres Freund (#2)
Re: BUG #16076: JIT causes huge delays in a complex query. jit=off solves it.

Hi Andres, I just heard v14 beta is out (congrats!). Do you know if the
very large query optimizations you mentioned (see last paragraph) have made
it into v13 or v14b1? Thanks!

On Thu, Oct 24, 2019 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:

Show quoted text

Hi,

On 2019-10-24 15:12:16 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16076
Logged by: Yuri Astrakhan
Email address: yuriastrakhan@gmail.com
PostgreSQL version: 12.0
Operating system: Ubuntu 19.10
Description:

(huge thanks to @Komzpa - Darafei Praliaskouski who figured out the
culprit)

PostgreSQL 12.0 (Ubuntu 12.0-2.pgdg19.10+1) on x86_64-pc-linux-gnu,

compiled

by gcc (Ubuntu 9.2.1-9ubuntu1) 9.2.1 20191008, 64-bit
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 "
PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.13.1"
LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)"

I have a full OpenStreetMap loaded using OpenMapTiles configuration into
multiple machine configurations. A single query (wrapped as a UDF)

generates

the whole MVT tile blob, and uses many layers, sub-queries, and UDFs.

When

JIT is enabled, the query takes 5-10 seconds to run. Disabling JIT makes

the

query run in milliseconds. Same queries on PostgreSQL 10 + Postgis
2.5.2/GEOS 3.6 have no issues (as there was no JIT).

Ugh.

Details:
The query is a concatenation of multiple map layers (binary blobs in MVT
format from ST_AsMVT()). All layers are "UNION ALL-ed" from subqueries,

one

query per layer. The "explain analyze" shows huge cost of the first layer
query, regardless of the order of those layers. Every layer added about
200-300ms to the execution (not planning!) time, totalling about 5-7
seconds. Yet, I noticed that when I was testing a portion of layers,
removing layer_waterway made query run in milliseconds (I suspect there

are

more than one layer that causes this bug).

I will try to reproduce this error using OpenMapTiles's quickstart setup,
but that will involve some heavy customization. Is this a known bug, or
should I try to get an easy to set up repo? Thanks!

A repro would be very helpful.

* "perf top" results -
https://gist.github.com/nyurik/fa0cb0dd935d50e796cd5d58d8d41067

Hm. That's a flat looking profile, making it hard to glean information
about what's going wrong. It might just be the size of the emitted
code...

If it's too hard to generate a quick repro, it might already help if you
sent me the bitcode for such a slow query. You can do so by setting
'jit_dump_bitcode' and looking for the *.bc files in the data
directory.

* A few explore analyze runs with different layer orderings:
* https://explain.depesz.com/s/fecC
* https://explain.depesz.com/s/MpMo
* https://explain.depesz.com/s/d7RU
* main query -
https://gist.github.com/nyurik/ed80c5e6d20f67fd14d37ea2503a9cbd
* helpers -

https://gist.github.com/nyurik/00c4c9d69b7d4fa67c86159b9ce83027

This looks like it creates pretty darn huge queries. There's a *lot* of
table references here, and each of them has their own targetlist etc.

I want to figure out why it's *this* slow, but this many expressions are
going to have overhead to JIT. There's 330 generated functions in that
last explain. A lot of them never actually executed, because there's
basically no content. Presumably there's a lot of redundancy, with the
expressions being basically the same - but we're unable to take
advantage of that in 12, due to the pointer constants making the
expressions not actually be the same, c.f. patch at[1] (which also is
why there's no caching).

Unfortunately I don't think there's going to be full blown fix for this
in 12, although we might be able to improve things a bit. So you'll
probably end up having to just disable jit. But if you can try to get me
enough data to work with, the situation in 13 might be better.

Greetings,

Andres Freund

[1]
/messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de

#5Andres Freund
andres@anarazel.de
In reply to: Yuri Astrakhan (#4)
Re: BUG #16076: JIT causes huge delays in a complex query. jit=off solves it.

Hi,

I'm afraid not :(. I really could use some help around this...

Regards,

Andres

On 2021-05-21 17:54:16 -0400, Yuri Astrakhan wrote:

Hi Andres, I just heard v14 beta is out (congrats!). Do you know if the
very large query optimizations you mentioned (see last paragraph) have made
it into v13 or v14b1? Thanks!

On Thu, Oct 24, 2019 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-10-24 15:12:16 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16076
Logged by: Yuri Astrakhan
Email address: yuriastrakhan@gmail.com
PostgreSQL version: 12.0
Operating system: Ubuntu 19.10
Description:

(huge thanks to @Komzpa - Darafei Praliaskouski who figured out the
culprit)

PostgreSQL 12.0 (Ubuntu 12.0-2.pgdg19.10+1) on x86_64-pc-linux-gnu,

compiled

by gcc (Ubuntu 9.2.1-9ubuntu1) 9.2.1 20191008, 64-bit
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 "
PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.13.1"
LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)"

I have a full OpenStreetMap loaded using OpenMapTiles configuration into
multiple machine configurations. A single query (wrapped as a UDF)

generates

the whole MVT tile blob, and uses many layers, sub-queries, and UDFs.

When

JIT is enabled, the query takes 5-10 seconds to run. Disabling JIT makes

the

query run in milliseconds. Same queries on PostgreSQL 10 + Postgis
2.5.2/GEOS 3.6 have no issues (as there was no JIT).

Ugh.

Details:
The query is a concatenation of multiple map layers (binary blobs in MVT
format from ST_AsMVT()). All layers are "UNION ALL-ed" from subqueries,

one

query per layer. The "explain analyze" shows huge cost of the first layer
query, regardless of the order of those layers. Every layer added about
200-300ms to the execution (not planning!) time, totalling about 5-7
seconds. Yet, I noticed that when I was testing a portion of layers,
removing layer_waterway made query run in milliseconds (I suspect there

are

more than one layer that causes this bug).

I will try to reproduce this error using OpenMapTiles's quickstart setup,
but that will involve some heavy customization. Is this a known bug, or
should I try to get an easy to set up repo? Thanks!

A repro would be very helpful.

* "perf top" results -
https://gist.github.com/nyurik/fa0cb0dd935d50e796cd5d58d8d41067

Hm. That's a flat looking profile, making it hard to glean information
about what's going wrong. It might just be the size of the emitted
code...

If it's too hard to generate a quick repro, it might already help if you
sent me the bitcode for such a slow query. You can do so by setting
'jit_dump_bitcode' and looking for the *.bc files in the data
directory.

* A few explore analyze runs with different layer orderings:
* https://explain.depesz.com/s/fecC
* https://explain.depesz.com/s/MpMo
* https://explain.depesz.com/s/d7RU
* main query -
https://gist.github.com/nyurik/ed80c5e6d20f67fd14d37ea2503a9cbd
* helpers -

https://gist.github.com/nyurik/00c4c9d69b7d4fa67c86159b9ce83027

This looks like it creates pretty darn huge queries. There's a *lot* of
table references here, and each of them has their own targetlist etc.

I want to figure out why it's *this* slow, but this many expressions are
going to have overhead to JIT. There's 330 generated functions in that
last explain. A lot of them never actually executed, because there's
basically no content. Presumably there's a lot of redundancy, with the
expressions being basically the same - but we're unable to take
advantage of that in 12, due to the pointer constants making the
expressions not actually be the same, c.f. patch at[1] (which also is
why there's no caching).

Unfortunately I don't think there's going to be full blown fix for this
in 12, although we might be able to improve things a bit. So you'll
probably end up having to just disable jit. But if you can try to get me
enough data to work with, the situation in 13 might be better.

Greetings,

Andres Freund

[1]
/messages/by-id/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de

Greetings,

Andres Freund