Oom on temp (un-analyzed table caused by JIT) V16.1

Started by Kirk Wolakalmost 2 years ago21 messages
#1Kirk Wolak
wolakk@gmail.com
1 attachment(s)

Daniel,
You have a commit [1]= 2cf50585e54a7b0c6bc62a087c69043ae57e4252 <https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2cf50585e54a7b0c6bc62a087c69043ae57e4252&gt; that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

It took me a few days to reduce this from actual code that was
experiencing this. If I turn off JIT, the problem goes away. (if I don't
FETCH the first row, the memory loss does not happen. Maybe because
opening a cursor is more decoration/prepare)

I don't have an easy way to test this script right now against the commit.
I am hopeful that your fix fixes this.

This was my first OOM issue in PG in 3yrs of working with it.

The problem goes away if the TABLE is analyzed, or JIT is disabled.

The current script, if run, will consume about 25% of my system memory
(10GB).
Just call the function below until it dies if that's what you need. The
only way to get the memory back down is to close the connection.

SELECT pg_temp.fx(497);

Surprisingly, to me, the report from pg_get_backend_memory_contexts()
doesn't really show "missing memory", which I thought it would. (FWIW, we
caught this with multiple rounds of testing our code, slowing down, then
crashing... Is there ANY way to interrogate that we are above X% of system
memory so we know to let this backend go?)

It takes about 18 minutes to run on my 4 CPU VM.

For now, we are going to add some ANALYZE statements to our code.
We will consider disabling JIT.

Thanks,
Kirk
[1]: = 2cf50585e54a7b0c6bc62a087c69043ae57e4252 <https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2cf50585e54a7b0c6bc62a087c69043ae57e4252&gt;
<https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2cf50585e54a7b0c6bc62a087c69043ae57e4252&gt;

Attachments:

Oom_recreation.sqltext/plain; charset=US-ASCII; name=Oom_recreation.sqlDownload
#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Kirk Wolak (#1)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1

Hi

po 15. 1. 2024 v 7:24 odesílatel Kirk Wolak <wolakk@gmail.com> napsal:

Daniel,
You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

It took me a few days to reduce this from actual code that was
experiencing this. If I turn off JIT, the problem goes away. (if I don't
FETCH the first row, the memory loss does not happen. Maybe because
opening a cursor is more decoration/prepare)

I don't have an easy way to test this script right now against the
commit.
I am hopeful that your fix fixes this.

This was my first OOM issue in PG in 3yrs of working with it.

The problem goes away if the TABLE is analyzed, or JIT is disabled.

The current script, if run, will consume about 25% of my system memory
(10GB).
Just call the function below until it dies if that's what you need. The
only way to get the memory back down is to close the connection.

SELECT pg_temp.fx(497);

Surprisingly, to me, the report from pg_get_backend_memory_contexts()
doesn't really show "missing memory", which I thought it would. (FWIW, we
caught this with multiple rounds of testing our code, slowing down, then
crashing... Is there ANY way to interrogate that we are above X% of system
memory so we know to let this backend go?)

I wrote simple extension that can show memory allocation from system
perspective

https://github.com/okbob/pgmeminfo

It takes about 18 minutes to run on my 4 CPU VM.

For now, we are going to add some ANALYZE statements to our code.

remember - don't run anything without VACUUM ANALYZE.

Without it, the queries can be slow - ANALYZE sets stats, VACUUM prepare
visibility maps - without visibility maps index only scan cannot be used

autovacuum doesn't see into opened transactions, and autovacuum is executed
in 1minute cycles. Autovacuum doesn't see temporary tables too. Temporary
tables (data) are visible only from owner process.

We will consider disabling JIT.

Has sense only for bigger analytics queries.

Regards

Pavel

Show quoted text

Thanks,
Kirk
[1] = 2cf50585e54a7b0c6bc62a087c69043ae57e4252
<https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2cf50585e54a7b0c6bc62a087c69043ae57e4252&gt;

#3Daniel Gustafsson
daniel@yesql.se
In reply to: Kirk Wolak (#1)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

It took me a few days to reduce this from actual code that was experiencing this. If I turn off JIT, the problem goes away. (if I don't FETCH the first row, the memory loss does not happen. Maybe because opening a cursor is more decoration/prepare)

I don't have an easy way to test this script right now against the commit.

There are up to date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source?

https://www.postgresql.org/download/snapshots/

Admittedly I don't know whether those are built with LLVM support but I think
they might be.

I am hopeful that your fix fixes this.

It seems likely, so it would be very valuable if you could try running the pre
-release version of v16 before 16.2 ships to verify this.

--
Daniel Gustafsson

#4Kirk Wolak
wolakk@gmail.com
In reply to: Daniel Gustafsson (#3)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1

On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on
the
v16 HEAD I cannot see any memory growth, so it's plausible that the
upcoming
16.2 will work better in your environment.

The script starts by creating 90 Million rows... In my world that part of
the script, plus the indexes, etc. Takes about 8-9 minutes.
And has no memory loss.

I used the memory reported by HTOP to track the problem. I Forgot to
mention this.
I am curious what you used? (Because it doesn't display symptoms [running
dog slow] until the backend has about 85% of the machines memory)

There are up to date snapshots of the upcoming v16 minor release which
might
make testing easier than building postgres from source?

https://www.postgresql.org/download/snapshots/

Thank you. I have assigned that task to the guy who maintains our
VMs/Environments.
I will report back to you.

Kirk

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Daniel Gustafsson (#3)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1

po 15. 1. 2024 v 15:03 odesílatel Daniel Gustafsson <daniel@yesql.se>
napsal:

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on
the
v16 HEAD I cannot see any memory growth, so it's plausible that the
upcoming
16.2 will work better in your environment.

yes

Show quoted text

It took me a few days to reduce this from actual code that was

experiencing this. If I turn off JIT, the problem goes away. (if I don't
FETCH the first row, the memory loss does not happen. Maybe because
opening a cursor is more decoration/prepare)

I don't have an easy way to test this script right now against the

commit.

There are up to date snapshots of the upcoming v16 minor release which
might
make testing easier than building postgres from source?

https://www.postgresql.org/download/snapshots/

Admittedly I don't know whether those are built with LLVM support but I
think
they might be.

I am hopeful that your fix fixes this.

It seems likely, so it would be very valuable if you could try running the
pre
-release version of v16 before 16.2 ships to verify this.

--
Daniel Gustafsson

#6Daniel Gustafsson
daniel@yesql.se
In reply to: Kirk Wolak (#4)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1

On 15 Jan 2024, at 16:49, Kirk Wolak <wolakk@gmail.com> wrote:
On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote:

The script starts by creating 90 Million rows... In my world that part of the script, plus the indexes, etc. Takes about 8-9 minutes.
And has no memory loss.

That's expected, the memory leak did not affect those operations.

I used the memory reported by HTOP to track the problem. I Forgot to mention this.
I am curious what you used? (Because it doesn't display symptoms [running dog slow] until the backend has about 85% of the machines memory)

I use a combination of tools, in thise case I analyzed a build with Instruments on macOS.

There are up to date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source?

https://www.postgresql.org/download/snapshots/ <https://www.postgresql.org/download/snapshots/&gt;

Thank you. I have assigned that task to the guy who maintains our VMs/Environments.
I will report back to you.

Great!

--
Daniel Gustafsson

#7Kirk Wolak
wolakk@gmail.com
In reply to: Daniel Gustafsson (#3)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on
the
v16 HEAD I cannot see any memory growth, so it's plausible that the
upcoming
16.2 will work better in your environment.

Okay, I took the latest source off of git (17devel) and got it to work
there in a VM.

It appears this issue is fixed. It must have been related to the issue
originally tagged.

Thanks!

#8Daniel Gustafsson
daniel@yesql.se
In reply to: Kirk Wolak (#7)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On 16 Jan 2024, at 02:53, Kirk Wolak <wolakk@gmail.com> wrote:

On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote:

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com <mailto:wolakk@gmail.com>> wrote:

You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

Okay, I took the latest source off of git (17devel) and got it to work there in a VM.

It appears this issue is fixed. It must have been related to the issue originally tagged.

Thanks for testing and confirming! Testing pre-release builds on real life
workloads is invaluable for the development of Postgres so thank you taking the
time.

--
Daniel Gustafsson

#9Kirk Wolak
wolakk@gmail.com
In reply to: Daniel Gustafsson (#8)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On Tue, Jan 16, 2024 at 3:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 16 Jan 2024, at 02:53, Kirk Wolak <wolakk@gmail.com> wrote:

On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se

<mailto:daniel@yesql.se>> wrote:

On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com <mailto:

wolakk@gmail.com>> wrote:

...
Okay, I took the latest source off of git (17devel) and got it to work

there in a VM.

It appears this issue is fixed. It must have been related to the issue

originally tagged.

Thanks for testing and confirming! Testing pre-release builds on real life
workloads is invaluable for the development of Postgres so thank you
taking the
time.

Daniel,
I did a little more checking and the reason I did not see the link MIGHT
be because EXPLAIN did not show a JIT attempt.
I tried to use settings that FORCE a JIT... But to no avail.

I am now concerned that the problem is more hidden in my use case.
Meaning I CANNOT conclude it is fixed.
But I know of NO WAY to force a JIT (I lowered costs to 1, etc. ).

You don't know a way to force at least the JIT analysis to happen?
(because I already knew if JIT was off, the leak wouldn't happen).

Thanks,

Kirk Out!
PS: I assume there is no pg_jit(1) function I can call. LOL

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Kirk Wolak (#9)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:

  I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT attempt.
I tried to use settings that FORCE a JIT...  But to no avail.

  I am now concerned that the problem is more hidden in my use case.  Meaning I CANNOT conclude it is fixed.
But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).

  You don't know a way to force at least the JIT analysis to happen?  (because I already knew if JIT was off, the leak wouldn't happen). 

If you set the limits to 0, you can trigger it easily:

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;

EXPLAIN (ANALYZE) SELECT count(*) FROM foo;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Finalize Aggregate (cost=58889.84..58889.85 rows=1 width=8) (actual time=400.462..418.214 rows=1 loops=1)
-> Gather (cost=58889.62..58889.83 rows=2 width=8) (actual time=400.300..418.190 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=57889.62..57889.64 rows=1 width=8) (actual time=384.876..384.878 rows=1 loops=3)
-> Parallel Seq Scan on foo (cost=0.00..52681.30 rows=2083330 width=0) (actual time=0.028..168.510 rows=1666667 loops=3)
Planning Time: 0.133 ms
JIT:
Functions: 8
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.038 ms, Inlining 279.779 ms, Optimization 38.395 ms, Emission 73.105 ms, Total 392.316 ms
Execution Time: 478.257 ms

Yours,
Laurenz Albe

#11Daniel Gustafsson
daniel@yesql.se
In reply to: Kirk Wolak (#9)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On 19 Jan 2024, at 01:50, Kirk Wolak <wolakk@gmail.com> wrote:

I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT attempt.
I tried to use settings that FORCE a JIT... But to no avail.

Are you sure you are running a JIT enabled server? Did you compile it yourself
or install a snapshot?

You don't know a way to force at least the JIT analysis to happen? (because I already knew if JIT was off, the leak wouldn't happen).

If you set jit_above_cost=0 then postgres will compile a JIT enabled execution
tree. This does bring up an interesting point, I don't think there is a way
for a user to know whether the server is jit enabled or not (apart from
explaining a query with costs adjusted but that's not all that userfriendly).
Maybe we need a way to reliably tell if JIT is active or not.

--
Daniel Gustafsson

#12Michael Banck
mbanck@gmx.net
In reply to: Daniel Gustafsson (#11)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

Hi,

On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:

This does bring up an interesting point, I don't think there is a way
for a user to know whether the server is jit enabled or not (apart
from explaining a query with costs adjusted but that's not all that
userfriendly). Maybe we need a way to reliably tell if JIT is active
or not.

I thought this is what pg_jit_available() is for?

postgres=> SHOW jit;
jit
-----
on
(1 Zeile)

postgres=> SELECT pg_jit_available();
pg_jit_available
------------------
f
(1 Zeile)

Michael

#13Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Banck (#12)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On 19 Jan 2024, at 11:04, Michael Banck <mbanck@gmx.net> wrote:

Hi,

On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:

This does bring up an interesting point, I don't think there is a way
for a user to know whether the server is jit enabled or not (apart
from explaining a query with costs adjusted but that's not all that
userfriendly). Maybe we need a way to reliably tell if JIT is active
or not.

I thought this is what pg_jit_available() is for?

Ah, it is, I completely forgot we had that one. Thanks!

--
Daniel Gustafsson

#14Kirk Wolak
wolakk@gmail.com
In reply to: Laurenz Albe (#10)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On Fri, Jan 19, 2024 at 4:20 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:

I did a little more checking and the reason I did not see the link

MIGHT be because EXPLAIN did not show a JIT attempt.

I tried to use settings that FORCE a JIT... But to no avail.

I am now concerned that the problem is more hidden in my use case.

Meaning I CANNOT conclude it is fixed.

But I know of NO WAY to force a JIT (I lowered costs to 1, etc. ).

You don't know a way to force at least the JIT analysis to happen?

(because I already knew if JIT was off, the leak wouldn't happen).

If you set the limits to 0, you can trigger it easily:

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;

Okay,
I Did exactly this (just now). But the EXPLAIN does not contain the JIT?

-------------------------------------------------------------------------------
Sort (cost=5458075.88..5477861.00 rows=7914047 width=12)
Sort Key: seid
-> HashAggregate (cost=3910139.62..4280784.00 rows=7914047 width=12)
Group Key: seid, fr_field_name, st_field_name
Planned Partitions: 128
-> Seq Scan on parts (cost=0.00..1923249.00 rows=29850000
width=12)
Filter: ((seid <> 497) AND ((partnum)::text >= '1'::text))
(7 rows)

From a FUTURE email, I noticed pg_jit_available()

and it's set to f??

Okay, so does this require a special BUILD command?

postgres=# select pg_jit_available();
pg_jit_available
------------------
f
(1 row)

postgres=# \dconfig *jit*
List of configuration parameters
Parameter | Value
-------------------------+---------
jit | on
jit_above_cost | 100000
jit_debugging_support | off
jit_dump_bitcode | off
jit_expressions | on
jit_inline_above_cost | 500000
jit_optimize_above_cost | 500000
jit_profiling_support | off
jit_provider | llvmjit
jit_tuple_deforming | on
(10 rows)

#15Daniel Gustafsson
daniel@yesql.se
In reply to: Kirk Wolak (#14)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

From a FUTURE email, I noticed pg_jit_available() and it's set to f??

Right, then this installation does not contain the necessary library to JIT
compile the query.

Okay, so does this require a special BUILD command?

Yes, it requires that you compile with --with-llvm. If you don't have llvm in
the PATH you might need to set LLVM_CONFIG to point to your llvm-config binary.
With autotools that would be something like:

./configure <other params> --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson

#16Kirk Wolak
wolakk@gmail.com
In reply to: Daniel Gustafsson (#15)
1 attachment(s)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:

On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

From a FUTURE email, I noticed pg_jit_available() and it's set to f??

Right, then this installation does not contain the necessary library to JIT
compile the query.

Okay, so does this require a special BUILD command?

Yes, it requires that you compile with --with-llvm. If you don't have
llvm in
the PATH you might need to set LLVM_CONFIG to point to your llvm-config
binary.
With autotools that would be something like:

./configure <other params> --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.

I am re-attaching my script. WHICH includes the settings to FORCE JIT.
It also does an EXPLAIN so you can verify that JIT is on (this is what I
added/noticed!)
And it takes over 20 minutes to get this far. It's still running.
I am re-attaching the script. (as I tweaked it).

This creates 90 million rows of data, so it takes a while.
I BELIEVE that it consumes far less memory if you do not fetch any rows (I
had problems reproducing it if no rows were fetched).
So, this may be beyond the planning stages.

Thanks,

Kirk Out!

Attachments:

Oom_recreation.sqltext/plain; charset=US-ASCII; name=Oom_recreation.sqlDownload
#17Kirk Wolak
wolakk@gmail.com
In reply to: Kirk Wolak (#16)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:

On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:

On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

...
./configure <other params> --with-llvm
LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.

Daniel,
In the previous email, I made note that once the JIT was enabled, the
problem exists in 17Devel.
I re-included my script, which forced the JIT to be used...

I attached an updated script that forced the settings.
But this is still leaking memory (outside of the
pg_backend_memory_context() calls).
Probably because it's at the LLVM level? And it does NOT happen from
planning/opening the query. It appears I have to fetch the rows to see the
problem.

Thanks in advance. Let me know if I should be doing something different?

Kirk Out!
PS: I was wondering if we had a function that showed total memory of the
backend. For helping to determine if we might have a 3rd party leak?
[increase in total memory consumed not noticed by
pg_backend_memory_contexts)

#include "postgres.h"
#include <sys/resource.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_backend_memory_footprint);

Datum pg_backend_memory_footprint(PG_FUNCTION_ARGS) {
long memory_usage_bytes = 0;
struct rusage usage;

getrusage(RUSAGE_SELF, &usage);
memory_usage_bytes = usage.ru_maxrss * 1024;

PG_RETURN_INT64(memory_usage_bytes);
}

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Kirk Wolak (#17)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak <wolakk@gmail.com> wrote:

getrusage(RUSAGE_SELF, &usage);
memory_usage_bytes = usage.ru_maxrss * 1024;

FWIW log_statement_stats = on shows that in the logs. See ShowUsage()
in postgres.c.

#19Kirk Wolak
wolakk@gmail.com
In reply to: Thomas Munro (#18)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

On Wed, Jan 24, 2024 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak <wolakk@gmail.com> wrote:

getrusage(RUSAGE_SELF, &usage);
memory_usage_bytes = usage.ru_maxrss * 1024;

FWIW log_statement_stats = on shows that in the logs. See ShowUsage()
in postgres.c.

Thank you for this, here is the *TERMINAL *(Below is the tail of the log).
Notice that the pg_backend_memory_contexts does NOT show the memory
consumed.
But your logging sure did! (I wonder if I enable logging during planning,
but there is like 82,000 cursors being opened... (This removed the FETCH
and still leaks)

7:01:08 kwolak@postgres= # *select pg_temp.fx(497);*
NOTICE: ("9848 kB","10 MB","638 kB")
NOTICE: -----------after close, Count a: 82636, count b: 82636
NOTICE: ("9997 kB","10 MB","648 kB")
fx
----

(1 row)

Time: 525870.117 ms (08:45.870)

*Tail*:

024-01-24 17:01:08.752 EST [28804] DETAIL: ! system usage stats:
! 0.001792 s user, 0.000000 s system, 0.005349 s elapsed
! [560.535969 s user, 31.441656 s system total]
! 185300 kB max resident size
! 232/0 [29219648/54937864] filesystem blocks in/out
! 0/25 [0/1016519] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 10/1 [62671/9660] voluntary/involuntary context switches
2024-01-24 17:01:08.752 EST [28804] STATEMENT: explain SELECT DISTINCT
seid, fr_field_name, st_field_name
FROM pg_temp.parts
WHERE seid <> 497 AND partnum >= '1'
ORDER BY seid;
2024-01-24 17:01:08.759 EST [28804] LOG: QUERY STATISTICS
2024-01-24 17:01:08.759 EST [28804] DETAIL: ! system usage stats:
! 0.006207 s user, 0.000092 s system, 0.006306 s elapsed
! [560.542262 s user, 31.441748 s system total]
!* 185300 kB max resident size*
! 0/0 [29219648/54937864] filesystem blocks in/out
! 0/4 [0/1016523] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/1 [62672/9661] voluntary/involuntary context switches
2024-01-24 17:01:08.759 EST [28804] STATEMENT: SELECT 'pg_temp.fx(497); --
Not run, do \dt+ parts';
2024-01-24 17:04:30.844 EST [28746] LOG: checkpoint starting: time
2024-01-24 17:04:32.931 EST [28746] LOG: checkpoint complete: wrote 21
buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.008 s,
sync=0.006 s, total=2.087 s; sync files=15, longest=0.001 s, average=0.001
s; distance=98 kB, estimate=134 kB; lsn=0/16304D8, redo lsn=0/1630480
2024-01-24 17:11:06.350 EST [28804] LOG: QUERY STATISTICS
2024-01-24 17:11:06.350 EST [28804] DETAIL: ! system usage stats:
! 515.952870 s user, 6.688389 s system, 525.869933 s elapsed
! [1076.495280 s user, 38.130145 s system total]
!* 708104 kB max resident size*
! 370000/3840 [29589648/54941712] filesystem blocks in/out
! 0/327338 [0/1343861] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 22001/5216 [84675/14878] voluntary/involuntary context
switches
2024-01-24 17:11:06.350 EST [28804] STATEMENT: * select pg_temp.fx(497);*
2024-01-24 17:12:16.162 EST [28804] LOG: QUERY STATISTICS
2024-01-24 17:12:16.162 EST [28804] DETAIL: ! system usage stats:
! 1.130029 s user, 0.007727 s system, 1.157486 s elapsed
! [1077.625396 s user, 38.137921 s system total]
! *708104 kB max resident size*
! 992/0 [29590640/54941720] filesystem blocks in/out
! 3/41 [3/1343902] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 9/68 [84685/14946] voluntary/involuntary context switches
2024-01-24 17:12:16.162 EST [28804] STATEMENT: select now();
2024-01-24 17:12:30.944 EST [28804] LOG: QUERY STATISTICS
2024-01-24 17:12:30.944 EST [28804] DETAIL: ! system usage stats:
! 0.004561 s user, 0.000019 s system, 0.004580 s elapsed
! [1077.630064 s user, 38.137944 s system total]
! *708104 kB max resident size*
! 0/0 [29590640/54941728] filesystem blocks in/out
! 0/4 [3/1343906] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [84686/14947] voluntary/involuntary context switches
2024-01-24 17:12:30.944 EST [28804] STATEMENT: select now();

#20Michael Banck
mbanck@gmx.net
In reply to: Kirk Wolak (#17)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

Hi,

On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote:

On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:

On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:

On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.

Daniel,
In the previous email, I made note that once the JIT was enabled, the
problem exists in 17Devel.
I re-included my script, which forced the JIT to be used...

I attached an updated script that forced the settings.
But this is still leaking memory (outside of the
pg_backend_memory_context() calls).
Probably because it's at the LLVM level? And it does NOT happen from
planning/opening the query. It appears I have to fetch the rows to
see the problem.

I had a look at this (and blogged about it here[1]https://www.credativ.de/en/blog/postgresql-en/quick-benchmark-postgresql-2024q1-release-performance-improvements/) and was also
wondering what was going on with 17devel and the recent back-branch
releases, cause I could also reproduce those continuing memory leaks.

Adding some debug logging when llvm_inline_reset_caches() is called
solves the mystery: as you are calling a function, the fix that is in
17devel and the back-branch releases is not applicable and only after
the function returns llvm_inline_reset_caches() is being called (as
llvm_jit_context_in_use_count is greater than zero, presumably, so it
never reaches the call-site of llvm_inline_reset_caches()).

If you instead run your SQL in a DO-loop (as in the blog post) and not
as a PL/PgSQL function, you should see that it no longer leaks. This
might be obvious to some (and Andres mentioned it in
/messages/by-id/20210421002056.gjd6rpe6toumiqd6@alap3.anarazel.de)
but it took me a while to figure out/find.

Michael

[1]: https://www.credativ.de/en/blog/postgresql-en/quick-benchmark-postgresql-2024q1-release-performance-improvements/

#21Kirk Wolak
wolakk@gmail.com
In reply to: Michael Banck (#20)
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

On Thu, Feb 22, 2024 at 4:49 PM Michael Banck <mbanck@gmx.net> wrote:

Hi,

On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote:

On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:

On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se>

wrote:

On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.

...
I had a look at this (and blogged about it here[1]) and was also
wondering what was going on with 17devel and the recent back-branch
releases, cause I could also reproduce those continuing memory leaks.

Adding some debug logging when llvm_inline_reset_caches() is called
solves the mystery: as you are calling a function, the fix that is in
17devel and the back-branch releases is not applicable and only after
the function returns llvm_inline_reset_caches() is being called (as
llvm_jit_context_in_use_count is greater than zero, presumably, so it
never reaches the call-site of llvm_inline_reset_caches()).

If you instead run your SQL in a DO-loop (as in the blog post) and not
as a PL/PgSQL function, you should see that it no longer leaks. This
might be obvious to some (and Andres mentioned it in

/messages/by-id/20210421002056.gjd6rpe6toumiqd6@alap3.anarazel.de
)
but it took me a while to figure out/find.

Thanks for confirming. Inside a routine is required. But disabling JIT

was good enough for us.

Curious if there was another way to end up calling the cleanup? But it had
that "feel" that the context of the cleanup was
being lost between iterations of the loop?