BUG #16707: Memory leak

Started by PG Bug reporting form · over 5 years ago · 21 messages · bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16707
Logged by: Kurt Roeckx
Email address: kurt@roeckx.be
PostgreSQL version: 12.4
Operating system: Debian
Description:

Hi,

It seems that since I switched from PostgreSQL 9.5 to 12, I started to have
a memory leak. This results in PostgreSQL dying because it can no longer
allocate memory; it then restarts, and most of the time everything just
recovers, which is why I'm rather late in noticing the problem.

I migrated early April to 12.2-2.pgdg100+1, and I'm currently running
12.4-1.pgdg100+1.

I'm running 2 databases on that server, but I only seem to have issues
with 1 of them: only the processes that connect to that database seem to
have a memory leak.

The only database I have a problem with is the one used by synapse
(https://github.com/matrix-org/synapse).

My configuration has:
shared_buffers = 4096MB
work_mem = 64MB
maintenance_work_mem = 64MB

The VSZ started around 4.3 GB and is now at 5.4 GB, of which 5.1 GB is RSS.
The VSZ and RSS grow over time. There are 10 such processes, all connections
to the synapse database.
All other processes, including a connection to the other database, still have
a VSZ of 4.3 GB. The stats collector is the exception, with only 64 MB VSZ.
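
For reference, the per-process numbers above can be sampled with something
along these lines; the -C filter just picks out processes named postgres:

$ ps -o pid,vsz,rss,cmd -C postgres    # VSZ and RSS are reported in kB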

I set log_statement to "all" for a short while to see the queries that
were happening. The only type of query I see that's different from what all
the other databases do is that it uses "BEGIN ISOLATION LEVEL REPEATABLE
READ", while all the others just use "BEGIN". Not sure if that's related or
not.
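
A sketch of one way to toggle that logging, assuming superuser access
(reverting works the same way, with the default log_statement = 'none'):

$ psql -c "ALTER SYSTEM SET log_statement = 'all';"    # log every statement
$ psql -c "SELECT pg_reload_conf();"                   # apply without a restart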

Kurt

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #16707: Memory leak

PG Bug reporting form <noreply@postgresql.org> writes:

It seems that since I switched from PostgreSQL 9.5 to 12, I started to have
a memory leak.

Could you maybe collect a memory map from one of the bloated processes?
It'd go something like

* attach to process with gdb

* "call MemoryContextStats(TopMemoryContext)"

* quit gdb

* look in postmaster log for memory dump

You probably need to have debug symbols installed for step 2 to work.
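
A minimal sketch of that session, assuming the Debian debug symbol packages
(e.g. postgresql-12-dbgsym) are installed:

$ gdb -p 10000                                      # attach to one of the bloated backends
(gdb) call MemoryContextStats(TopMemoryContext)     # writes the dump to the server log
(gdb) quit                                          # detaches from the process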

regards, tom lane

#3 Kurt Roeckx
kurt@roeckx.be
In reply to: Tom Lane (#2)
Re: BUG #16707: Memory leak

On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:

* look in postmaster log for memory dump

TopMemoryContext: 154592 total in 8 blocks; 40296 free (89 chunks); 114296 used
RI compare cache: 16384 total in 2 blocks; 6664 free (3 chunks); 9720 used
RI query cache: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
RI constraint cache: 40888 total in 2 blocks; 2624 free (0 chunks); 38264 used
CFuncHash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
Tsearch dictionary cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
Tsearch parser cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
Tsearch configuration cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
Sequence values: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
Btree proof lookup cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
smgr relation table: 65536 total in 4 blocks; 16664 free (14 chunks); 48872 used
TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
Relcache by OID: 32768 total in 3 blocks; 11536 free (6 chunks); 21232 used
CacheMemoryContext: 2340096 total in 18 blocks; 191496 free (53 chunks); 2148600 used
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_push_actions_staging_id
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_id
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_room
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_forward_extremities_event_id_room_id_key
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_stream_ordering
index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_push_actions_rm_tokens
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_room_id_user_id
index info: 3072 total in 2 blocks; 1104 free (1 chunks); 1968 used: event_id_user_id_profile_tag_uniqueness
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_stats_historical_end_ts
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_stats_historical_pkey
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_txns_id
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: application_services_txns_as_id_txn_id_key
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: event_push_summary_user_rm
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: presence_stream_user_id
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: presence_stream_id
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_inbox_stream_id
index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: device_inbox_user_stream_id
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: user_directory_stream_pos_lock_key
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: stream_positions_idx
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: stats_incremental_position_lock_key
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: room_stats_current_pkey
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_state_pkey
index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_statistic_relid_att_inh_index
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: appservice_stream_position_lock_key
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: federation_stream_position_instance
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: users_in_public_rooms_r_idx
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: users_in_public_rooms_u_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_user_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_room_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_search_user_idx
index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: user_directory_search_fts_idx
index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: user_ips_user_token_ip_unique_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_ips_last_seen_only
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_ips_last_seen
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_signature_stream_idx
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: local_current_membership_room_idx
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: local_current_membership_idx
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: server_signature_keys_server_name_key_id_key
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_current_pkey
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_uniqueness
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_remote_cache_unique_id
index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: threepid_validation_token_session_id
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: threepid_validation_token_pkey
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: destinations_pkey
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: stream_ordering_to_exterm_rm_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: stream_ordering_to_exterm_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: received_transactions_ts
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: received_transactions_transaction_id_origin_key
index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: e2e_cross_signing_keys_idx
index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: server_keys_json_uniqueness
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_time_stamp
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_users
index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: current_state_events_room_id_type_state_key_key
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: current_state_events_event_id_key
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: event_push_summary_stream_ordering_lock_key
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_70339161_index
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_inbox_sender_id
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_stream_id
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_uniqueness
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_v_idx
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_u_idx
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: group_attestations_renewals_g_idx
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: blocked_rooms_idx
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: ui_auth_sessions_session_id_key
index info: 1024 total in 1 blocks; 64 free (0 chunks); 960 used: room_aliases_id
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_aliases_room_alias_key
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: ex_outlier_stream_pkey
index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: e2e_room_keys_with_version_idx
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_account_data_stream_id
index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_account_data_uniqueness
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_daily_visits_ts_idx
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_daily_visits_uts_idx
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: private_user_data_max_stream_id_lock_key
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_last_success_unique_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_federation_outbox_id
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_outbox_destination_id
index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: device_lists_outbound_pokes_stream
index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_pokes_user
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_lists_outbound_pokes_id
index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: e2e_cross_signing_signatures2_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_historical_end_ts
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_stats_historical_pkey
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_15165859_index
index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: event_txn_id_ts
index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_txn_id_txn_id
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_txn_id_event_id
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_time
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_user_id
TS dictionary: 1024 total in 1 blocks; 696 free (0 chunks); 328 used: simple
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_event_id_idx
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_ev_ridx
index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: event_search_fts_idx
TS dictionary: 8192 total in 4 blocks; 2864 free (9 chunks); 5328 used: english_stem
index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_tag_uniqueness
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: profiles_user_id_key
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_device_keys_json_uniqueness
index info: 3072 total in 2 blocks; 984 free (1 chunks); 2088 used: remote_media_repository_thumbn_media_origin_id_width_height_met
index info: 3072 total in 2 blocks; 1048 free (1 chunks); 2024 used: remote_media_cache_thumbnails_media_origin_media_id_thumbna_key
index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_room_keys_versions_idx
150 more child contexts containing 287744 total in 275 blocks; 108096 free (155 chunks); 179648 used
WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
MdSmgr: 32768 total in 3 blocks; 19048 free (15 chunks); 13720 used
LOCALLOCK hash: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 10000 2.6 16.3 5547172 5374656 ? Ss Nov08 54:10 postgres: synapse synapse [local] idle

Kurt

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kurt Roeckx (#3)
Re: BUG #16707: Memory leak

Kurt Roeckx <kurt@roeckx.be> writes:

On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:

* look in postmaster log for memory dump

...
Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 10000 2.6 16.3 5547172 5374656 ? Ss Nov08 54:10 postgres: synapse synapse [local] idle

Hm. It would seem that whatever you're leaking was not allocated via
palloc. Have you got any extensions loaded into that backend?

It's also worth noting that if you've got 4GB of shared buffers,
a total process vsize of 5.3GB doesn't seem all that far out of
line. I'm not quite convinced that you have a leak at all,
as opposed to processes gradually touching more and more of the
shared buffer arena.

regards, tom lane

#5 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: BUG #16707: Memory leak

Hi,

On 2020-11-09 17:20:37 -0500, Tom Lane wrote:

Kurt Roeckx <kurt@roeckx.be> writes:

Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 10000 2.6 16.3 5547172 5374656 ? Ss Nov08 54:10 postgres: synapse synapse [local] idle

Hm. It would seem that whatever you're leaking was not allocated via
palloc. Have you got any extensions loaded into that backend?

It's also worth noting that if you've got 4GB of shared buffers,
a total process vsize of 5.3GB doesn't seem all that far out of
line. I'm not quite convinced that you have a leak at all,
as opposed to processes gradually touching more and more of the
shared buffer arena.

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

Anon are allocations and some other small stuff, RssFile is memory
mapped files, shmem is shared memory (but 0 when huge pages are used).

Greetings,

Andres Freund

#6 Kurt Roeckx
kurt@roeckx.be
In reply to: Tom Lane (#4)
Re: BUG #16707: Memory leak

On Mon, Nov 09, 2020 at 05:20:37PM -0500, Tom Lane wrote:

Kurt Roeckx <kurt@roeckx.be> writes:

On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:

* look in postmaster log for memory dump

...
Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 10000 2.6 16.3 5547172 5374656 ? Ss Nov08 54:10 postgres: synapse synapse [local] idle

Hm. It would seem that whatever you're leaking was not allocated via
palloc. Have you got any extensions loaded into that backend?

synapse=> \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(1 row)

It's also worth noting that if you've got 4GB of shared buffers,
a total process vsize of 5.3GB doesn't seem all that far out of
line. I'm not quite convinced that you have a leak at all,
as opposed to processes gradually touching more and more of the
shared buffer arena.

Top says the shared size is now 4.1 GB, and has said so for a
while. On the other hand, the virtual size keeps growing. The
virtual size shouldn't change much over time. The resident size
will grow until the shared buffer is fully used, then should stay
around that value.

ps now says:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 10000 2.2 16.6 5632572 5468232 ? Ss Nov08 61:09 postgres: synapse synapse [local] idle

At some point all 32 GB of RAM will get used, it starts to swap
out, and then malloc starts to fail. (I've set Linux to not
allow overcommit (vm.overcommit_memory = 2), otherwise it would be
the OOM killer doing its thing.)
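
For reference, a sketch of that overcommit setting (run as root; the second
line persists it, assuming /etc/sysctl.conf is used):

$ sysctl -w vm.overcommit_memory=2                       # malloc fails instead of the OOM killer firing
$ echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf    # keep the setting across reboots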

This is what another process looks like:
postgres 8966 0.2 3.8 4387236 1276104 ? Ss Nov08 7:10 postgres: kurt certs [local] idle

top says that that process has 1.2 GB shared RAM, which also
matches its resident size.

Kurt

#7 Kurt Roeckx
kurt@roeckx.be
In reply to: Andres Freund (#5)
Re: BUG #16707: Memory leak

On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

RssAnon: 1197064 kB
RssFile: 27420 kB
RssShmem: 4248052 kB

Kurt

#8 Kurt Roeckx
kurt@roeckx.be
In reply to: Kurt Roeckx (#7)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:

On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

RssAnon: 1197064 kB
RssFile: 27420 kB
RssShmem: 4248052 kB

Maybe this is useful:
$ grep kB /proc/10000/status
VmPeak: 5654956 kB
VmSize: 5637004 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5479104 kB
VmRSS: 5472668 kB
RssAnon: 1197196 kB
RssFile: 27420 kB
RssShmem: 4248052 kB
VmData: 1192724 kB
VmStk: 132 kB
VmExe: 5388 kB
VmLib: 68200 kB
VmPTE: 10932 kB
VmSwap: 0 kB
HugetlbPages: 0 kB

Kurt

#9 Kurt Roeckx
kurt@roeckx.be
In reply to: Kurt Roeckx (#8)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 09:19:34AM +0100, Kurt Roeckx wrote:

On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:

On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

RssAnon: 1197064 kB
RssFile: 27420 kB
RssShmem: 4248052 kB

Maybe this is useful:
$ grep kB /proc/10000/status
VmPeak: 5654956 kB
VmSize: 5637004 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5479104 kB
VmRSS: 5472668 kB
RssAnon: 1197196 kB
RssFile: 27420 kB
RssShmem: 4248052 kB
VmData: 1192724 kB
VmStk: 132 kB
VmExe: 5388 kB
VmLib: 68200 kB
VmPTE: 10932 kB
VmSwap: 0 kB
HugetlbPages: 0 kB

The same thing, about 9 hours later:
VmPeak: 5882112 kB
VmSize: 5857276 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5722668 kB
VmRSS: 5697940 kB
RssAnon: 1414988 kB
RssFile: 27472 kB
RssShmem: 4255480 kB
VmData: 1412996 kB
VmStk: 132 kB
VmExe: 5388 kB
VmLib: 68200 kB
VmPTE: 11372 kB
VmSwap: 0 kB
HugetlbPages: 0 kB

Kurt

#10 Andres Freund
andres@anarazel.de
In reply to: Kurt Roeckx (#7)
Re: BUG #16707: Memory leak

Hi,

On 2020-11-10 09:11:20 +0100, Kurt Roeckx wrote:

On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

RssAnon: 1197064 kB
RssFile: 27420 kB
RssShmem: 4248052 kB

Ok, so it's actual allocations that are the problem. What kind of
queries is this workload running?

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?
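
A sketch of one way to flip that without hand-editing postgresql.conf
(superuser assumed; jit takes effect on reload):

$ psql -c "ALTER SYSTEM SET jit = off;"
$ psql -c "SELECT pg_reload_conf();"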

Otherwise it might be useful to collect stack traces for memory
allocations. You could try something like 'heaptrack' or add a perf
probe on malloc, and do a perf profile.

E.g. something like
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process
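
Once samples have been recorded, the allocation call graphs can be browsed
with the usual perf tooling, e.g.:

$ perf report -g    # inspect the malloc call graphs captured in perf.data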

Regards,

Andres

#11 Kurt Roeckx
kurt@roeckx.be
In reply to: Kurt Roeckx (#9)
Re: BUG #16707: Memory leak

I took two copies of /proc/$pid/smaps about 2 hours apart, and a diff
shows:
-5604495f1000-5604886fd000 rw-p 00000000 00:00 0 [heap]
-Size:            1033264 kB
+5604495f1000-56048cc1c000 rw-p 00000000 00:00 0 [heap]
+Size:            1104044 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
-Rss:             1030296 kB
-Pss:             1030296 kB
+Rss:             1101288 kB
+Pss:             1101288 kB
 Shared_Clean:          0 kB
 Shared_Dirty:          0 kB
 Private_Clean:         0 kB
-Private_Dirty:   1030296 kB
-Referenced:       998992 kB
-Anonymous:       1030296 kB
+Private_Dirty:   1101288 kB
+Referenced:      1067220 kB
+Anonymous:       1101288 kB
 LazyFree:              0 kB
 AnonHugePages:         0 kB
 ShmemPmdMapped:        0 kB

So the heap grew by 70 MB.

The new results from status:
VmPeak: 5928312 kB
VmSize: 5928056 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5772336 kB
VmRSS: 5772084 kB
RssAnon: 1485980 kB
RssFile: 27472 kB
RssShmem: 4258632 kB
VmData: 1483776 kB
VmStk: 132 kB
VmExe: 5388 kB
VmLib: 68200 kB
VmPTE: 11512 kB
VmSwap: 0 kB
HugetlbPages: 0 kB

Both seem to say that the heap grew by 70 MB in about 2 hours.
With 10 processes growing at this rate, that is about 8 GB a
day.
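
A sketch of how such snapshots can be captured and compared, with the PID and
interval taken from the numbers above:

$ cp /proc/10000/smaps /tmp/smaps.before
$ sleep 7200                                     # wait ~2 hours
$ cp /proc/10000/smaps /tmp/smaps.after
$ diff -u /tmp/smaps.before /tmp/smaps.after     # the [heap] mapping shows the growth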

Kurt

#12 Kurt Roeckx
kurt@roeckx.be
In reply to: Andres Freund (#10)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 11:35:17AM -0800, Andres Freund wrote:

Hi,

On 2020-11-10 09:11:20 +0100, Kurt Roeckx wrote:

On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon: 6664 kB
RssFile: 69512 kB
RssShmem: 15788 kB

RssAnon: 1197064 kB
RssFile: 27420 kB
RssShmem: 4248052 kB

Ok, so it's actual allocations that are the problem. What kind of
queries is this workload running?

I really have no idea, but I'll try to find out if the jit
change doesn't help.

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

Otherwise it might be useful to collect stack traces for memory
allocations. You could try something like 'heaptrack' or add a perf
probe on malloc, and do a perf profile.

E.g. something like
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process

Let's first see what happens with jit disabled. If I still see it,
I'll try that.

Kurt

#13 Kurt Roeckx
kurt@roeckx.be
In reply to: Kurt Roeckx (#12)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better; it seems to be solved.

Kurt

#14 Andres Freund
andres@anarazel.de
In reply to: Kurt Roeckx (#13)
Re: BUG #16707: Memory leak

Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:

On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better; it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

- Andres
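
A sketch of the settings for that test, again applied via ALTER SYSTEM plus a
reload (superuser assumed):

$ psql -c "ALTER SYSTEM SET jit = on;"
$ psql -c "ALTER SYSTEM SET jit_inline_above_cost = -1;"    # keep JIT, disable only LLVM inlining
$ psql -c "SELECT pg_reload_conf();"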

#15 Kurt Roeckx
kurt@roeckx.be
In reply to: Andres Freund (#14)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:

Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:

On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better; it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

So after about 14 hours, it shows:
$ grep kB /proc/32117/status
VmPeak: 4412436 kB
VmSize: 4389988 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 4261196 kB
VmRSS: 4239936 kB
RssAnon: 6112 kB
RssFile: 7340 kB
RssShmem: 4226484 kB
VmData: 5412 kB
VmStk: 132 kB
VmExe: 5388 kB
VmLib: 13032 kB
VmPTE: 8532 kB
VmSwap: 0 kB
HugetlbPages: 0 kB

The munin graph is also very flat now, so with jit=0, it seems
fixed.

I've just restarted it with
jit=1
jit_inline_above_cost=-1

I will let you know.

Kurt

#16 Kurt Roeckx
kurt@roeckx.be
In reply to: Kurt Roeckx (#15)
Re: BUG #16707: Memory leak

On Wed, Nov 11, 2020 at 10:49:18AM +0100, Kurt Roeckx wrote:

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

I've just restarted it with
jit=1
jit_inline_above_cost=-1

So after more than 2 hours, I see no leak with those settings.

Kurt

#17 laurent.mignon@acsone.eu
laurent.mignon@acsone.eu
In reply to: Kurt Roeckx (#15)
Re: BUG #16707: Memory leak

Hi,

We also encounter this issue on our PostgreSQL 12 database. Disabling jit
inlining seems to drastically reduce the memory leak, but after 4 days
without a restart (compared to the 3 restarts/day before the change) it
seems that we still have a leak. The RAM is full and the swap is almost full.
I hope that a fix will land quickly since this issue has a huge impact on
our business.

Regards

#18 Andres Freund
andres@anarazel.de
In reply to: laurent.mignon@acsone.eu (#17)
Re: BUG #16707: Memory leak

Hi,

On 2020-11-19 09:45:50 -0700, laurent.mignon@acsone.eu wrote:

We also encounter this issue on our PostgreSQL 12 database. Disabling jit
inlining seems to drastically reduce the memory leak, but after 4 days
without a restart (compared to the 3 restarts/day before the change) it
seems that we still have a leak. The RAM is full and the swap is almost
full.

In that case it is a separate issue from Kurt's. Could you follow the
instructions from
/messages/by-id/1999972.1604957673@sss.pgh.pa.us
and
/messages/by-id/20201110043127.sfkyvvjqy7x3er5k@alap3.anarazel.de
and provide the output here?

I hope that a fix will land quickly since this issue has a huge impact on
our business.

It's unfortunately a nontrivial issue - I have a prototype for a fix,
but it's complicated and not yet complete. And unfortunately I've since
injured my hand and am now typing very slowly...

Btw, I'll be more likely to see your reply promptly if you CC me in your
reply.

Greetings,

Andres Freund

#19 Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Andres Freund (#14)
Re: BUG #16707: Memory leak

On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:

Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:

On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better; it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

I was bitten by this too while testing something. I thought it could have been
a problem caused by the support for LLVM 12, but given that this report
is for PG 12, it seems the issue is older than that.

should we document it somewhere?

--
Jaime Casanova
Director of Professional Services
SystemGuards - PostgreSQL Consultants

#20 Andres Freund
andres@anarazel.de
In reply to: Jaime Casanova (#19)
Re: BUG #16707: Memory leak

Hi,

On Wed, Apr 7, 2021, at 09:28, Jaime Casanova wrote:

On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:

Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:

On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better; it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

I was bitten by this too while testing something. I thought it could have been
a problem caused by the support for LLVM 12, but given that this report
is for PG 12, it seems the issue is older than that.

I hope to fix it soon - I have two different mostly working fixes. I'll get back to it once the dust around the freeze settles.

Andres

#21 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#20)