12.2: Howto check memory-leak in worker?

Started by Peteralmost 6 years ago8 messagesgeneral
Jump to latest
#1Peter
pmc@citylink.dinoex.sub.org

Hi all,
I have something that looks a bit insane:

# ps axl | grep 6145
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
770 6145 1 0 20 0 241756 868 select SsJ - 0:24.62 /usr/local/bin/postgres -D
770 6147 6145 0 23 0 243804 109784 select IsJ - 3:18.52 postgres: checkpointer (
770 6148 6145 0 20 0 241756 21348 select SsJ - 2:02.83 postgres: background writer
770 6149 6145 0 20 0 241756 7240 select SsJ - 16:36.80 postgres: walwriter (pos
770 6150 6145 0 20 0 21980 876 select SsJ - 0:13.92 postgres: archiver last w
770 6151 6145 0 20 0 21980 980 select SsJ - 0:58.45 postgres: stats collector
770 6152 6145 0 20 0 241756 1268 select IsJ - 0:02.07 postgres: logical replicati
770 43315 6145 0 21 0 251844 7520 select IsJ - 1:07.74 postgres: admin postgres 19
770 43317 6145 0 25 0 251764 8684 select IsJ - 1:28.89 postgres: admin bareos 192.
770 43596 6145 0 20 0 245620 4476 select IsJ - 0:00.12 postgres: admin bareos 192.
770 43761 6145 0 20 0 245620 4476 select IsJ - 0:00.15 postgres: admin bareos 192.
770 90206 6145 0 52 0 1331256 219720 racct DsJ - 563:45.41 postgres: bareos bareos 192

The 90206 is continuously growing. It is the unspecific, all-purpose
worker for the www.bareos.com backup tool, so it is a bit difficult to
figure what precisely it does - but it tends to be rather simple
straight-forward queries, so it is unlikely to have dozens of "internal sort
operations and hash tables".

What I can say that at times this worker is completely idle in
ClientRead, but does not shrink in memory. Is this a normal behaviour?

Here is a more dynamic picture: it continues to add 2048kB chunks (and
does not do noticeable paging):

UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
Mon May 4 13:33:09 CEST 2020
770 90206 6145 0 91 0 1335352 226900 - RsJ - 569:09.19 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:33:39 CEST 2020
770 90206 6145 0 93 0 1335352 227696 - RsJ - 569:28.48 postgres: bareos bareos idle (postgres)
Mon May 4 13:34:09 CEST 2020
770 90206 6145 0 92 0 1337400 228116 - RsJ - 569:47.46 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:34:39 CEST 2020
770 90206 6145 0 92 0 1337400 228596 - RsJ - 570:06.56 postgres: bareos bareos UPDATE (postgres)
Mon May 4 13:35:09 CEST 2020
770 90206 6145 0 92 0 1337400 228944 - RsJ - 570:25.62 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:35:40 CEST 2020
770 90206 6145 0 52 0 1337400 229288 racct DsJ - 570:44.33 postgres: bareos bareos UPDATE (postgres)
Mon May 4 13:36:10 CEST 2020
770 90206 6145 0 91 0 1337400 229952 - RsJ - 571:03.20 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:36:40 CEST 2020
770 90206 6145 0 52 0 1337400 223772 racct DsJ - 571:21.50 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:37:10 CEST 2020
770 90206 6145 0 91 0 1337400 224448 - RsJ - 571:40.63 postgres: bareos bareos idle (postgres)
Mon May 4 13:37:40 CEST 2020
770 90206 6145 0 91 0 1339448 225464 - RsJ - 571:58.36 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:38:10 CEST 2020
770 90206 6145 0 52 0 1339448 215620 select SsJ - 572:14.24 postgres: bareos bareos idle (postgres)
Mon May 4 13:38:40 CEST 2020
770 90206 6145 0 81 0 1339448 215320 - RsJ - 572:21.09 postgres: bareos bareos idle (postgres)
Mon May 4 13:39:10 CEST 2020

OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit

autovacuum is Disabled.

The memory-specific config is:

shared_buffers = 200MB
temp_buffers = 40MB
work_mem = 80MB
maintenance_work_mem = 250MB
dynamic_shared_memory_type = posix
random_page_cost = 2.0
effective_cache_size = 1GB

(others are left at default)

I remember vaguely that there are means to have a closer look into
what is using the memory, but do not recall the specifics. Some
pointers or ideas to proceed would be gladly appreciated (Dtrace
should work) - processes will usually fail with OOM at this size, due
to machine configuration - I'm waiting for that now (it is a very very
old pentium3 machine ;) ).

cheerio,
PMc

#2Dilip Kumar
dilipbalaut@gmail.com
In reply to: Peter (#1)
Re: 12.2: Howto check memory-leak in worker?

On Mon, May 4, 2020 at 5:43 PM Peter <pmc@citylink.dinoex.sub.org> wrote:

Hi all,
I have something that looks a bit insane:

# ps axl | grep 6145
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
770 6145 1 0 20 0 241756 868 select SsJ - 0:24.62 /usr/local/bin/postgres -D
770 6147 6145 0 23 0 243804 109784 select IsJ - 3:18.52 postgres: checkpointer (
770 6148 6145 0 20 0 241756 21348 select SsJ - 2:02.83 postgres: background writer
770 6149 6145 0 20 0 241756 7240 select SsJ - 16:36.80 postgres: walwriter (pos
770 6150 6145 0 20 0 21980 876 select SsJ - 0:13.92 postgres: archiver last w
770 6151 6145 0 20 0 21980 980 select SsJ - 0:58.45 postgres: stats collector
770 6152 6145 0 20 0 241756 1268 select IsJ - 0:02.07 postgres: logical replicati
770 43315 6145 0 21 0 251844 7520 select IsJ - 1:07.74 postgres: admin postgres 19
770 43317 6145 0 25 0 251764 8684 select IsJ - 1:28.89 postgres: admin bareos 192.
770 43596 6145 0 20 0 245620 4476 select IsJ - 0:00.12 postgres: admin bareos 192.
770 43761 6145 0 20 0 245620 4476 select IsJ - 0:00.15 postgres: admin bareos 192.
770 90206 6145 0 52 0 1331256 219720 racct DsJ - 563:45.41 postgres: bareos bareos 192

The 90206 is continuously growing. It is the unspecific, all-purpose
worker for the www.bareos.com backup tool, so it is a bit difficult to
figure what precisely it does - but it tends to be rather simple
straight-forward queries, so it is unlikely to have dozens of "internal sort
operations and hash tables".

What I can say that at times this worker is completely idle in
ClientRead, but does not shrink in memory. Is this a normal behaviour?

Here is a more dynamic picture: it continues to add 2048kB chunks (and
does not do noticeable paging):

UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
Mon May 4 13:33:09 CEST 2020
770 90206 6145 0 91 0 1335352 226900 - RsJ - 569:09.19 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:33:39 CEST 2020
770 90206 6145 0 93 0 1335352 227696 - RsJ - 569:28.48 postgres: bareos bareos idle (postgres)
Mon May 4 13:34:09 CEST 2020
770 90206 6145 0 92 0 1337400 228116 - RsJ - 569:47.46 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:34:39 CEST 2020
770 90206 6145 0 92 0 1337400 228596 - RsJ - 570:06.56 postgres: bareos bareos UPDATE (postgres)
Mon May 4 13:35:09 CEST 2020
770 90206 6145 0 92 0 1337400 228944 - RsJ - 570:25.62 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:35:40 CEST 2020
770 90206 6145 0 52 0 1337400 229288 racct DsJ - 570:44.33 postgres: bareos bareos UPDATE (postgres)
Mon May 4 13:36:10 CEST 2020
770 90206 6145 0 91 0 1337400 229952 - RsJ - 571:03.20 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:36:40 CEST 2020
770 90206 6145 0 52 0 1337400 223772 racct DsJ - 571:21.50 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:37:10 CEST 2020
770 90206 6145 0 91 0 1337400 224448 - RsJ - 571:40.63 postgres: bareos bareos idle (postgres)
Mon May 4 13:37:40 CEST 2020
770 90206 6145 0 91 0 1339448 225464 - RsJ - 571:58.36 postgres: bareos bareos SELECT (postgres)
Mon May 4 13:38:10 CEST 2020
770 90206 6145 0 52 0 1339448 215620 select SsJ - 572:14.24 postgres: bareos bareos idle (postgres)
Mon May 4 13:38:40 CEST 2020
770 90206 6145 0 81 0 1339448 215320 - RsJ - 572:21.09 postgres: bareos bareos idle (postgres)
Mon May 4 13:39:10 CEST 2020

OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit

autovacuum is Disabled.

The memory-specific config is:

shared_buffers = 200MB
temp_buffers = 40MB
work_mem = 80MB
maintenance_work_mem = 250MB
dynamic_shared_memory_type = posix
random_page_cost = 2.0
effective_cache_size = 1GB

(others are left at default)

I remember vaguely that there are means to have a closer look into
what is using the memory, but do not recall the specifics. Some
pointers or ideas to proceed would be gladly appreciated (Dtrace
should work) - processes will usually fail with OOM at this size, due
to machine configuration - I'm waiting for that now (it is a very very
old pentium3 machine ;) ).

One idea is that you can attach your process in gdb and call
MemoryContextStats(TopMemoryContext). This will show which context is
using how much memory. So basically u can call this function 2-3
times with some interval and see in which context the memory is
continuously increasing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Peter (#1)
Re: 12.2: Howto check memory-leak in worker?

On 5/4/20 4:56 AM, Peter wrote:

Hi all,
I have something that looks a bit insane:

# ps axl | grep 6145
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
770 6145 1 0 20 0 241756 868 select SsJ - 0:24.62 /usr/local/bin/postgres -D
770 6147 6145 0 23 0 243804 109784 select IsJ - 3:18.52 postgres: checkpointer (
770 6148 6145 0 20 0 241756 21348 select SsJ - 2:02.83 postgres: background writer
770 6149 6145 0 20 0 241756 7240 select SsJ - 16:36.80 postgres: walwriter (pos
770 6150 6145 0 20 0 21980 876 select SsJ - 0:13.92 postgres: archiver last w
770 6151 6145 0 20 0 21980 980 select SsJ - 0:58.45 postgres: stats collector
770 6152 6145 0 20 0 241756 1268 select IsJ - 0:02.07 postgres: logical replicati
770 43315 6145 0 21 0 251844 7520 select IsJ - 1:07.74 postgres: admin postgres 19
770 43317 6145 0 25 0 251764 8684 select IsJ - 1:28.89 postgres: admin bareos 192.
770 43596 6145 0 20 0 245620 4476 select IsJ - 0:00.12 postgres: admin bareos 192.
770 43761 6145 0 20 0 245620 4476 select IsJ - 0:00.15 postgres: admin bareos 192.
770 90206 6145 0 52 0 1331256 219720 racct DsJ - 563:45.41 postgres: bareos bareos 192

The 90206 is continuously growing. It is the unspecific, all-purpose
worker for the www.bareos.com backup tool, so it is a bit difficult to
figure what precisely it does - but it tends to be rather simple
straight-forward queries, so it is unlikely to have dozens of "internal sort
operations and hash tables".

Is there any relevant information in the bareos or Postgres logs?

OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit

autovacuum is Disabled.

Any particular reason for above?

cheerio,
PMc

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Peter
pmc@citylink.dinoex.sub.org
In reply to: Adrian Klaver (#3)
Re: 12.2: Howto check memory-leak in worker?

On Mon, May 04, 2020 at 12:55:38PM -0700, Adrian Klaver wrote:

! > The 90206 is continuously growing. It is the unspecific, all-purpose
! > worker for the www.bareos.com backup tool, so it is a bit difficult to
! > figure what precisely it does - but it tends to be rather simple
! > straight-forward queries, so it is unlikely to have dozens of "internal sort
! > operations and hash tables".
! >
!
! Is there any relevant information in the bareos or Postgres logs?

No. Only the final exitus-in-tabula.

! > autovacuum is Disabled.
!
! Any particular reason for above?

Yes. Disk spindown doesn't work so, and I see no reason to have a
backup zone that is used few times a day spinning 24 hours.

BTW, I would greatly appreciate if we would reconsider the need for
the server to read the postmaster.pid file every few seconds (probably
needed for something, I don't know).
That makes it necessary to set atime=off to get a spindown, and I
usually prefer to have atime=on so I can see what my stuff is
currently doing.

cheerio,
PMc

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Peter (#4)
Re: 12.2: Howto check memory-leak in worker?

On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:

BTW, I would greatly appreciate if we would reconsider the need for
the server to read the postmaster.pid file every few seconds (probably
needed for something, I don't know).
That makes it necessary to set atime=off to get a spindown, and I
usually prefer to have atime=on so I can see what my stuff is
currently doing.

That'd be this:

/*
* Once a minute, verify that postmaster.pid hasn't been removed or
* overwritten. If it has, we force a shutdown. This avoids having
* postmasters and child processes hanging around after their database
* is gone, and maybe causing problems if a new database cluster is
* created in the same place. It also provides some protection
* against a DBA foolishly removing postmaster.pid and manually
* starting a new postmaster. Data corruption is likely to ensue from
* that anyway, but we can minimize the damage by aborting ASAP.
*/
if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
{
if (!RecheckDataDirLockFile())

#6Peter
pmc@citylink.dinoex.sub.org
In reply to: Thomas Munro (#5)
Re: 12.2: Howto check memory-leak in worker?

On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
! On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
! > BTW, I would greatly appreciate if we would reconsider the need for
! > the server to read the postmaster.pid file every few seconds (probably
! > needed for something, I don't know).
! > That makes it necessary to set atime=off to get a spindown, and I
! > usually prefer to have atime=on so I can see what my stuff is
! > currently doing.
!
! That'd be this:
!
! /*
! * Once a minute, verify that postmaster.pid hasn't been removed or
! * overwritten. If it has, we force a shutdown. This avoids having
! * postmasters and child processes hanging around after their database
! * is gone, and maybe causing problems if a new database cluster is
! * created in the same place. It also provides some protection
! * against a DBA foolishly removing postmaster.pid and manually
! * starting a new postmaster. Data corruption is likely to ensue from
! * that anyway, but we can minimize the damage by aborting ASAP.
! */
! if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
! {
! if (!RecheckDataDirLockFile())

Ah. Yes. Thanks for identifying this issue.

It is really hard to comment on this in a way that might not be
considered offensive, so lets put it that way: You all have seen
the Apollo-13 movie, so You know that a need to power-down as much
as possible may appear en-route, and in a situation where you have
lots of other issues, so what you need the least is things like
this getting in your way.
I usually take this a bit further and consider only two relevant
quality levels for software: interplanetary and interstellar - where
i.p. is the kind of stuff on which we will trust our lives during
the manned Jupiter mission. PG might well qualify for this, except
with things like the quoted; so I think such should be tagged with
appropriate compile-time switches.

Now concerning the memory leak:
That one was introduced with the work done on the GSSAPI encryption;
it goes away when setting 'hostnogssenc' in pg_hba. (And, whoever has
editor rights, there is also a little fixup needed for that switch in
pg_hba.conf; it should probably be "eleven formats".)

cheerio,
PMc

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter (#6)
Re: 12.2: Howto check memory-leak in worker?

Peter <pmc@citylink.dinoex.sub.org> writes:

On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
! * Once a minute, verify that postmaster.pid hasn't been removed or
! * overwritten. If it has, we force a shutdown.

It is really hard to comment on this in a way that might not be
considered offensive, so lets put it that way: You all have seen
the Apollo-13 movie, so You know that a need to power-down as much
as possible may appear en-route, and in a situation where you have
lots of other issues, so what you need the least is things like
this getting in your way.

Well, the choice we face is preventing somebody's disk from spinning
down, versus preventing somebody else from completely corrupting their
database. From where I sit that's not a difficult choice, nor one
that I feel a need to let users second-guess.

Now concerning the memory leak:
That one was introduced with the work done on the GSSAPI encryption;
it goes away when setting 'hostnogssenc' in pg_hba.

Oooh ... it looks like some of the encryption code paths have neglected
to call gss_release_buffer. Will fix, thanks for the report!

regards, tom lane

#8Peter
pmc@citylink.dinoex.sub.org
In reply to: Tom Lane (#7)
Re: 12.2: Howto check memory-leak in worker?

On Tue, May 05, 2020 at 11:59:27AM -0400, Tom Lane wrote:

! Well, the choice we face is preventing somebody's disk from spinning
! down, versus preventing somebody else from completely corrupting their
! database. From where I sit that's not a difficult choice, nor one
! that I feel a need to let users second-guess.

Then maybe You see a scenario where that feature would actually
prevent db corruption, while I have not yet found a realistic one.
Instead, what gave me headaches is that ZFS might take a single
tablespace (=pool) offline with the cluster continuing to run - and
I am not sure if the db is supposed to survive that (mine did, after I
had hit the power button in horror), nor is it easy to protect from
that.
Anyway, I can now switch that feature off per local patch, which is
the second-best solution.

! Oooh ... it looks like some of the encryption code paths have neglected
! to call gss_release_buffer. Will fix, thanks for the report!

Great! So I assume I don't need to send a bug report I had prepared
interim. Feel free to ask me for anything that might be still needed.

cheerio,
PMc