autovacuum worker running amok - and me too ;)

Started by wambacher · about 11 years ago · 25 messages · general
#1 wambacher
wnordmann@gmx.de

Hi,

I've been running PostgreSQL on Ubuntu for many years, but now I'm in big trouble.

My system has 24GB of real memory, but after some hours one autovacuum worker
is using 80-90% of memory; the OOM killer (out-of-memory killer) kills the
process with kill -9, and the PostgreSQL server restarts because of that.

I changed the base configuration to use very small buffers and restarted the
server twice, but the problem still exists.

I think it's always the same table, and that table is huge: 111GB of data and
3 indexes of 4GB, 128GB and 12GB. It's the table planet_osm_ways from
OpenStreetMap; maybe that helps.

Without the server running, 8GB are used by other programs and 16GB are free.

No idea what's going on.

regards
walter

--
View this message in context: http://postgresql.nabble.com/autovacuum-worker-running-amok-and-me-too-tp5840299.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: wambacher (#1)
Re: autovacuum worker running amok - and me too ;)

wambacher <wnordmann@gmx.de> writes:

My system has 24GB of real memory, but after some hours one autovacuum worker
is using 80-90% of memory; the OOM killer (out-of-memory killer) kills the
process with kill -9, and the PostgreSQL server restarts because of that.

I changed the base configuration to use very small buffers and restarted the
server twice, but the problem still exists.

I think it's always the same table, and that table is huge: 111GB of data and
3 indexes of 4GB, 128GB and 12GB. It's the table planet_osm_ways from
OpenStreetMap; maybe that helps.

Maybe you could reduce the statistics targets for that table.

I think we've heard that the analyze functions for PostGIS data types are
memory hogs, too --- maybe it's worth inquiring about that on the postgis
mailing lists.

regards, tom lane


#3 wambacher
wnordmann@gmx.de
In reply to: Tom Lane (#2)
Re: autovacuum worker running amok - and me too ;)

Tom Lane-2 wrote

Maybe you could reduce the statistics targets for that table.

I don't understand what you mean. Do you mean how often that table is
autovacuumed? At the moment about once a day or once every two days, I think.

I think we've heard that the analyze functions for PostGIS data types are
memory hogs, too --- maybe it's worth inquiring about that on the postgis
mailing lists.

Good idea, and yes, it's a table with PostGIS geometries and the
corresponding indexes.

I'll ask the PostGIS guys too.

At the moment I have disabled autovacuum, but I'll need it again soon,
because the server runs 24/7 and the data changes permanently.

thanks
walter

btw: no big problems in my mini-config, OK?



#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: wambacher (#3)
Re: autovacuum worker running amok - and me too ;)

wambacher <wnordmann@gmx.de> writes:

Tom Lane-2 wrote

Maybe you could reduce the statistics targets for that table.

I don't understand what you mean. Do you mean how often that table is
autovacuumed? At the moment about once a day or once every two days, I think.

No, I mean the amount of stats detail that ANALYZE tries to collect.
I'm guessing that it's not auto-vacuum but auto-analyze that's getting
OOMed.

See ALTER TABLE SET STATISTICS TARGET.
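In concrete SQL, that suggestion might look like the following (a sketch only; the column names are assumed from the usual osm2pgsql schema and are not given in this thread):

```sql
-- Lower the per-column statistics target so ANALYZE samples and stores
-- less detail for the big array columns (the default target is 100):
ALTER TABLE planet_osm_ways ALTER COLUMN nodes SET STATISTICS 10;
ALTER TABLE planet_osm_ways ALTER COLUMN tags  SET STATISTICS 10;

-- Rebuild the statistics with the smaller target:
ANALYZE planet_osm_ways;
```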

regards, tom lane


#5 wambacher
wnordmann@gmx.de
In reply to: Tom Lane (#4)
Re: autovacuum worker running amok - and me too ;)

Tom Lane-2 wrote

See ALTER TABLE SET STATISTICS TARGET.

thanks, will try it

regards
walter

btw: the PostGIS analyze problem was fixed more than a year ago, but
I'll ask them too.



#6 Roxanne Reid-Bennett
rox@tara-lu.com
In reply to: Tom Lane (#2)
Re: autovacuum worker running amok - and me too ;)

On 3/3/2015 6:17 PM, Tom Lane wrote:

wambacher <wnordmann@gmx.de> writes:

My system has 24GB of real memory, but after some hours one autovacuum worker
is using 80-90% of memory; the OOM killer (out-of-memory killer) kills the
process with kill -9, and the PostgreSQL server restarts because of that.
I changed the base configuration to use very small buffers and restarted the
server twice, but the problem still exists.
I think it's always the same table, and that table is huge: 111GB of data and
3 indexes of 4GB, 128GB and 12GB. It's the table planet_osm_ways from
OpenStreetMap; maybe that helps.

Maybe you could reduce the statistics targets for that table.

I think we've heard that the analyze functions for PostGIS data types are
memory hogs, too --- maybe it's worth inquiring about that on the postgis
mailing lists.

Most definitely ask on the PostGIS list. Identify the full PostGIS
and PostgreSQL versions as well.
We had a case during a trial upgrade (9.1 to 9.3, with a PostGIS upgrade
2.0 -> 2.1) where analyze was running amok memory-wise on 3 tables with
geometry types (posted on this list).
Unfortunately the VM that system was on got corrupted, and I wasn't able
to provide the data for a test scenario to Paul Ramsey, so I never filed
the bug report.
(Perhaps the VM was the issue and NOT the upgrade...) At the time, we
ended up NOT upgrading the production box due to this issue potentially
being a show-stopper. I've also never tried to re-create the upgrade
stack to test it out on a current copy of the data.

Roxanne


--
[At other schools] I think the most common fault in general is to teach students how to pass exams instead of teaching them the science.
Donald Knuth


#7 wambacher
wnordmann@gmx.de
In reply to: Roxanne Reid-Bennett (#6)
Re: autovacuum worker running amok - and me too ;)

Roxanne Reid-Bennett wrote

Most definitely ask on the Postgis list. Identify the full Postgis
version and Postgres versions as well.

Hi Roxanne,

Seconds before sending it to the postgis list, I checked the table
planet_osm_ways, and there is no geometry:

So that can't be a PostGIS problem. I'll check fastupdate and the other
hints I got yesterday.

But here are the missing infos:

Regards
walter



#8 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Tom Lane (#2)
Re: autovacuum worker running amok - and me too ;)

Circling back on this one: I had a look at our analyze code. I found
one place where *maybe* we weren't freeing memory, and freed it, but
analyzing a 2M-record table I barely see any bump in memory usage
(from 22MB up to 24MB at peak) during analyze. And the change I made
didn't appear to alter that (though the objects were probably all
small enough that they weren't being detoasted into copies in any
event). Though maybe with a really big table? (With really big
objects?) Still, doesn't analyze just pull a limited sample
(approx. 30K rows max), so why would table size make any difference
beyond a certain point?

P.

On Tue, Mar 3, 2015 at 3:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wambacher <wnordmann@gmx.de> writes:

My system has 24GB of real memory, but after some hours one autovacuum worker
is using 80-90% of memory; the OOM killer (out-of-memory killer) kills the
process with kill -9, and the PostgreSQL server restarts because of that.

I changed the base configuration to use very small buffers and restarted the
server twice, but the problem still exists.

I think it's always the same table, and that table is huge: 111GB of data and
3 indexes of 4GB, 128GB and 12GB. It's the table planet_osm_ways from
OpenStreetMap; maybe that helps.

Maybe you could reduce the statistics targets for that table.

I think we've heard that the analyze functions for PostGIS data types are
memory hogs, too --- maybe it's worth inquiring about that on the postgis
mailing lists.

regards, tom lane



#9 wambacher
wnordmann@gmx.de
In reply to: Paul Ramsey (#8)
Re: autovacuum worker running amok - and me too ;)

Paul Ramsey wrote

Though maybe with a really big table? (with really big
objects?) Though still, doesn't analyze just pull a limited sample
(30K approx max) so why would table size make any difference after a
certain point?

Hi Paul, "my" table is quite big (about 293,049,000 records), but the objects
are not.

nodes[] contains at most 2000 bigints, and tags[] up to a few hundred
characters, sometimes a few thousand.

Watching the memory usage of the autovacuum process: it was getting bigger
and bigger at a nearly constant rate, some MB per minute, iirc.

I'm just recreating planet_osm_ways_nodes without "fastupdate=off".
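Recreating that index might look roughly like this (a sketch; the index and column names are assumed from the osm2pgsql schema, not quoted in this thread):

```sql
DROP INDEX IF EXISTS planet_osm_ways_nodes;

-- Default behaviour: fastupdate = on, so new entries go to a pending list
-- that is merged into the main GIN structure at VACUUM/autoanalyze time:
CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways USING gin (nodes);

-- The earlier variant forced immediate index updates instead:
-- CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways
--     USING gin (nodes) WITH (fastupdate = off);
```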

regards
walter



#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: wambacher (#9)
Re: autovacuum worker running amok - and me too ;)

wambacher <wnordmann@gmx.de> wrote:

watching the memory usage of the autovacuum process: it was getting bigger
and bigger at a nearly constant rate, some MB per minute, iirc.

What are your settings for maintenance_work_mem and autovacuum_max_workers?
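For reference, those values (and, on 9.4 or later, the autovacuum-specific override) can be read from psql:

```sql
SHOW maintenance_work_mem;    -- memory cap for VACUUM, CREATE INDEX, etc.
SHOW autovacuum_max_workers;  -- max number of concurrent autovacuum workers

-- 9.4+: per-worker cap for autovacuum; -1 means "use maintenance_work_mem"
SHOW autovacuum_work_mem;
```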

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 wambacher
wnordmann@gmx.de
In reply to: Kevin Grittner (#10)
Re: autovacuum worker running amok - and me too ;)

Hi,

in my first post you can see all the params:

maintenance_work_mem = 64MB and two workers. I configured my system to the
absolute minimum (I have 24GB of real memory) and the problem was still
there.

Last night I rebuilt one index (122GB in size), and just now I started a
manual "analyze verbose planet_osm_ways" to see what's happening.
This will take some hours.

Regards
walter



#12 wambacher
wnordmann@gmx.de
In reply to: wambacher (#11)
Re: autovacuum worker running amok - and me too ;)

... this will need some hours.

Done after 30 minutes :)

Nearly 50% dead rows - strange.

Now I'll run a "vacuum verbose planet_osm_ways", because the system crashed
during the autovacuum.

Fingers crossed.

Walter



#13 wambacher
wnordmann@gmx.de
In reply to: wambacher (#12)
Re: autovacuum worker running amok - and me too ;)

crashed:

no idea what to do now.

walter



#14 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: wambacher (#13)
Re: autovacuum worker running amok - and me too ;)

On 3/5/15 2:06 PM, wambacher wrote:

crashed:

no idea what to do now.

Crashed? Or hit by the OOM killer? What's the log say?

While this is going on you might as well disable autovac for that table.
It'll keep crashing and will interfere with your manual vacuums.
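Per-table autovacuum is controlled via a storage parameter, e.g.:

```sql
-- Disable autovacuum (and auto-analyze) for just this table while debugging:
ALTER TABLE planet_osm_ways SET (autovacuum_enabled = off);

-- Re-enable it afterwards:
-- ALTER TABLE planet_osm_ways RESET (autovacuum_enabled);
```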

It sounds at this point like the problem is in vacuuming, not analyze.
Can you confirm? If so, please forgo analyzing the table until we can
get vacuum figured out.

What's the largest memory size that a vacuum/autovac against that table
gets to, compared to other backends? You mentioned 80-90% of memory
before, but I don't know if that was for analyze or what.

I wonder if we have some kind of memory leak in GIN's vacuum support...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#15 wambacher
wnordmann@gmx.de
In reply to: Jim Nasby (#14)
Re: autovacuum worker running amok - and me too ;)

Jim Nasby-5 wrote

On 3/5/15 2:06 PM, wambacher wrote:
Crashed? Or hit by the OOM killer? What's the log say?

killed by OOM, but it had only 1.2GB of memory, which is OK to me.

While this is going on you might as well disable autovac for that table.
It'll keep crashing and will interfere with your manual vacuums.

I did this morning; the crash happened while running "vacuum verbose
planet_osm_ways" from the CLI.

It sounds at this point like the problem is in vacuuming, not analyze.
Can you confirm? If so, please forgo analyzing the table until we can
get vacuum figured out.

Yes, it's the vacuum task.

What's the largest memory size that a vacuum/autovac against that table
gets to, compared to other backends? You mentioned 80-90% of memory
before, but I don't know if that was for analyze or what.

vacuum

I wonder if we have some kind of memory leak in GIN's vacuum support...

Maybe.

At least I did:

- dropped the GIN index
- CLUSTER
- ANALYZE
- VACUUM

all without any problems.
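In SQL, that sequence would be roughly the following (a sketch; the index name and clustering index are assumed, not quoted here):

```sql
DROP INDEX planet_osm_ways_nodes;                    -- the big GIN index
CLUSTER planet_osm_ways USING planet_osm_ways_pkey;  -- rewrites the table, dropping dead rows
ANALYZE planet_osm_ways;
VACUUM planet_osm_ways;
```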

Now I'll add the index again and tomorrow do another vacuum by hand.

2:30 in Germany, feeling tired ;)

Regards
walter



#16 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: wambacher (#15)
Re: autovacuum worker running amok - and me too ;)

On 3/5/15 7:36 PM, wambacher wrote:

Jim Nasby-5 wrote

On 3/5/15 2:06 PM, wambacher wrote:
Crashed? Or hit by the OOM killer? What's the log say?

killed by OOM, but has only 1.2 GB mem, which is ok to me.

Ok, but...

What's the largest memory size that a vacuum/autovac against that table
gets to, compared to other backends? You mentioned 80-90% of memory
before, but I don't know if that was for analyze or what.

vacuum

Which is it? Is the vacuum process using 1.2GB (5% of memory), or is
it using 90% (~22GB)?

BTW, with 1GB shared buffers and 64MB maintenance_work_mem top reporting
a size of 1.2GB doesn't surprise me at all (assuming it's including
shared memory in there).

This is starting to sound like a regular OOM problem. Have you tried the
steps in
http://postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#17 wambacher
wnordmann@gmx.de
In reply to: Jim Nasby (#16)
Re: autovacuum worker running amok - and me too ;)

Jim Nasby-5 wrote

Which is it? Is the vacuum process using 1.2GB (5% of memory), or is
it using 90% (~22GB)?

I ran the job 2-3 times.

- First with 18GB of swap too. I heard it thrashing, performance went down
extremely, and after 2 hours I killed the job (rebooted the system; there
was no other way to do it).

- Next without swap: I monitored the system with htop, and the vacuum task
was getting bigger and bigger until OOM killed it. VIRT was at about 20.x GB,
MEM% at 80-90%.

At this point I called for help.

- Next: rebuilt the GIN index without fastupdate=off, to use the default.
- vacuum planet_osm_ways on the console
- VIRT about 1.2GB, MEM% about 3.4% in htop
- crashed again; the attached system logs say OOM killed it, but it was
using about 1.2GB, which is fine to me (and you)

- dropped index, clustered, vacuumed --> no problems
- recreating the GIN index is still running. 96/121GB, some hours to go.

Waiting for the next test.

BTW, with 1GB shared buffers and 64MB maintenance_work_mem top reporting
a size of 1.2GB doesn't surprise me at all (assuming it's including
shared memory in there).

This is starting to sound like a regular OOM problem. Have you tried the
steps in
http://postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

not yet, but i'll check it right now.

Regards
walter



#18 wambacher
wnordmann@gmx.de
In reply to: wambacher (#17)
Re: autovacuum worker running amok - and me too ;)

Hi,

while waiting for the index (104/121GB), I read the document
http://www.postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
and will make some changes before the next analyze.

Some comments:

- the OOM killer did not kill the postmaster but the analyze job.
- I started with 24GB real memory and 18GB swap; that must be enough! --> killed
- I will reduce shared_buffers from 1GB to 512MB.
- I will reduce max_connections to 100 (though no user was active; the
applications are down).
- I will do "sysctl -w vm.overcommit_memory=2" (just did it).
- I may do "echo -1000 > /proc/self/oom_score_adj", but only if the other
actions fail.
- The last steps for older kernels are not relevant; I'm running Ubuntu
14.04 LTS, which has kernel 3.13.
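To make the overcommit setting survive a reboot, it can also go into a sysctl fragment (a sketch; the file name is arbitrary, and the ratio shown is the kernel default, to be tuned):

```
# /etc/sysctl.d/30-overcommit.conf  (apply with "sysctl --system" or reboot)
# Mode 2: refuse allocations instead of overcommitting and OOM-killing later.
vm.overcommit_memory = 2
# With mode 2, the commit limit is swap + overcommit_ratio% of RAM:
vm.overcommit_ratio = 50
```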

I'll keep you informed.

regards
walter



#19 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: wambacher (#18)
Re: autovacuum worker running amok - and me too ;)

On Fri, Mar 06, 2015 at 02:39:34AM -0700, wambacher wrote:

some comments:

- the OOM did not kill the Postmaster but the Analyze-Job.
- started with 24GB real and 18GB Swap - that must be enought! --> killed

"Back in the days" it was conventional wisdom to have twice
as much swap as you have physical memory. Of course, I am
not suggesting that you provide 48GB of swap and your problem is
magically solved, _but_ one thing we might take away from that
old adage is that one might "hope things to work better"
(say, while debugging) if there is at least as much swap as
there is physical RAM, based on the naive assumption that in
this case '"everything" can be swapped out'.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346


#20 wambacher
wnordmann@gmx.de
In reply to: Karsten Hilbert (#19)
Re: autovacuum worker running amok - and me too ;)

Karsten Hilbert wrote

Of course, I am
not suggesting you provide 48GB of swap and your problem is
magically solved _but_ one thing we might take away from that
old adage is that one might "hope things to work better"
(say, while debugging) if there is at least as much swap as
there is physical RAM based on the naive assumption that in
this case '"everything" can be swapped out'.

No problem at all, I've got terabytes of free disk space.



#21 wambacher
wnordmann@gmx.de
In reply to: wambacher (#18)
#22 wambacher
wnordmann@gmx.de
In reply to: wambacher (#20)
#23 wambacher
wnordmann@gmx.de
In reply to: Paul Ramsey (#8)
#24 wambacher
wnordmann@gmx.de
In reply to: wambacher (#23)
#25 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: wambacher (#23)