problem with large maintenance_work_mem settings and CREATE INDEX
Hi all!
while playing on a new box I noticed that PostgreSQL does not seem to be
able to cope with very large settings for maintenance_work_mem.
For a test I created a single table with 5 integer columns containing
about 1.8B rows (about 300M distinct values in the column I want to index):
foo=# select count(*) from testtable;
count
------------
1800201755
(1 row)
I tried to create an index on one of the columns:
foo=# SET maintenance_work_mem to 4000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);
ERROR: invalid memory alloc request size 1073741824
foo=# SET maintenance_work_mem to 3000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);
ERROR: invalid memory alloc request size 1073741824
the error is generated pretty fast (a few seconds into the CREATE INDEX);
however:
foo=# SET maintenance_work_mem to 2000000;
SET
foo=# CREATE INDEX a_idx ON testtable(a);
is running now for about 10 hours with nearly no IO but pegging the
CPU core it is running on at a constant 100%.
Watching the process while this happens seems to indicate that the above
error occurs after the backend exceeds about 3.1GB in resident size.
The box in question is a dual Opteron 275 (4 cores @ 2.2GHz) with 16GB of
RAM and 24GB of swap. OS is Debian Sarge/AMD64 with a pure 64-bit userland.
Stefan
Stefan Kaltenbrunner wrote:
> while playing on a new box I noticed that PostgreSQL does not seem to be
> able to cope with very large settings for maintenance_work_mem.
> [...]
> foo=# SET maintenance_work_mem to 4000000;
> SET
> foo=# CREATE INDEX a_idx ON testtable(a);
> ERROR: invalid memory alloc request size 1073741824
> [...]
> watching the process while this happens seems to indicate that the above
> error occurs after the backend exceeds about 3.1GB in resident size.
forgot to mention that this is 8.1.3 compiled from source. Further
testing shows that it is not only CREATE INDEX that has issues with large
maintenance_work_mem settings:
foo=# set maintenance_work_mem to 2000000;
SET
foo=# VACUUM ANALYZE verbose;
INFO: vacuuming "information_schema.sql_features"
ERROR: invalid memory alloc request size 2047999998
Stefan
On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> forgot to mention that this is 8.1.3 compiled from source. Further
> testing shows that not only CREATE INDEX has some issue with large
> maintenance_work_mem settings:
what does it show:
cat /proc/sys/kernel/shmmax
?
depesz
hubert depesz lubaczewski wrote:
> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>> forgot to mention that this is 8.1.3 compiled from source. Further
>> testing shows that not only CREATE INDEX has some issue with large
>> maintenance_work_mem settings:
> what does it show:
> cat /proc/sys/kernel/shmmax
1421326592
not that I think it is related to the problem at all. It looks like I'm
hitting the MaxAllocSize limit in src/include/utils/memutils.h.
Stefan
Stefan Kaltenbrunner wrote:
> hubert depesz lubaczewski wrote:
>> what does it show:
>> cat /proc/sys/kernel/shmmax
> 1421326592
> not that I think it is related to the problem at all.
I can second that. Maintenance work mem is not allocated in shared memory.
> It looks like I'm hitting the MaxAllocSize limit in
> src/include/utils/memutils.h.
There are two issues you have mentioned. This one is more obvious: the
limitation of the memory that can be allocated.
The other one is the very bad index creation performance. I can only
guess, but it sounds like this is related to the recent discussion on
hackers about issues with qsort performance. If the theory is true, your
index creation should be much faster with a much lower setting for
maintenance_work_mem, so that it uses an external sort.
See the discussion starting here:
http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php
Best Regards,
Michael Paesold
Stefan Kaltenbrunner wrote:
> hubert depesz lubaczewski wrote:
>> what does it show:
>> cat /proc/sys/kernel/shmmax
> 1421326592
> not that I think it is related to the problem at all. It looks like I'm
> hitting the MaxAllocSize limit in src/include/utils/memutils.h.
just tried to increase this limit to 4GB (from the default 1GB) and this
seems to help a fair bit (i.e. CREATE INDEX and VACUUM now seem to work
with much higher maintenance_work_mem settings).
BUT: VACUUM VERBOSE (or VACUUM ANALYZE VERBOSE) breaks (with a rather
gigantic allocation request ;-)):
foo=# VACUUM VERBOSE;
INFO: vacuuming "information_schema.sql_features"
ERROR: invalid memory alloc request size 18446744073709551615
Stefan
Stefan Kaltenbrunner wrote:
> foo=# set maintenance_work_mem to 2000000;
> SET
> foo=# VACUUM ANALYZE verbose;
> INFO: vacuuming "information_schema.sql_features"
> ERROR: invalid memory alloc request size 2047999998
Just an FYI, I reported a similar problem on my 8.0.0 database a few
weeks ago. I upgraded to 8.1.3 and just recently got that message again
in a nightly maintenance email.
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming of database "postgres" failed: ERROR: out of memory
DETAIL: Failed on request of size 167772156.
The interesting thing is that there aren't any tables in the postgres
database. This happened Thursday night, but vacuum ran fine on Friday
night. I'm on vacation right now, so I can't really look into it much
deeper at the moment, but I thought I would mention it.
Matt
Michael Paesold wrote:
> Stefan Kaltenbrunner wrote:
>> It looks like I'm hitting the MaxAllocSize limit in
>> src/include/utils/memutils.h.
> There are two issues you have mentioned. This one is more obvious: the
> limitation of the memory that can be allocated.
yes
> The other one is the very bad performance for index creation. I can only
> guess, but it sounds like this is related to the recent discussion on
> hackers about issues with qsort performance. If the theory is true,
> your index creation should be much faster with a much lower setting for
> maintenance_work_mem, so that it uses an external sort.
> See the discussion starting here:
> http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php
I was following this thread - and it was partly the reason why I'm
playing with this (the CREATE INDEX on that table finished after about
12 hours with a bit less than 2GB of maintenance_work_mem; for
comparison, it took me only about 2.5 hours to create the table).
I'm currently testing how long it takes with very low settings, to force
an external sort.
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> not that I think it is related to the problem at all. It looks like I'm
> hitting the MaxAllocSize limit in src/include/utils/memutils.h.
> just tried to increase this limit to 4GB (from the default 1GB) and this
> seems to help a fair bit.
s/help a fair bit/break a whole lot of stuff/
There are reasons for that limit, and you can't just arbitrarily
rejigger it.
The sorting code probably needs a defense to keep it from trying to
exceed MaxAllocSize for the SortObject array; AFAIR there is no such
consideration there now, but it's easily added. I'm not sure where your
VACUUM failure is coming from though --- can you get a back trace from
the errfinish call in that case?
regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> not that I think it is related to the problem at all. It looks like I'm
>> hitting the MaxAllocSize limit in src/include/utils/memutils.h.
>> just tried to increase this limit to 4GB (from the default 1GB) and
>> this seems to help a fair bit.
> s/help a fair bit/break a whole lot of stuff/
> There are reasons for that limit, and you can't just arbitrarily
> rejigger it.
heh - sure, this is just a test box, so it was worth a try, and I don't
care about the data anyway ...
> The sorting code probably needs a defense to keep it from trying to
> exceed MaxAllocSize for the SortObject array; AFAIR there is no such
> consideration there now, but it's easily added. I'm not sure where your
> VACUUM failure is coming from though --- can you get a back trace from
> the errfinish call in that case?
like (with maintenance_work_mem set to 2000000):
(gdb) bt
#0 errfinish (dummy=0) at elog.c:310
#1 0x00000000005c6c93 in elog_finish (elevel=-4145840, fmt=0x84da50
"invalid memory alloc request size %lu")
at elog.c:931
#2 0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58,
size=2047999998) at mcxt.c:505
#3 0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0,
relblocks=6) at vacuumlazy.c:963
#4 0x00000000004dab33 in lazy_scan_heap (onerel=0x2ad69a589cc8,
vacrelstats=0x8de5b0, Irel=0x0, nindexes=0)
at vacuumlazy.c:240
#5 0x00000000004da9d1 in lazy_vacuum_rel (onerel=0x2ad69a589cc8,
vacstmt=0x8c0fd0) at vacuumlazy.c:157
#6 0x00000000004d7325 in vacuum_rel (relid=2589498568,
vacstmt=0x8c0fd0, expected_relkind=-27 '�')
at vacuum.c:1077
#7 0x00000000004d6990 in vacuum (vacstmt=0x8c0fd0, relids=0x0) at
vacuum.c:449
#8 0x000000000055e871 in PortalRunUtility (portal=0x8e0360,
query=0x8c0e00, dest=0x8c1050,
completionTag=0x7fffffc0c410 "") at pquery.c:987
#9 0x000000000055eb07 in PortalRunMulti (portal=0x8e0360,
dest=0x8c1050, altdest=0x8c1050,
completionTag=0x7fffffc0c410 "") at pquery.c:1054
#10 0x000000000055e28f in PortalRun (portal=0x8e0360,
count=9223372036854775807, dest=0x8c1050,
altdest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:665
#11 0x000000000055a3a1 in exec_simple_query (query_string=0x8c0cf0
"VACUUM VERBOSE;") at postgres.c:1002
#12 0x000000000055cc2c in PostgresMain (argc=4, argv=0x84c078,
username=0x84c040 "postgres") at postgres.c:3217
#13 0x0000000000538a71 in BackendRun (port=0x86add0) at postmaster.c:2853
#14 0x0000000000538550 in BackendStartup (port=0x86add0) at
postmaster.c:2497
#15 0x0000000000536b4d in ServerLoop () at postmaster.c:1230
#16 0x0000000000535fcf in PostmasterMain (argc=3, argv=0x8493c0) at
postmaster.c:941
#17 0x00000000004fcaa5 in main (argc=3, argv=0x8493c0) at main.c:265
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> The sorting code probably needs a defense to keep it from trying to
>> exceed MaxAllocSize for the SortObject array; AFAIR there is no such
>> consideration there now, but it's easily added. I'm not sure where your
>> VACUUM failure is coming from though --- can you get a back trace from
>> the errfinish call in that case?
> like (with maintenance_work_mem set to 2000000):
> (gdb) bt
> #0 errfinish (dummy=0) at elog.c:310
> [...]
> #2 0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58,
>    size=2047999998) at mcxt.c:505
> #3 0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0,
>    relblocks=6) at vacuumlazy.c:963
Oh, OK, it's lazy_space_alloc's fault --- we need to fix that to not try
to allocate chunks larger than MaxAllocSize. Will do ... thanks for the
report.
regards, tom lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> forgot to mention that this is 8.1.3 compiled from source.
>> See the discussion starting here:
>> http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php
> I was following this thread - and it was partly the reason why I'm
> playing with this (the CREATE INDEX on that table finished after about
> 12 hours with a bit less than 2GB of maintenance_work_mem; for
> comparison, it took me only about 2.5 hours to create the table).
It would be interesting to try the same test with CVS tip to see if the
sorting improvements Simon and I made over the past few weeks help much.
regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I was following this thread - and it was partly the reason why I'm
>> playing with this (the CREATE INDEX on that table finished after about
>> 12 hours with a bit less than 2GB of maintenance_work_mem).
> It would be interesting to try the same test with CVS tip to see if the
> sorting improvements Simon and I made over the past few weeks help much.
playing with CVS tip right now; it is a bit faster for both the initial
bulk loading (about 5%) and for the CREATE INDEX itself (11h30min vs
11h54min), though not a dramatic improvement.
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples % symbol name
> 24915704 96.2170 ltsReleaseBlock
> 387265 1.4955 LogicalTapeRead
> 168725 0.6516 inlineApplySortFunction
Hmm ... the comment in ltsReleaseBlock sez
/*
* Insert blocknum into array, preserving decreasing order (so that
* ltsGetFreeBlock returns the lowest available block number). This could
* get fairly slow if there were many free blocks, but we don't expect
* there to be very many at one time.
*/
We probably need to tweak things so this doesn't get called during the
"final merge" pass. Looking at it now.
regards, tom lane
On Sun, 2006-03-05 at 15:15 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples % symbol name
>> 24915704 96.2170 ltsReleaseBlock
>> 387265 1.4955 LogicalTapeRead
>> 168725 0.6516 inlineApplySortFunction
> Hmm ... the comment in ltsReleaseBlock sez
> /*
>  * Insert blocknum into array, preserving decreasing order (so that
>  * ltsGetFreeBlock returns the lowest available block number). This could
>  * get fairly slow if there were many free blocks, but we don't expect
>  * there to be very many at one time.
>  */
> We probably need to tweak things so this doesn't get called during the
> "final merge" pass. Looking at it now.
OK. I also had a report of poor performance that was isolated to the
final merge pass and sucked quite badly; sounds like you've located the
cause.
Best Regards, Simon Riggs
I wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples % symbol name
>> 24915704 96.2170 ltsReleaseBlock
> We probably need to tweak things so this doesn't get called during the
> "final merge" pass. Looking at it now.
I've committed a fix for this into CVS HEAD --- please try it out.
regards, tom lane
Tom Lane wrote:
> I wrote:
>> We probably need to tweak things so this doesn't get called during the
>> "final merge" pass. Looking at it now.
> I've committed a fix for this into CVS HEAD --- please try it out.
just tried that with CVS HEAD (includes the second fix too):
CREATE INDEX on a 1.8B row table (5 int columns; index created on the
first column, about 300M distinct values):
before: 11h 51min
after: 3h 11min (!)
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> CREATE INDEX on a 1.8B row table (5 int columns; index created on the
> first column, about 300M distinct values):
> before: 11h 51min
> after: 3h 11min (!)
Cool. Does it seem to be I/O bound now? Would you be willing to do it
over with oprofile turned on?
regards, tom lane
On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> CREATE INDEX on a 1.8B row table (5 int columns; index created on the
>> first column, about 300M distinct values):
>> before: 11h 51min
>> after: 3h 11min (!)
> Cool. Does it seem to be I/O bound now? Would you be willing to do it
> over with oprofile turned on?
Very.
Any chance of trying it with different maintenance_work_mem settings?
Did you try this with trace_sort=on? If so could we get the logs for
that?
[Results welcome from other hackers...particularly with regard to
queries with sort steps in rather than CREATE INDEX.]
Best Regards, Simon Riggs
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> CREATE INDEX on a 1.8B row table (5 int columns; index created on the
>> first column, about 300M distinct values):
>> before: 11h 51min
>> after: 3h 11min (!)
> Cool. Does it seem to be I/O bound now? Would you be willing to do it
> over with oprofile turned on?
while it now does a fair amount of IO during the whole operation, it is
not yet I/O bound, AFAICS.
profile:
samples % symbol name
103520432 47.9018 inlineApplySortFunction
33382738 15.4471 comparetup_index
25296438 11.7054 tuplesort_heap_siftup
10089122 4.6685 btint4cmp
8395676 3.8849 LogicalTapeRead
2873556 1.3297 tuplesort_heap_insert
2796545 1.2940 tuplesort_gettuple_common
2752345 1.2736 AllocSetFree
2233889 1.0337 IndexBuildHeapScan
2035265 0.9418 heapgettup
1571035 0.7270 LWLockAcquire
1498800 0.6935 readtup_index
1213587 0.5616 index_form_tuple
1097172 0.5077 AllocSetAlloc
1056964 0.4891 heap_fill_tuple
1041172 0.4818 btbuildCallback
990005 0.4581 LWLockRelease
897662 0.4154 slot_deform_tuple
858527 0.3973 LogicalTapeWrite
806849 0.3734 PageAddItem
764136 0.3536 LockBuffer
trace_sort:
LOG: begin index sort: unique = f, workMem = 2048000, randomAccess = f
LOG: switching to external sort with 7315 tapes: CPU 4.07s/13.70u sec
elapsed 17.79 sec
LOG: finished writing run 1 to tape 0: CPU 240.07s/3926.66u sec elapsed
4498.49 sec
LOG: performsort starting: CPU 535.66s/8138.92u sec elapsed 9435.11 sec
LOG: finished writing final run 2 to tape 1: CPU 538.54s/8242.23u sec
elapsed 9541.55 sec
LOG: performsort done (except final merge): CPU 539.39s/8254.83u sec
elapsed 9559.75 sec
LOG: external sort ended, 4398827 disk blocks used: CPU
768.38s/10027.39u sec elapsed 11884.63 sec
Stefan