random observations while testing with a 1.8B row table
Hi all!
During my testing of large work_mem and maintenance_work_mem settings
with respect to CREATE INDEX and sorting, I encountered a number of
things while doing various operations on such a large table (about 106GB
on disk with no dead tuples).
I will summarize some of them just in case somebody is interested:
-> table used has 5 integer columns, non-indexed during the loads (a
matching definition is sketched below)
-> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
and data.
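A minimal sketch of a matching table definition (only column "a" is named
later in the thread, via the UPDATE in test 2; the other column names are
assumptions):

    CREATE TABLE testtable (
        a integer,   -- the column updated in test 2 below
        b integer,
        c integer,
        d integer,
        e integer
    );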
1. data loading - I'm using COPY with batches of 300M rows:
*) with one COPY running it takes about 20 minutes/batch to load the data
(~250k rows/sec) and virtually no context switches.
*) with two COPYs running concurrently it takes a bit less than 30
minutes/batch at a steady 40k context switches/sec (~340k rows/sec overall)
*) with three COPYs it takes about 40 minutes/batch at 140k context
switches/sec (380k rows/sec overall)
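Each batch was loaded with a plain COPY along these lines (the file path
here is a placeholder, not the one actually used):

    COPY testtable FROM '/data/load/batch_01.dat';  -- one 300M-row batch

The two- and three-stream numbers above are simply this statement running
concurrently in separate sessions against different batch files.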
the profiles for those runs look very similar to:
samples % symbol name
5065118 20.9607 XLogInsert
3496868 14.4709 DoCopy
2807313 11.6174 CopyReadLine
1373621 5.6844 PageAddItem
1227069 5.0779 heap_formtuple
1193319 4.9383 LWLockAcquire
894243 3.7006 hash_search
717427 2.9689 LWLockRelease
699359 2.8941 pg_atoi
691385 2.8611 FunctionCall3
640383 2.6501 heap_insert
579331 2.3974 int4in
411286 1.7020 AllocSetReset
376452 1.5579 hash_any
349220 1.4452 RelationGetBufferForTuple
261568 1.0824 AllocSetAlloc
257511 1.0656 ReadBuffer
While the amount of IO going on is quite a lot, it looks like we are
still mostly CPU-bound for COPY.
2. updating all of the rows in the table:
I updated all of the rows in the table with a simple UPDATE testtable
set a=a+1;
this took about 2.5 hours (~200k rows/sec)
with a profile looking like:
samples % symbol name
27860285 26.5844 XLogInsert
4828077 4.6070 PageAddItem
4490535 4.2849 heap_update
4267647 4.0722 slot_deform_tuple
3996750 3.8137 LWLockAcquire
3716184 3.5460 slot_getattr
3454679 3.2965 hash_search
2998742 2.8614 hash_any
2909261 2.7760 heap_fill_tuple
2825256 2.6959 LWLockRelease
2283086 2.1785 LockBuffer
2135048 2.0373 ExecTargetList
1636017 1.5611 ExecEvalVar
1632377 1.5576 UnpinBuffer
1566087 1.4944 RelationGetBufferForTuple
1561378 1.4899 ExecMakeFunctionResultNoSets
1511366 1.4421 ReadBuffer
1381614 1.3183 heap_compute_data_size
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.
It seems that the heap-scan part of VACUUM FULL completed after about 2
hours, ending up with a postmaster having a resident size of about
8.5GB(!!!) with maintenance_work_mem set to 1GB.
profile for this stage looks like:
samples % symbol name
941058 26.0131 scan_heap
444435 12.2852 HeapTupleSatisfiesVacuum
242117 6.6927 TransactionIdIsInProgress
220044 6.0825 _mdfd_getseg
212571 5.8760 hash_search
186963 5.1681 TransactionIdPrecedes
176016 4.8655 SetBufferCommitInfoNeedsSave
137668 3.8055 TransactionIdDidCommit
137068 3.7889 PageRepairFragmentation
111474 3.0814 TransactionLogFetch
103814 2.8697 LWLockAcquire
102925 2.8451 LWLockRelease
102456 2.8321 hash_any
67199 1.8575 BufferAlloc
after that the postmaster started slowly consuming more and more memory,
doing virtually no IO and eating CPU like mad, with a profile similar to:
samples % symbol name
2708391248 94.1869 repair_frag
155395833 5.4040 enough_space
5707137 0.1985 XLogInsert
1410703 0.0491 PageAddItem
691616 0.0241 BgBufferSync
I actually ended up canceling the VACUUM FULL after about 50 hours of
runtime, with a resident size of ~11.5GB.
Stefan
Stefan,
On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
I will summarize some of them just in case somebody is interested:
I am!
-> table used has 5 integer columns, non-indexed during the loads
-> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
and data.
How many connections out of the machine? How many disks behind each LUN?
So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
108GB/1.8B = 60 bytes per row. I wonder what all that overhead is?
1. data loading - I'm using COPY with batches of 300M rows:
*) with one COPY running it takes about 20 minutes/batch to load the data
(~250k rows/sec) and virtually no context switches.
*) with two COPYs running concurrently it takes a bit less than 30
minutes/batch at a steady 40k context switches/sec (~340k rows/sec overall)
*) with three COPYs it takes about 40 minutes/batch at 140k context
switches/sec (380k rows/sec overall)
So, from 15 MB/s up to about 20 MB/s.
While the amount of IO going on is quite a lot, it looks like we are
still mostly CPU-bound for COPY.
It's what we see almost always. In this case if your I/O configuration is
capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s,
you will be CPU limited.
The 3x is approximate and based on observations; the reasoning underneath
it is that Postgres writes the data several times, once to the WAL, then
from the WAL to the heap files.
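As rough arithmetic, using only figures already given in this thread (the
~60 bytes/row on-disk size and the two-stream loading rate):

    340,000 rows/s * 60 bytes/row ≈ 20 MB/s of parsed input
    20 MB/s * 3                   ≈ 60 MB/s of total disk traffic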
2. updating all of the rows in the table:
I updated all of the rows in the table with a simple UPDATE testtable
set a=a+1;
this took about 2.5 hours (~200k rows/sec)
Ugh. This is where Bizgres MPP shines, I'll try to recreate your test and
post results. This scales linearly in Bizgres MPP with the number of disks
and CPUs available, but I would hope for much more than that.
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.
Simple vacuum should be enough IMO.
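i.e. something along these lines - a plain (lazy) VACUUM reclaims the dead
space for reuse without rewriting the table (the ANALYZE is added here as
routine good practice, not something required for the space reuse):

    VACUUM ANALYZE testtable;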
- Luke
Luke Lonergan wrote:
Stefan,
On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
I will summarize some of them just in case somebody is interested:
I am!
heh - not surprised :-)
-> table used has 5 integer columns, non-indexed during the loads
-> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
and data.

How many connections out of the machine? How many disks behind each LUN?
2 HBAs in the server, 2x2 possible paths to each LUN.
6 disks for the WAL and 12 disks for the data
So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
108GB/1.8B = 60 bytes per row. I wonder what all that overhead is?

1. data loading - I'm using COPY with batches of 300M rows:
*) with one COPY running it takes about 20 minutes/batch to load the data
(~250k rows/sec) and virtually no context switches.
*) with two COPYs running concurrently it takes a bit less than 30
minutes/batch at a steady 40k context switches/sec (~340k rows/sec overall)
*) with three COPYs it takes about 40 minutes/batch at 140k context
switches/sec (380k rows/sec overall)

So, from 15 MB/s up to about 20 MB/s.
While the amount of IO going on is quite a lot, it looks like we are
still mostly CPU-bound for COPY.

It's what we see almost always. In this case if your I/O configuration is
capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s,
you will be CPU limited.
the IO-System I use should be capable of doing that if pushed hard
enough :-)
The 3x is approximate and based on observations; the reasoning underneath
it is that Postgres writes the data several times, once to the WAL, then
from the WAL to the heap files.

2. updating all of the rows in the table:
I updated all of the rows in the table with a simple UPDATE testtable
set a=a+1;
this took about 2.5 hours (~200k rows/sec)

Ugh. This is where Bizgres MPP shines, I'll try to recreate your test and
post results. This scales linearly in Bizgres MPP with the number of disks
and CPUs available, but I would hope for much more than that.
interesting to know, but still I'm testing/playing with postgresql here
not bizgres MPP ...
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.

Simple vacuum should be enough IMO.
sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted by
dead tuples)
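For the record, that would look something like the following (the index
name is made up, and since the table carries no index during the loads one
would have to be created first - CLUSTER needs an index to order by; this
is the 8.1-era syntax):

    CREATE INDEX testtable_a_idx ON testtable (a);
    CLUSTER testtable_a_idx ON testtable;  -- rewrites the table,
                                           -- leaving the dead tuples behind
    ANALYZE testtable;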
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.

sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted by
dead tuples)
Yeah, the VACUUM FULL algorithm is really designed for situations where
just a fraction of the rows have to be moved to re-compact the table.
It might be interesting to teach it to abandon that plan and go to a
CLUSTER-like table rewrite once the percentage of dead space is seen to
reach some suitable level. CLUSTER has its own disadvantages though
(2X peak disk space usage, doesn't work on core catalogs, etc).
regards, tom lane
Stefan,
On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
2 HBAs in the server, 2x2 possible paths to each LUN.
6 disks for the WAL and 12 disks for the data
So - you have 18 disks worth of potential bandwidth, not factoring loss due
to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that into
four banks, one for each CPU and made each one RAID5 and left two disks for
spares, you'd have 12 disks working for you at 720MB/s, which is possibly
double the number of active FC channels you have, unless they are all
active, in which case you have a nicely matched 800MB/s of FC.
So, from 15 MB/s up to about 20 MB/s.
Gee - seems a long distance from 700 MB/s potential :-)
the IO-System I use should be capable of doing that if pushed hard
enough :-)
I would expect some 10x this if configured well.
interesting to know, but still I'm testing/playing with postgresql here
not bizgres MPP ...
Sure. Still, what I'd expect is something like 10x this update rate using
the parallelism buried in your hardware.
If you configure the same machine with 4 Bizgres MPP segments running on 4
LUNs I think you'd be shocked at the speedups.
- Luke
On Mar 10, 2006, at 11:54 AM, Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.

sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted by
dead tuples)

Yeah, the VACUUM FULL algorithm is really designed for situations where
just a fraction of the rows have to be moved to re-compact the table.
It might be interesting to teach it to abandon that plan and go to a
CLUSTER-like table rewrite once the percentage of dead space is seen to
reach some suitable level. CLUSTER has its own disadvantages though
(2X peak disk space usage, doesn't work on core catalogs, etc).
I get bitten by this quite often (customer machines, one giant table,
purge out a lot of old data).
CLUSTER is great for that, given the headroom, though I've often
resorted to a dump and restore because I've not had the headroom
for cluster, and it's a lot less downtime than a full vacuum.
While the right fix there is to redo the application engine side to use
table partitioning, I keep wondering whether it would be possible to move
rows near the end of the table to the beginning in one, non-locking phase
(vacuum to populate the FSM with free space near the beginning of the
table, touch rows starting at the end of the table, repeat) and then
finish off with a vacuum full to tidy up the remainder and truncate the
files (or a simpler "lock the table and truncate anything unused at the
end").
Cheers,
Steve
Luke Lonergan wrote:
Stefan,
On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
2 HBAs in the server, 2x2 possible paths to each LUN.
6 disks for the WAL and 12 disks for the data

So - you have 18 disks worth of potential bandwidth, not factoring loss due
to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that into
four banks, one for each CPU and made each one RAID5 and left two disks for
spares, you'd have 12 disks working for you at 720MB/s, which is possibly
double the number of active FC channels you have, unless they are all
active, in which case you have a nicely matched 800MB/s of FC.
wrong (or rather extremely optimistic): the array itself only has two
(redundant) FC loops (@2Gb) to the attached expansion chassis. The array
has 2 active/active controllers (with a failover penalty) with two host
interfaces each; furthermore it has write-cache mirroring (to the standby
controller) enabled, which means the traffic has to go over the internal
FC loop too.
beside that, the host (as I said) itself only has two HBAs @2Gb which are
configured for failover, which limits the host's maximum available
bandwidth to less than 200MB/s per LUN.
So, from 15 MB/s up to about 20 MB/s.
Gee - seems a long distance from 700 MB/s potential :-)
well the array is capable of about 110MB/s write per controller head (a
bit more than half of what would otherwise be possible, due to write-cache
mirroring being enabled, which uses delta-synchronisation).
WAL and data are on different controllers though by default.
the IO-System I use should be capable of doing that if pushed hard
enough :-)

I would expect some 10x this if configured well.

see above ...
interesting to know, but still I'm testing/playing with postgresql here
not bizgres MPP ...

Sure. Still, what I'd expect is something like 10x this update rate using
the parallelism buried in your hardware.

If you configure the same machine with 4 Bizgres MPP segments running on 4
LUNs I think you'd be shocked at the speedups.
that might be true, but though it might sound a bit harsh, I really prefer
to spend the small amount of spare time I have on testing (and helping to
improve if possible) postgresql rather than on playing with a piece of
commercial software I'm not going to use anyway ...
Stefan
Stefan,
On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
wrong (or rather extremely optimistic): the array itself only has two
(redundant) FC loops (@2Gb) to the attached expansion chassis. The array
has 2 active/active controllers (with a failover penalty) with two host
interfaces each; furthermore it has write-cache mirroring (to the standby
controller) enabled, which means the traffic has to go over the internal
FC loop too.
beside that, the host (as I said) itself only has two HBAs @2Gb which are
configured for failover, which limits the host's maximum available
bandwidth to less than 200MB/s per LUN.
Wow - the ickiness of SAN from a performance/value standpoint never ceases
to astound me.
Gee - seems a long distance from 700 MB/s potential :-)
well the array is capable of about 110MB/s write per controller head (a
bit more than half of what would otherwise be possible, due to write-cache
mirroring being enabled, which uses delta-synchronisation).
WAL and data are on different controllers though by default.
So - you're getting 20MB/s on loading from a potential of 200MB/s?
I would expect some 10x this if configured well.
see above ...
OTOH - configured well could include taking the disks out of the smart (?)
chassis, plugging them into a dumb chassis and deploying 2 dual channel U320
SCSI adapters - total cost of about $3,000.
that might be true, but though it might sound a bit harsh, I really prefer
to spend the small amount of spare time I have on testing (and helping to
improve if possible) postgresql rather than on playing with a piece of
commercial software I'm not going to use anyway ...
No problem - that's our job anyway - to make the case for Postgres' use in
typical large scale use-cases like the one you describe.
- Luke
Luke Lonergan wrote:
Stefan,
On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
wrong (or rather extremely optimistic): the array itself only has two
(redundant) FC loops (@2Gb) to the attached expansion chassis. The array
has 2 active/active controllers (with a failover penalty) with two host
interfaces each; furthermore it has write-cache mirroring (to the standby
controller) enabled, which means the traffic has to go over the internal
FC loop too.
beside that, the host (as I said) itself only has two HBAs @2Gb which are
configured for failover, which limits the host's maximum available
bandwidth to less than 200MB/s per LUN.

Wow - the ickiness of SAN from a performance/value standpoint never ceases
to astound me.
Well, while you make it sound a bit like that, performance is not
everything. One has to factor in manageability, scalability (in terms of
future upgrades using the same platform and such) and high-availability
features too. With that in mind a SAN (or a NAS - depends on the actual
use cases) suddenly looks much more interesting than plain old DASD.
Gee - seems a long distance from 700 MB/s potential :-)
well the array is capable of about 110MB/s write per controller head (a
bit more than half of what would otherwise be possible, due to write-cache
mirroring being enabled, which uses delta-synchronisation).
WAL and data are on different controllers though by default.

So - you're getting 20MB/s on loading from a potential of 200MB/s?
no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
concurrently.
I would expect some 10x this if configured well.
see above ...
OTOH - configured well could include taking the disks out of the smart (?)
chassis, plugging them into a dumb chassis and deploying 2 dual channel U320
SCSI adapters - total cost of about $3,000.
as I said above, even if that would work (it does not, because the disks
have FC connectors) I would lose a LOT of features, like being able to use
the SAN for more than a single host (a big one!), doing firmware upgrades
without downtime, using SAN replication, having cable lengths exceeding
12m (which makes it possible to place parts of the infrastructure at
remote sites), out-of-band management, scriptability(!), ...
Beside that, sequential IO, which you keep advocating everywhere, is NOT
the holy grail or the sole solution to a fast database.
While the SAN above really is not a screamer for that kind of application,
it is actually a very good performer (compared with some of the DASD-based
boxes) under heavy random IO and concurrent load.
This has a direct, measurable influence on the overall speed of our
production applications, which are mostly OLTP ;-)
that might be true, but though it might sound a bit harsh, I really prefer
to spend the small amount of spare time I have on testing (and helping to
improve if possible) postgresql rather than on playing with a piece of
commercial software I'm not going to use anyway ...

No problem - that's our job anyway - to make the case for Postgres' use in
typical large-scale use-cases like the one you describe.
yep
Stefan
Stefan,
On 3/11/06 12:21 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
So - you're getting 20MB/s on loading from a potential of 200MB/s?
no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
concurrently.
The numbers you published earlier show you are getting a maximum of 20MB/s
on data loading. It's CPU limited by Postgres COPY.
Beside that, sequential IO, which you keep advocating everywhere, is NOT
the holy grail or the sole solution to a fast database.
While the SAN above really is not a screamer for that kind of application,
it is actually a very good performer (compared with some of the DASD-based
boxes) under heavy random IO and concurrent load.
This has a direct, measurable influence on the overall speed of our
production applications, which are mostly OLTP ;-)
The same DASD can be configured with RAID10 and will far surpass the
external FC SAN configuration you describe at the same price.
The DASD story is not all about sequential I/O. The main limitation is the
number of devices you can make available using DASD, and that's a Postgres
limitation that we solve.
For OLTP apps, you may be fast enough with the limited bandwidth of your FC
SAN, but it would still be faster with hundreds of channels and disks.
- Luke
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.

sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted by
dead tuples)

Yeah, the VACUUM FULL algorithm is really designed for situations where
just a fraction of the rows have to be moved to re-compact the table.
It might be interesting to teach it to abandon that plan and go to a
CLUSTER-like table rewrite once the percentage of dead space is seen to
reach some suitable level. CLUSTER has its own disadvantages though
(2X peak disk space usage, doesn't work on core catalogs, etc).
hmm, very interesting idea - I like it myself, but from what I have seen,
people quite often use VACUUM FULL to get their disk usage down _because_
they are running low on space (and because it's not that well known that
CLUSTER could be much faster) - maybe we should add a note/hint about this
to the maintenance/vacuum docs at least?
Stefan
On Fri, 2006-03-10 at 12:23, Steve Atkins wrote:
I get bitten by this quite often (customer machines, one giant table,
purge out a lot of old data).

CLUSTER is great for that, given the headroom, though I've often resorted
to a dump and restore because I've not had the headroom for cluster, and
it's a lot less downtime than a full vacuum.

While the right fix there is to redo the application engine side to use
table partitioning, I keep wondering whether it would be possible to move
rows near the end of the table to the beginning in one, non-locking phase
(vacuum to populate the FSM with free space near the beginning of the
table, touch rows starting at the end of the table, repeat) and then
finish off with a vacuum full to tidy up the remainder and truncate the
files (or a simpler "lock the table and truncate anything unused at the
end").
At some point I had to compress a very busily updated table. I used the
following approach:
1) VACUUM buzytable; (lazy, not full)
2) SELECT primary_key_value, ctid FROM buzytable;
3) Extract the N last records from there, and for each one keep repeating
3A) UPDATE buzytable
SET primary_key_value = primary_key_value
WHERE primary_key_value = extracted_value;
3B) SELECT ctid FROM buzytable
WHERE primary_key_value = extracted_value;
until the tuple has moved to another page, hopefully nearer to
the beginning of the table.
Repeat from 1) until the page number of the last row (extracted from its
ctid) is smaller than some threshold.
This was the only way I was able to get a table back to a small enough
size without a service interruption.
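A concrete single-row iteration of 3A/3B might look like this (the key
value 12345 is a placeholder; the no-op UPDATE writes a new row version
into free space recorded by the earlier VACUUM, ideally on an earlier
page):

    -- 3A: no-op update forces a new row version to be written elsewhere
    UPDATE buzytable SET primary_key_value = primary_key_value
    WHERE primary_key_value = 12345;
    -- 3B: the page number is the first component of the ctid, e.g. (17,4)
    SELECT ctid FROM buzytable WHERE primary_key_value = 12345;

Once nothing live remains near the end, a plain VACUUM can truncate the
now-empty tail of the file.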
--------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
At some point I had to compress a very busily updated table. I used the
following approach:
[ move a few rows at a time ]
We could possibly do something similar with VACUUM FULL as well: once
maintenance_work_mem is filled, start discarding per-page data to stay
under the memory limit. This would mean that some area near the middle
of the table remains uncompacted, but it would allow putting an upper
bound on the time and space used by any one pass of VACUUM FULL ...
regards, tom lane
On Sat, Mar 11, 2006 at 10:21:43PM +0200, Hannu Krosing wrote:
table partitioning, I keep wondering whether it would be possible to move
rows near the end of the table to the beginning in one, non-locking phase
(vacuum to populate the FSM with free space near the beginning of the
table, touch rows starting at the end of the table, repeat) and then
finish off with a vacuum full to tidy up the remainder and truncate the
files (or a simpler "lock the table and truncate anything unused at the
end").

At some point I had to compress a very busily updated table. I used the
following approach:
1) VACUUM buzytable; (lazy, not full)
2) SELECT primary_key_value, ctid FROM buzytable;
3) Extract the N last records from there, and for each one keep repeating
3A) UPDATE buzytable
SET primary_key_value = primary_key_value
WHERE primary_key_value = extracted_value;
3B) SELECT ctid FROM buzytable
WHERE primary_key_value = extracted_value;
until the tuple has moved to another page, hopefully nearer to
the beginning of the table.
Repeat from 1) until the page number of the last row (extracted from its
ctid) is smaller than some threshold.
BTW, this is what the following TODO would hopefully fix:
Allow FSM to return free space toward the beginning of the heap file, in
hopes that empty pages at the end can be truncated by VACUUM
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461