Non-linear Performance
I'm noticing that performance in creating a particular index, and
also a little bit in a simple query, seems somewhat non-linear,
and I'm wondering if anybody could give me any clues as to what
might be causing this, and how I might fix it, if that's possible.
I've done a COPY in and index build on three data sets (just integer
data) consisting of 10m rows (500 MB table), 100m rows (5 GB table),
and 500m rows (25 GB table). (This is all on a 1.4 GHz P4, 512 MB RAM,
two 7200 RPM IDE drives, one for data and one for log.)
The COPY command to do the import is pretty linear, at about 230
sec., 2000 sec and 10,000 sec. for the 10m, 100m and 500m row
tables. Neither disk I/O nor CPU seem to be pushed particularly,
though I'm getting lots of small writes to the log files. I tried
using a larger checkpoint size, but it didn't seem to help any.
Recreating the primary key (on an INT, and of course all values
unique) took 123, 1300 and 6700 seconds, again pretty linear.
The next column is a DATE, which is the same value for the first
3.5M records, the next day for the next 3.5M records, and so on,
incrementing a day for every 3.5M records (in order). This index
build took about 60, 700 and 3500 seconds, respectively, again
linear.
But then I get to the next INT column which in every row is filled
in with a random value between 0 and 99,999. This index takes about
175, 3600, and 28,000 seconds, respectively, to generate.
So it takes about 2x as long per record going from 10m to 100m
records, and about 1.5x as long again per record when going from
100m to 500m records.
Queries using that index seem to do this too, though not quite as
badly. Using a very simple query such as "SELECT COUNT(*) FROM
table WHERE value = 12345" (where value is the last INT column
above that took ages to index), typical query times (including
connection overhead) for data not in the cache are 0.6 sec., 11
sec. and 72 sec.
This query, as expected, is completely dominated by random IO; the
disk the table is on sits there at 100% usage (i.e., disk requests
outstanding 100% of the time) and not much else is happening at all.
It does seem to do a few more disk transfers than I would
really expect. I get back a count of around 4000-5000, which to me
implies about 5000 reads plus the index reads (which one would
think would not amount to more than one or two hundred pages), yet
110 I/O requests per second times 70 seconds implies about 7000
reads. Is there something I'm missing here?
(If the query plan and analysis is any help, here it is:

Aggregate  (cost=22081.46..22081.46 rows=1 width=0)
           (actual time=70119.88..70119.88 rows=1 loops=1)
  ->  Index Scan using data_3_value on data_3
        (cost=0.00..22067.71 rows=5498 width=0)
        (actual time=38.70..70090.45 rows=4980 loops=1)
Total runtime: 70121.74 msec)
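A quick sanity check of the arithmetic in the message, using only the rough figures quoted above (the index page count in particular is a guess, not a measurement):

```python
# Figures quoted in the message above; the index page estimate is a guess.
matching_rows = 4980            # rows found by the index scan
heap_reads = matching_rows      # worst case: one heap page fetch per row
index_reads = 200               # generous guess for b-tree pages touched
expected = heap_reads + index_reads

observed = 110 * 70             # ~110 I/O requests/sec for ~70 seconds

print(expected)                 # 5180
print(observed)                 # 7700
```

The gap between ~5200 and ~7700 is the puzzle; though extrapolating from an IOPS meter is itself noisy, so the two may not be as far apart as they look.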
Anyway, I'm open to any thoughts on this. In particular, I'm open
to suggestions for cheap ways of dealing with this horrible random
I/O load. (Yeah, yeah, I know: disk array, and SCSI while I'm about
it. But I've been asked to get this sort of thing working fast on
systems much cheaper than the current Sun/Sybase/EMC or whatever
it is that they're using.)
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
I'm noticing that performance in creating a particular index, and
also a little bit in a simple query, seems somewhat non-linear,
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity.
Increasing SORT_MEM would help the constant factor, however...
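For reference, here is what an O(n*log(n)) sort alone would predict for the build-time ratios reported above (a rough check, nothing more):

```python
import math

def nlogn(n):
    return n * math.log(n)

sizes = [10_000_000, 100_000_000, 500_000_000]
observed = [175, 3600, 28_000]   # index build times (seconds) from the post

for i in range(1, len(sizes)):
    predicted = nlogn(sizes[i]) / nlogn(sizes[i - 1])
    actual = observed[i] / observed[i - 1]
    print(f"{sizes[i-1]:,} -> {sizes[i]:,}: "
          f"n*log(n) predicts {predicted:.1f}x, observed {actual:.1f}x")
```

The observed ratios (~20.6x and ~7.8x) grow faster than n*log(n) alone predicts (~11.4x and ~5.4x), which would be consistent with the sort spilling to disk and doing extra merge passes at the larger sizes.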
Queries using that index seem to do this too, though not quite as
badly. Using a very simple query such as "SELECT COUNT(*) FROM
table WHERE value = 12345" (where value is the last INT column
above that took ages to index), typical query times (including
connection overhead) for data not in the cache are 0.6 sec., 11
sec. and 72 sec.
I guess that the smaller datasets would get proportionally more benefit
from kernel disk caching.
It does seem to do a few more disk transfers than I would
really expect. I get back a count of around 4000-5000, which to me
implies about 5000 reads plus the index reads (which one would
think would not amount to more than one or two hundred pages), yet
110 I/O requests per second times 70 seconds implies about 7000
reads. Is there something I'm missing here?
Can you demonstrate that it actually did 7000 reads, and not 5000+?
That extrapolation technique doesn't look to me like it has the
accuracy to tell the difference. You might try setting show_query_stats
(note the results go into the postmaster log, not to the client;
perhaps we ought to change that someday).
Also, if you've updated the table at all, there might be some fetches of
dead tuples involved.
Anyway, I'm open to any thoughts on this. In particular, I'm open
to suggestions for cheap ways of dealing with this horrible random
I/O load.
More RAM, perhaps.
regards, tom lane
Tom Lane wrote:
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity. Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate? Pretend
the ONLY thing I care about is BTREE index creation time. (2 gigs of
RAM on my DB machine.) Disk IO is not a bottleneck.
-Pete
Peter A. Daly wrote:
Tom Lane wrote:
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity. Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
Pretend the ONLY thing I care about is BTREE index creation time. (2
gigs of RAM on my DB machine.) Disk IO is not a bottleneck.
Another thought: if Postgres has one of my CPUs at near 100%, does that
mean I can't get any more performance out of it? Still, how big can I
make sort_mem?
-Pete
"Peter A. Daly" <petedaly@ix.netcom.com> writes:
Tom Lane wrote:
Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
this is just for a one-off btree creation --- for ordinary queries you
need to allow for multiple sorts going on in parallel, which is one
reason the default sort_mem is not very large.)
regards, tom lane
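Back-of-envelope on why even a large sort_mem may still spill to disk for the 500m-row case (the per-tuple byte count below is an assumed figure for illustration, not a measured one):

```python
rows = 500_000_000
bytes_per_sort_tuple = 24        # assumed: 4-byte int key plus per-tuple overhead

sort_bytes = rows * bytes_per_sort_tuple
print(sort_bytes / 2**30)        # roughly 11 GB of sort data

sort_mem = 500 * 2**20           # 500 MB, the high end of the range above
initial_runs = sort_bytes / sort_mem
print(initial_runs)              # ~23 sorted runs to merge externally
```

So under these assumptions the build is still an external merge sort; a bigger sort_mem shrinks the number of initial runs but cannot make the whole sort in-memory on a 2 GB box.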
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
this is just for a one-off btree creation --- for ordinary queries you
need to allow for multiple sorts going on in parallel, which is one
reason the default sort_mem is not very large.)
I will run some benchmarks and let the list know the results. If this
can speed it up a large amount, I can get another 6 gig of RAM into this
machine which I hope can let me leave the SORT_MEM at a high enough
amount to speed up our huge nightly batch database reload.
It's a Dual Xeon 500Mhz Machine.
-Pete
Another thought: if Postgres has one of my CPUs at near 100%, does that
mean I can't get any more performance out of it? Still, how big can I
make sort_mem?
I regularly have one CPU pegged at 100% with PostgreSQL on my queries
(tables of 4m-ish rows and up). (Dual P3 1ghz 2GB, Linux 2.2, PG 7.2.1)
I believe it cannot multithread individual queries using several processors.
I believe IBM DB2 Enterprise _can_ do that - but it also costs $20,000 per
CPU and damned if I can figure out how to install it even for the trial.
Let me ask a similar question:
If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:
a) Could PostgreSQL use all the RAM?
b) The RAM paging would incur a 2-4% slowdown, any other noted speed
reductions?
c) Could PostgreSQL use this "hyperthreading" that apparently is in these
P4 Xeons making dual processors look like four processors (obviously, on
four separate queries)?
d) How much extra performance does having the log or indices on a different
disk buy you, esp. in the instance where you are inserting millions of
records into a table? An indexed table?
I ask "d" above because as I am considering upgrading to such a box, it
will have an 8-disk RAID controller, and I'm wondering if it is a better
idea to:
a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5
Thanks,
Doug
Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
couldn't get PostgreSQL to run with more than that (it might be an OS
limit, Linux 2.2).
I also use 64 megs as a SORT_MEM setting, on the theory that 4 sorts at
once will leave just enough RAM for the rest of my system not to swap.
Cheers,
Doug
Doug Fields <dfields-pg-general@pexicom.com> writes:
d) How much extra performance does having the log or indices on a different
disk buy you, esp. in the instance where you are inserting millions of
records into a table? An indexed table?
Keeping the logs on a separate drive is a big win, I believe, for heavy
update situations. (For read-only queries, of course the log doesn't
matter.)
Keeping indexes on a separate drive is also traditional database advice,
but I don't have any feeling for how much it matters in Postgres.
a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5
You could probably get away without mirroring the indices, if you are
willing to incur a little downtime to rebuild them after an index drive
failure. So another possibility is
2-drive mirror for log, 1 plain old drive for indexes, rest for data.
If your data will fit on 2 drives then you could mirror both, still have
your 8th drive as hot spare, and feel pretty secure.
Note that while it is reasonably painless to configure PG with WAL logs
in a special place (after initdb, move the pg_xlog subdirectory and make
a symlink to its new location), it's not currently easy to separate
indexes from data. So the most practical approach in the short term is
probably your (b).
regards, tom lane
Bear in mind that I am a newbie to the PostgreSQL world but have
experience in other RDBMSs when I ask this question:
If you are looking for the best performance, why go with a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block written;
one to the data and one to the redundancy algorithm.
Is this wrong?
(I mean no disrespect)
--
Kurt Gunderson
Senior Programmer
Applications Development
Lottery Group
Canadian Bank Note Company, Limited
Email: kgunders@cbnlottery.com
Phone:
613.225.6566 x326
Fax:
613.225.6651
http://www.cbnco.com/
"Entropy isn't what it used to be"
Obtaining any information from this message for the purpose of sending
unsolicited commercial Email is strictly prohibited. Receiving this
email does not constitute a request of or consent to send unsolicited
commercial Email.
Your RAID analysis is a bit wrong.
In striping (disk joining) every byte written requires 1 byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used purely for making a
large partition from smaller disks.
In RAID1 (mirroring) Every 1 byte written requires 1 byte written to EACH of
the 2 mirrored disks, for total disk IO of 2bytes.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the parity. Roughly (depending on implementation,
number of disks) every 3 bytes written requires 4 bytes of disk IO.
RAID5 is the fastest from an algorithmic standpoint. There are some gotchas:
RAID5 implemented in hardware is faster than RAID5 implemented by the OS,
simply because the controller on the SCSI card acts like a parallel processor.
RAID5 also wastes the least amount of disk space.
What is cheapest is a relative thing; what is certain is that RAID5
requires more disks (at least 3) than mirroring (exactly 2), but RAID5
wastes less space, so the cost analysis begins with a big "it depends...".
Any disk system will choke under heavy load, especially if the disk write
system is inefficient (like IBM's IDE interface). I think if you did a
test, you would find RAID1 would choke more than RAID5, simply because RAID1
requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends a 7-drive RAID5 for a very
good reason: the more drives, the faster the performance. Here's why:
write 7 bytes on a 7-drive RAID5 and the first byte goes to drive 1, the 2nd
byte to drive 2, etc., and the parity to the final drive. For high-performance
SCSI systems, whose bus IO is faster than the drives (and most SCSI IO chains
ARE faster than the drives attached to them), the drives actually write in
PARALLEL. I can give you a more detailed example, but suffice to say that
with RAID5, writing 7 bytes across 7 data drives takes about the same time as
writing 3 or 4 bytes to a single non-RAID drive. That, my friends, is why
RAID5 (especially when done in hardware) actually improves performance.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
Keeping indexes on a separate drive is also traditional database advice,
but I don't have any feeling for how much it matters in Postgres.
My gut feeling is that you'd be better off taking two drives and
striping them RAID 0 instead of putting data on one and indexes on
another. Either way, you have the same potential total throughput, but if
they're in the RAID array, then you can use the total throughput more
easily, and in more situations - you don't have to depend on reading two
separate pieces of data simultaneously to utilize the entire throughput.
steve
If you are looking for the best performance, why go with a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block written;
one to the data and one to the redundancy algorithm. Is this wrong?
Here's my take on it...
If you have enough RAM to keep everything buffered/cached, and fsync()
is turned off, then the speed of the disk subsystem becomes vastly less
important - you'll only read the data once (first couple of queries), and
then the disks will sit idle. The lights on the disks on my DB machine
only flicker once per minute or less. If that's the case, then I'd
rather use RAID 5 with a hot-spare or two, to increase storage capacity
over 1+0, and speed at reading as well.
Of course, if you write tremendous amounts of data, and your data set is
larger than you can cache/buffer, that all flies out the window.
steve
I think your understanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1)Read sector
2)XOR with data to write
3)Read parity sector
4)XOR with result above
5)Write data
6)Write parity
So you can see, for every logical write, there are two reads and two
writes.
For a 3 disks RAID 5 the process can be shortened:
1)Write data
2)Read other disk
3)XOR with data
4)Write to parity disk.
So, two writes and one read.
JLL
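The accounting above can be captured in a tiny model (a sketch of this thread's argument only, not of any real controller, which will cache and coalesce writes):

```python
def ios_per_small_write(layout, n_disks=None):
    """Physical I/Os needed for one single-block logical write."""
    if layout == "raid0":
        return {"reads": 0, "writes": 1}
    if layout == "raid1":
        return {"reads": 0, "writes": 2}      # one copy per mirror side
    if layout == "raid5":
        if n_disks == 3:
            # Shortcut case: write data, read the one other data disk,
            # XOR, write parity -- one read, two writes.
            return {"reads": 1, "writes": 2}
        # General read-modify-write: read old data and old parity,
        # then write new data and new parity.
        return {"reads": 2, "writes": 2}
    raise ValueError(layout)

print(ios_per_small_write("raid5", n_disks=7))   # {'reads': 2, 'writes': 2}
print(ios_per_small_write("raid5", n_disks=3))   # {'reads': 1, 'writes': 2}
```

Note this counts small scattered writes only; a full-stripe write can compute parity from the new data alone and avoid the reads entirely, which is why sequential-write workloads fare much better on RAID5.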
My simplification was intentional; it still comes out the same, because
on a performance machine (lots of memory) reads are (mostly) pulled from
cache, not disk IO. So the real cost is the disk writes, and 2 = 2.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
Show quoted text
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Jean-Luc
Lachance
Sent: Thursday, May 30, 2002 3:17 PM
To: terry@greatgulfhomes.com
Cc: kgunders@cbnlottery.com; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planningI think your undestanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1)Read sector
2)XOR with data to write
3)Read parity sector
4)XOR with result above
5)write data
6)write paritySo you can see, for every logical write, there is two reads and two
writes.For a 3 disks RAID 5 the process can be shortened:
1)Write data
2)Read other disk
3)XOR with data
4)Write to parity disk.So, two writes and one read.
JLL
terry@greatgulfhomes.com wrote:
Your RAID analysis is a bit wrong.
In striping (disk joining) every byte written requires 1byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used
purely for making a
large partition from smaller disks.
In RAID1 (mirroring) Every 1 byte written requires 1 byte
written to EACH of
the 2 mirrored disks, for total disk IO of 2bytes.
In RAID5, the most efficient solution, every 1 byte written
requires LESS
then 1 byte written for the CRC. Roughly (depending on
implementation,
number of disks) every 3 bytes written requires 4 bytes of disk IO.
RAID5 is the fastest from an algorithm, standpoint. There
is some gotchas,
RAID5 implemented by hardware is faster the RAID5
implemented by OS, simply
because the controller on the SCSI card acts like a
parallel processor.
RAID5 also wastes the least amount of disk space.
What is the cheapest is a relative thing, what is certain
is that RAID 5
requires more disks (at least 3) then mirroring (exactly
2), but RAID5
wastes less space, so the cost analysis begins with a big
"it depends...".
Any disk system will choke under heavy load, especially if
the disk write
system is inefficient (like IBM's IDE interface). I think
if you did a
test, you would find RAID1 would choke more then RAID5
simply because RAID1
requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends 7 drive
RAID5 for a very good
reason: The more the drives, the faster the performance.
Here's why:
Write 7 bytes on a 7 drive RAID5, the first byte goes to
drive 1, 2nd byte
to drive 2, etc, and the CRC to the final drive. For high
performance SCSI
systems, whose BUS IO is faster then drives (and most SCSI
IO chains ARE
faster then the drives they are attached to) the drives
actually write in
PARALLEL. I can give you a more detailed example, but
suffice to say that
with RAID5 writing 7 bytes to 7 data drives takes about the
same time to
write 3 or 4 bytes to a single non raid drive. That my
friends, is why
RAID5 (especially when done by hardware) actually improves
performance.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf OfKurt Gunderson
Sent: Thursday, May 30, 2002 12:59 PM
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planningBear in mind that I am a newbie to the PostgreSQL world but have
experience in other RDBMSs when I ask this question:If you are looking for the best performance, why go with
a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution?
Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy
but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block
written;
one to the data and one to the redundancy algorithm.Is this wrong?
(I mean no disrespect)
Tom Lane wrote:
Doug Fields <dfields-pg-general@pexicom.com> writes:
d) How much extra performance does having the log or
indices on a different
disk buy you, esp. in the instance where you are inserting
millions of
records into a table? An indexed table?
Keeping the logs on a separate drive is a big win, I
believe, for heavy
update situations. (For read-only queries, of course the
log doesn't
matter.)
Keeping indexes on a separate drive is also traditional
database advice,
but I don't have any feeling for how much it matters in
Postgres.
a) Run everything on one 7-drive RAID 5 partition (8th
drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drivemirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a5-drive RAID 5
You could probably get away without mirroring the indices,
if you are
willing to incur a little downtime to rebuild them after an
index drive
failure. So another possibility is
2-drive mirror for log, 1 plain old drive for indexes, rest
for data.
If your data will fit on 2 drives then you could mirror
both, still have
your 8th drive as hot spare, and feel pretty secure.
Note that while it is reasonably painless to configure PG
with WAL logs
in a special place (after initdb, move the pg_xlog
subdirectory and make
a symlink to its new location), it's not currently easy
to separate
indexes from data. So the most practical approach in the
short term is
probably your (b).
regards, tom lane
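Tom's pg_xlog recipe can be sketched roughly as follows (a Python stand-in for the manual steps, purely illustrative; the paths are hypothetical, and the postmaster must be stopped before moving anything):

```python
# Rough sketch of relocating WAL after initdb: move the pg_xlog
# subdirectory onto the dedicated log drive and leave a symlink at
# the old location, so postgres keeps opening $PGDATA/pg_xlog.
import os
import shutil

def relocate_pg_xlog(pgdata, target):
    src = os.path.join(pgdata, "pg_xlog")
    shutil.move(src, target)     # move WAL onto the log disk
    os.symlink(target, src)      # symlink back into the data directory

# Hypothetical paths, e.g.:
# relocate_pg_xlog("/var/lib/pgsql/data", "/mnt/logdisk/pg_xlog")
```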
--
Kurt Gunderson
Senior Programmer
Applications Development
Lottery Group
Canadian Bank Note Company, Limited
Email: kgunders@cbnlottery.com
Phone:
613.225.6566 x326
Fax:
613.225.6651
http://www.cbnco.com/

"Entropy isn't what it used to be"
Obtaining any information from this message for the
purpose of sending
unsolicited commercial Email is strictly prohibited.
Receiving this
email does not constitute a request of or consent to send
unsolicited
commercial Email.
This still troubles me.
terry@greatgulfhomes.com wrote:
In striping (disk joining) every byte written requires 1 byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used purely for making a
large partition from smaller disks.
Not necessarily. RAID0 absolutely shines in 'concurrent' access to
disk. When a hundred people want a hundred different records from disk,
the chance becomes greater that any needle on any disk in the striped
set will be found near the block where the record exists. Small gains
for a small shop but when multiplied with hundreds/thousands of
concurrent users across many Gigs of data the benefits add up.
Likewise, because a data block is written across many disks (depending
on stripe size), the task can 'almost' be done concurrently by the controller.
In RAID1 (mirroring), every 1 byte written requires 1 byte written to EACH of
the 2 mirrored disks, for total disk IO of 2 bytes.
I agree and, to go further: on some (most?) RAID disk controllers, when
reading a block of data, the first disk of the mirrored pair to locate
the data (given the location of the needle on the spindle) will return
the block to the controller.
Likewise, when writing to the mirrored pair (and using 'write-through',
never 'write-back'), the controller will pass along the 'data written'
flag to the CPU when the first disk of the pair writes the data. The
second will sync eventually but the controller need not wait for both.
So in combining these technologies and utilizing RAID1+0, you gain all
the benefits of striping and mirroring.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the CRC. Roughly (depending on implementation and
number of disks), every 3 bytes written requires 4 bytes of disk IO.

RAID5 is the fastest from an algorithmic standpoint. There are some
gotchas: RAID5 implemented by hardware is faster than RAID5 implemented by
the OS, simply because the controller on the SCSI card acts like a
parallel processor. RAID5 also wastes the least amount of disk space.
In addition to the additional writes, RAID5 still requires at least the
computation of the CRC based on the amount of data written even if it
can concurrently write that data across many disks. You can achieve the
same efficiency with RAID0 less the cost of calculating the CRC and gain
the integrity/performance by using RAID1.
What is cheapest is a relative thing; what is certain is that RAID5
requires more disks (at least 3) than mirroring (exactly 2), but RAID5
wastes less space, so the cost analysis begins with a big "it depends...".

Any disk system will choke under heavy load, especially if the disk write
system is inefficient (like IBM's IDE interface). I think if you did a
test, you would find RAID1 would choke more than RAID5, simply because
RAID1 requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends 7 drive RAID5 for a very good
reason: The more the drives, the faster the performance. Here's why:
Write 7 bytes on a 7-drive RAID5: the first byte goes to drive 1, the 2nd
byte to drive 2, etc., and the CRC to the final drive. For high-performance
SCSI systems, whose bus IO is faster than the drives (and most SCSI IO
chains ARE faster than the drives they are attached to), the drives
actually write in PARALLEL. I can give you a more detailed example, but
suffice it to say that with RAID5, writing 7 bytes to 7 data drives takes
about the same time as writing 3 or 4 bytes to a single non-RAID drive.
That, my friends, is why RAID5 (especially when done by hardware) actually
improves performance.
I fully agree with you that hardware RAID is the way to go.
I would still place log files (or sequential transaction files) on a
RAID1 because movement of the needle on the disk of the mirrored pair is
minimal.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:
I agree with ALMOST everything you say, but have a few minor nits to pick.
Nothing personal, just my own experience on RAID testing and such.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the CRC.
This isn't true for any RAID 5 implementation I'm familiar with. The
parity stripe is exactly the same size as the data stripes it shares space
with. But this isn't a real important point, since most RAID arrays write
8k to 128k at a time. Note that it's not a CRC (Cyclic Redundancy Check)
that gets written, but straight XOR, hence no space savings.
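Scott's point is easy to see in a few lines of Python (a toy byte-wise model, not anything a real controller runs): the parity stripe is a plain XOR of the data stripes, so it is exactly as large as one data stripe, and XOR-ing the survivors with the parity rebuilds any single lost stripe.

```python
# Three toy "data stripes"; real arrays work on 8k-128k chunks.
stripes = [bytes([1, 2, 3]), bytes([16, 32, 48]), bytes([255, 0, 15])]

def xor_stripes(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

parity = xor_stripes(stripes)
assert len(parity) == len(stripes[0])      # same size: no space savings

# Lose stripe 1; XOR the survivors with the parity to reconstruct it.
recovered = xor_stripes([stripes[0], stripes[2], parity])
assert recovered == stripes[1]
```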
Roughly (depending on implementation and number of disks), every 3 bytes
written requires 4 bytes of disk IO. RAID5 is the fastest from an
algorithmic standpoint. There are some gotchas: RAID5 implemented by
hardware is faster than RAID5 implemented by the OS, simply because the
controller on the SCSI card acts like a parallel processor.
This is most definitely not always true, even given equal hardware specs
(i.e. number and type of drives / interfaces all the same).
My old AMI Megaraid card with 3 Ultra Wide SCSI ports can generate 64 Megs
of parity data per second. My Celeron 1.1GHz machine can generate
2584 Megs of parity data per second. The load on the CPU under VERY
heavy reads and writes is about 0.3% cpu, and the max throughput on reads
on a RAID array of 4 VERY old (non-ultra non-wide 7200RPM) 2 Gig drives is
about 33 Megs a second read speed.
Same setup with 7200 RPM 4 Gig Ultra-narrow drives on an AMI RAID can read
at about 14 Megabytes a second.
The most important part of fast RAID is the drives first, interface
second, and hardware versus software raid last. While it was often true
in the dark past of 33 Megahertz CPUs and such that hardware raid was
always faster, it is often much better to spend the extra money a RAID
controller would cost you and just buy more drives or cheaper controllers
or memory or CPUs.
Generally speaking, I've found RAID5 with 4 or fewer drives to be about
even with RAID1, while RAID5 with 6 or more drives quickly starts to
outrun a two drive mirror set. This is especially true under heavy
parallel access.
On a subject no one's mentioned yet, >2 drives in a RAID1 setup.
I've done some testing with >2 drives in a mirror (NOT 1+0 or 0+1, just
RAID1 with >2 drives) under Linux, and found that if you are doing 90%
reads then it's also a good solution, but for most real world database
apps, it doesn't really help a lot.
What is RAID1 with > 2 drives???
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Scott Marlowe
Sent: Thursday, May 30, 2002 5:45 PM
To: terry@greatgulfhomes.com
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planning
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:
What is RAID1 with > 2 drives???
Just that: a mirror set that writes all changes to all drives.

It costs more for writes but runs faster for reads, so it's not a good
idea in a heavily written environment, but it really flies for heavy-read
setups, like data mining or CRM stuff. Plus, a single drive failure is no
big deal; you just replace it at your leisure, since you're still fully
redundant.
This means you could do crazy things like build 9 disks into a RAID 0+1 by
making three sets of RAID 1 with three drives, and then put all three
together to make the RAID 0. Very good read speeds under heavy loads, and
fairly good write speeds as well.
You could run a fairly large database on a setup of nine 36 Gig Ultra SCSI
drives like this, and get some nice throughput, and have it be VERY
redundant. I.e., ANY two drives could fail and you'd still be up, and if
the two drives that failed were in different RAID1s, you'd still be fully
redundant.
Crazy stuff.
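Scott's nine-drive claim is easy to sanity-check with a toy script (Python purely for illustration; the drive numbering is made up). The striped set stays up as long as every three-way mirror group still has at least one live drive, so any two failures are survivable.

```python
# Nine drives as three three-way mirror groups, striped together (0+1).
from itertools import combinations

groups = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]

def array_up(failed):
    """The stripe survives if each mirror group keeps a live drive."""
    return all(group - failed for group in groups)

# ANY two drives can fail and the array stays up:
assert all(array_up(set(pair)) for pair in combinations(range(9), 2))

# It takes all three drives of one mirror group to lose the array:
assert not array_up({3, 4, 5})
```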
Jean-Luc Lachance writes:
I think your understanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1) Read sector
2) XOR with data to write
3) Read parity sector
4) XOR with result above
5) Write data
6) Write parity
Yes, generally. There are a couple of tricks you can do to help get
around this, though.
One, which works very nicely when doing sequential writes, is to attempt
to hold off on the write until you collect an entire stripe's worth
of data. Then you can calculate the parity based on what's in memory,
and write the new blocks across all of the disks without worrying
about what was on them before. 3ware's Escalade IDE RAID controllers
(the 3W-7x50 series, anyway) do this. Their explanation is at
http://www.3ware.com/NewFaq/general_operating_and_troubleshooting.htm#R5_Fusion_Explained
Another tactic is just to buffer entire stripes. Baydel does this
with their disk arrays, which are actually RAID-3, not RAID-5.
(Since they do only full-stripe reads and writes, it doesn't really
make any difference which they use.) You want a fair amount of RAM
in your controller for buffering in this case, but it keeps the
computer's "read, modify, write" cycle on one block from turning
into "read, read, modify, write".
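The two write paths can be contrasted in a toy Python model (one-byte "sectors", purely illustrative; real controllers work on whole stripes of kilobytes): the six-step small write costs two reads plus two writes per block, while a buffered full-stripe write computes parity in memory and does no reads at all.

```python
from functools import reduce
from operator import xor

def raid5_small_write(old_data, old_parity, new_data):
    """Steps 1-6 above: two reads already done, two XORs in the
    controller, then two writes (new data, new parity) pending."""
    return old_parity ^ old_data ^ new_data

def raid5_full_stripe_write(new_blocks):
    """Whole stripe buffered: parity from memory, zero disk reads,
    one write per data drive plus one for parity."""
    return reduce(xor, new_blocks)

a, b, c = 0x11, 0x22, 0x44
parity = a ^ b ^ c

# Small write of block b: 2 reads + 2 writes on disk.
parity = raid5_small_write(b, parity, 0x99)
assert parity == a ^ 0x99 ^ c

# Full-stripe write of fresh data: no reads at all.
assert raid5_full_stripe_write([a, b, c]) == 0x77
```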
Terry Fielder writes:
My simplification was intended; anyway, it still equates to the same,
because in a performance machine (lots of memory) reads are (mostly)
pulled from cache (not disk IO). So the real cost is disk writes, and
2 = 2.
Well, it really depends on your workload. If you're selecting stuff
almost completely randomly scattered about a large table (like the
25 GB one I'm dealing with right now), it's going to be a bit pricy
to get hold of a machine with enough memory to cache that effectively.
Kurt Gunderson writes:
] Likewise, when writing to the mirrored pair (and using 'write-through',
] never 'write-back'), the controller will pass along the 'data written'
] flag to the CPU when the first disk of the pair writes the data. The
] second will sync eventually but the controller need not wait for both.
I hope not! I think that controller ought to wait for both to be written,
because otherwise you can have this scenario:
1. Write of block X scheduled for drives A and B.
2. Block written to drive A. Still pending on drive B.
3. Controller returns "block committed to stable storage" to application.
4. Power failure. Pending write to drive B is never written.
Now, how do you know, when the system comes back up, that you have
a good copy of the block on drive A, but not on drive B?
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC