Non-linear Performance
I'm noticing that performance in creating a particular index, and
also a little bit in a simple query, seems somewhat non-linear,
and I'm wondering if anybody could give me any clues as to what
might be causing this, and how I might fix it, if that's possible.
I've done a COPY in and index build on three data sets (just integer
data) consisting of 10m rows (500 MB table), 100m rows (5 GB table),
and 500m rows (25 GB table). (This is all on a 1.4 GHz P4, 512 MB RAM,
two 7200 RPM IDE drives, one for data and one for log.)
The COPY command to do the import is pretty linear, at about 230
sec., 2000 sec and 10,000 sec. for the 10m, 100m and 500m row
tables. Neither disk I/O nor CPU seem to be pushed particularly,
though I'm getting lots of small writes to the log files. I tried
using a larger checkpoint size, but it didn't seem to help any.
Recreating the primary key (on an INT, and of course all values
unique) took 123, 1300 and 6700 seconds, again pretty linear.
The next column is a DATE, which is the same value for the first
3.5M records, the next day for the next 3.5M records, and so on,
incrementing a day for every 3.5M records (in order). This index
build took about 60, 700 and 3500 seconds, respectively, again
linear.
But then I get to the next INT column which in every row is filled
in with a random value between 0 and 99,999. This index takes about
175, 3600, and 28,000 seconds, respectively, to generate.
So it takes about 2x as long per record going from 10m to 100m
records, and about 1.5x as long again per record when going from
100m to 500m records.
Queries using that index seem to do this too, though not quite as
badly. Using a very simple query such as "SELECT COUNT(*) FROM
table WHERE value = 12345" (where value is the last INT column
above that took ages to index), typical query times (including
connection overhead) for data not in the cache are 0.6 sec., 11
sec. and 72 sec.
This query, as expected, is completely dominated by random IO; the
disk the table is on sits there at 100% usage (i.e., disk requests
outstanding 100% of the time) and not much else is happening at all.
It does seem to do a few more disk transfers than I would
really expect. I get back a count of around 4000-5000, which to me
implies about 5000 reads plus the index reads (which one would
think would not amount to more than one or two hundred pages), yet
110 I/O requests per second times 70 seconds implies about 7000
reads. Is there something I'm missing here?
(If the query plan and analysis is any help, here it is:

Aggregate  (cost=22081.46..22081.46 rows=1 width=0)
           (actual time=70119.88..70119.88 rows=1 loops=1)
  ->  Index Scan using data_3_value on data_3
        (cost=0.00..22067.71 rows=5498 width=0)
        (actual time=38.70..70090.45 rows=4980 loops=1)
Total runtime: 70121.74 msec)
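A quick sanity check of the arithmetic in the message, using only the rough figures quoted above (the index page count in particular is a guess, not a measurement):

```python
# Figures quoted in the message above; the index page estimate is a guess.
matching_rows = 4980            # rows found by the index scan
heap_reads = matching_rows      # worst case: one heap page fetch per row
index_reads = 200               # generous guess for b-tree pages touched
expected = heap_reads + index_reads

observed = 110 * 70             # ~110 I/O requests/sec for ~70 seconds

print(expected)                 # 5180
print(observed)                 # 7700
```

The gap between ~5200 and ~7700 is the puzzle; though extrapolating from an IOPS meter is itself noisy, so the two may not be as far apart as they look.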
Anyway, I'm open to any thoughts on this. In particular, I'm open
to suggestions for cheap ways of dealing with this horrible random
I/O load. (Yeah, yeah, I know: disk array, and SCSI while I'm about
it. But I've been asked to get this sort of thing working fast on
systems much cheaper than the current Sun/Sybase/EMC or whatever
it is that they're using.)
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
I'm noticing that performance in creating a particular index, and
also a little bit in a simple query, seems somewhat non-linear,
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity.
Increasing SORT_MEM would help the constant factor, however...
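For reference, here is what an O(n*log(n)) sort alone would predict for the build-time ratios reported above (a rough check, nothing more):

```python
import math

def nlogn(n):
    return n * math.log(n)

sizes = [10_000_000, 100_000_000, 500_000_000]
observed = [175, 3600, 28_000]   # index build times (seconds) from the post

for i in range(1, len(sizes)):
    predicted = nlogn(sizes[i]) / nlogn(sizes[i - 1])
    actual = observed[i] / observed[i - 1]
    print(f"{sizes[i-1]:,} -> {sizes[i]:,}: "
          f"n*log(n) predicts {predicted:.1f}x, observed {actual:.1f}x")
```

The observed ratios (~20.6x and ~7.8x) grow faster than n*log(n) alone predicts (~11.4x and ~5.4x), which would be consistent with the sort spilling to disk and doing extra merge passes at the larger sizes.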
Queries using that index seem to do this too, though not quite as
badly. Using a very simple query such as "SELECT COUNT(*) FROM
table WHERE value = 12345" (where value is the last INT column
above that took ages to index), typical query times (including
connection overhead) for data not in the cache are 0.6 sec., 11
sec. and 72 sec.
I guess that the smaller datasets would get proportionally more benefit
from kernel disk caching.
It does seem to do a few more disk transfers than I would
really expect. I get back a count of around 4000-5000, which to me
implies about 5000 reads plus the index reads (which one would
think would not amount to more than one or two hundred pages), yet
110 I/O requests per second times 70 seconds implies about 7000
reads. Is there something I'm missing here?
Can you demonstrate that it actually did 7000 reads, and not 5000+?
That extrapolation technique doesn't look to me like it has the
accuracy to tell the difference. You might try setting show_query_stats
(note the results go into the postmaster log, not to the client;
perhaps we ought to change that someday).
Also, if you've updated the table at all, there might be some fetches of
dead tuples involved.
Anyway, I'm open to any thoughts on this. In particular, I'm open
to suggestions for cheap ways of dealing with this horrible random
I/O load.
More RAM, perhaps.
regards, tom lane
Tom Lane wrote:
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity. Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate? Pretend
the ONLY thing I care about is BTREE index creation time. (2 gigs of
RAM on my DB machine.) Disk IO is not a bottleneck.
-Pete
Peter A. Daly wrote:
Tom Lane wrote:
Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data. Not sure why you'd expect
linearity. Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
Pretend the ONLY thing I care about is BTREE index creation time. (2
gigs of RAM on my DB machine.) Disk IO is not a bottleneck.
Another thought: if Postgres has one of my CPUs at near 100%, does that
mean I can't get any more performance out of it? Still, how big can I
make sort_mem?
-Pete
"Peter A. Daly" <petedaly@ix.netcom.com> writes:
Tom Lane wrote:
Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
this is just for a one-off btree creation --- for ordinary queries you
need to allow for multiple sorts going on in parallel, which is one
reason the default sort_mem is not very large.)
regards, tom lane
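Back-of-envelope on why even a large sort_mem may still spill to disk for the 500m-row case (the per-tuple byte count below is an assumed figure for illustration, not a measured one):

```python
rows = 500_000_000
bytes_per_sort_tuple = 24        # assumed: 4-byte int key plus per-tuple overhead

sort_bytes = rows * bytes_per_sort_tuple
print(sort_bytes / 2**30)        # roughly 11 GB of sort data

sort_mem = 500 * 2**20           # 500 MB, the high end of the range above
initial_runs = sort_bytes / sort_mem
print(initial_runs)              # ~23 sorted runs to merge externally
```

So under these assumptions the build is still an external merge sort; a bigger sort_mem shrinks the number of initial runs but cannot make the whole sort in-memory on a 2 GB box.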
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
this is just for a one-off btree creation --- for ordinary queries you
need to allow for multiple sorts going on in parallel, which is one
reason the default sort_mem is not very large.)
I will run some benchmarks and let the list know the results. If this
can speed it up a large amount, I can get another 6 gig of RAM into this
machine which I hope can let me leave the SORT_MEM at a high enough
amount to speed up our huge nightly batch database reload.
It's a Dual Xeon 500Mhz Machine.
-Pete
Another thought: if Postgres has one of my CPUs at near 100%, does that
mean I can't get any more performance out of it? Still, how big can I
make sort_mem?
I regularly have one CPU pegged at 100% with PostgreSQL on my queries
(tables of 4m-ish rows and up). (Dual P3 1ghz 2GB, Linux 2.2, PG 7.2.1)
I believe it cannot multithread individual queries using several processors.
I believe IBM DB2 Enterprise _can_ do that - but it also costs $20,000 per
CPU and damned if I can figure out how to install it even for the trial.
Let me ask a similar question:
If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:
a) Could PostgreSQL use all the RAM?
b) The RAM paging would incur a 2-4% slowdown, any other noted speed
reductions?
c) Could PostgreSQL use this "hyperthreading" that apparently is in these
P4 Xeons making dual processors look like four processors (obviously, on
four separate queries)?
d) How much extra performance does having the log or indices on a different
disk buy you, esp. in the instance where you are inserting millions of
records into a table? An indexed table?
I ask "d" above because as I am considering upgrading to such a box, it
will have an 8-disk RAID controller, and I'm wondering if it is a better
idea to:
a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5
Thanks,
Doug
Increasing SORT_MEM would help the constant factor, however...
What is the most amount of SORT_MEM it makes sense to allocate?
I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines. In a 2G machine
I might try settings around 100M-500M to see what works best. (Note
I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
couldn't get PostgreSQL to run with more than that (it might be an OS
limit, Linux 2.2).
I also use 64 megs as a SORT_MEM setting, on the theory that 4 sorts at
once will leave just enough RAM for the rest of my system not to swap.
Cheers,
Doug
Doug Fields <dfields-pg-general@pexicom.com> writes:
d) How much extra performance does having the log or indices on a different
disk buy you, esp. in the instance where you are inserting millions of
records into a table? An indexed table?
Keeping the logs on a separate drive is a big win, I believe, for heavy
update situations. (For read-only queries, of course the log doesn't
matter.)
Keeping indexes on a separate drive is also traditional database advice,
but I don't have any feeling for how much it matters in Postgres.
a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5
You could probably get away without mirroring the indices, if you are
willing to incur a little downtime to rebuild them after an index drive
failure. So another possibility is
2-drive mirror for log, 1 plain old drive for indexes, rest for data.
If your data will fit on 2 drives then you could mirror both, still have
your 8th drive as hot spare, and feel pretty secure.
Note that while it is reasonably painless to configure PG with WAL logs
in a special place (after initdb, move the pg_xlog subdirectory and make
a symlink to its new location), it's not currently easy to separate
indexes from data. So the most practical approach in the short term is
probably your (b).
regards, tom lane
Bear in mind that I am a newbie to the PostgreSQL world but have
experience in other RDBMSs when I ask this question:
If you are looking for the best performance, why go with a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block written;
one to the data and one to the redundancy algorithm.
Is this wrong?
(I mean no disrespect)
--
Kurt Gunderson
Senior Programmer
Applications Development
Lottery Group
Canadian Bank Note Company, Limited
Email: kgunders@cbnlottery.com
Phone:
613.225.6566 x326
Fax:
613.225.6651
http://www.cbnco.com/
"Entropy isn't what it used to be"
Obtaining any information from this message for the purpose of sending
unsolicited commercial Email is strictly prohibited. Receiving this
email does not constitute a request of or consent to send unsolicited
commercial Email.
Your RAID analysis is a bit wrong.
In striping (disk joining) every byte written requires 1 byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used purely for making a
large partition from smaller disks.
In RAID1 (mirroring) Every 1 byte written requires 1 byte written to EACH of
the 2 mirrored disks, for total disk IO of 2bytes.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the parity. Roughly (depending on implementation,
number of disks) every 3 bytes written requires 4 bytes of disk IO.
RAID5 is the fastest from an algorithmic standpoint. There are some gotchas:
RAID5 implemented in hardware is faster than RAID5 implemented by the OS,
simply because the controller on the SCSI card acts like a parallel processor.
RAID5 also wastes the least amount of disk space.
What is cheapest is a relative thing; what is certain is that RAID5
requires more disks (at least 3) than mirroring (exactly 2), but RAID5
wastes less space, so the cost analysis begins with a big "it depends...".
Any disk system will choke under heavy load, especially if the disk write
system is inefficient (like IBM's IDE interface). I think if you did a
test, you would find RAID1 would choke more than RAID5, simply because RAID1
requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends a 7-drive RAID5 for a very
good reason: the more drives, the faster the performance. Here's why:
write 7 bytes on a 7-drive RAID5 and the first byte goes to drive 1, the 2nd
byte to drive 2, etc., and the parity to the final drive. For high-performance
SCSI systems, whose bus IO is faster than the drives (and most SCSI IO chains
ARE faster than the drives attached to them), the drives actually write in
PARALLEL. I can give you a more detailed example, but suffice to say that
with RAID5, writing 7 bytes across 7 data drives takes about the same time as
writing 3 or 4 bytes to a single non-RAID drive. That, my friends, is why
RAID5 (especially when done in hardware) actually improves performance.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
Keeping indexes on a separate drive is also traditional database advice,
but I don't have any feeling for how much it matters in Postgres.
My gut feeling is that you'd be better off taking two drives and
striping them RAID 0 instead of putting data on one and indexes on
another. Either way, you have the same potential total throughput, but if
they're in the RAID array, then you can use the total throughput more
easily, and in more situations - you don't have to depend on reading two
separate pieces of data simultaneously to utilize the entire throughput.
steve
If you are looking for the best performance, why go with a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block written;
one to the data and one to the redundancy algorithm. Is this wrong?
Here's my take on it...
If you have enough RAM to keep everything buffered/cached, and fsync()
is turned off, then the speed of the disk subsystem becomes vastly less
important - you'll only read the data once (first couple of queries), and
then the disks will sit idle. The lights on the disks on my DB machine
only flicker once per minute or less. If that's the case, then I'd
rather use RAID 5 with a hot-spare or two, to increase storage capacity
over 1+0, and speed at reading as well.
Of course, if you write tremendous amounts of data, and your data set is
larger than you can cache/buffer, that all flies out the window.
steve
I think your understanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1)Read sector
2)XOR with data to write
3)Read parity sector
4)XOR with result above
5)Write data
6)Write parity
So you can see, for every logical write, there are two reads and two
writes.
For a 3 disks RAID 5 the process can be shortened:
1)Write data
2)Read other disk
3)XOR with data
4)Write to parity disk.
So, two writes and one read.
JLL
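The accounting above can be captured in a tiny model (a sketch of this thread's argument only, not of any real controller, which will cache and coalesce writes):

```python
def ios_per_small_write(layout, n_disks=None):
    """Physical I/Os needed for one single-block logical write."""
    if layout == "raid0":
        return {"reads": 0, "writes": 1}
    if layout == "raid1":
        return {"reads": 0, "writes": 2}      # one copy per mirror side
    if layout == "raid5":
        if n_disks == 3:
            # Shortcut case: write data, read the one other data disk,
            # XOR, write parity -- one read, two writes.
            return {"reads": 1, "writes": 2}
        # General read-modify-write: read old data and old parity,
        # then write new data and new parity.
        return {"reads": 2, "writes": 2}
    raise ValueError(layout)

print(ios_per_small_write("raid5", n_disks=7))   # {'reads': 2, 'writes': 2}
print(ios_per_small_write("raid5", n_disks=3))   # {'reads': 1, 'writes': 2}
```

Note this counts small scattered writes only; a full-stripe write can compute parity from the new data alone and avoid the reads entirely, which is why sequential-write workloads fare much better on RAID5.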
My simplification was intentional; it still comes out the same, because
on a performance machine (lots of memory) reads are (mostly) pulled from
cache, not disk IO. So the real cost is the disk writes, and 2 = 2.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
Show quoted text
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Jean-Luc
Lachance
Sent: Thursday, May 30, 2002 3:17 PM
To: terry@greatgulfhomes.com
Cc: kgunders@cbnlottery.com; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planningI think your undestanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1)Read sector
2)XOR with data to write
3)Read parity sector
4)XOR with result above
5)write data
6)write paritySo you can see, for every logical write, there is two reads and two
writes.For a 3 disks RAID 5 the process can be shortened:
1)Write data
2)Read other disk
3)XOR with data
4)Write to parity disk.So, two writes and one read.
JLL
terry@greatgulfhomes.com wrote:
Your RAID analysis is a bit wrong.
In striping (disk joining) every byte written requires 1byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used
purely for making a
large partition from smaller disks.
In RAID1 (mirroring) Every 1 byte written requires 1 byte
written to EACH of
the 2 mirrored disks, for total disk IO of 2bytes.
In RAID5, the most efficient solution, every 1 byte written
requires LESS
then 1 byte written for the CRC. Roughly (depending on
implementation,
number of disks) every 3 bytes written requires 4 bytes of disk IO.
RAID5 is the fastest from an algorithm, standpoint. There
is some gotchas,
RAID5 implemented by hardware is faster the RAID5
implemented by OS, simply
because the controller on the SCSI card acts like a
parallel processor.
RAID5 also wastes the least amount of disk space.
What is the cheapest is a relative thing, what is certain
is that RAID 5
requires more disks (at least 3) then mirroring (exactly
2), but RAID5
wastes less space, so the cost analysis begins with a big
"it depends...".
Any disk system will choke under heavy load, especially if
the disk write
system is inefficient (like IBM's IDE interface). I think
if you did a
test, you would find RAID1 would choke more then RAID5
simply because RAID1
requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends 7 drive
RAID5 for a very good
reason: The more the drives, the faster the performance.
Here's why:
Write 7 bytes on a 7 drive RAID5, the first byte goes to
drive 1, 2nd byte
to drive 2, etc, and the CRC to the final drive. For high
performance SCSI
systems, whose BUS IO is faster then drives (and most SCSI
IO chains ARE
faster then the drives they are attached to) the drives
actually write in
PARALLEL. I can give you a more detailed example, but
suffice to say that
with RAID5 writing 7 bytes to 7 data drives takes about the
same time to
write 3 or 4 bytes to a single non raid drive. That my
friends, is why
RAID5 (especially when done by hardware) actually improves
performance.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf OfKurt Gunderson
Sent: Thursday, May 30, 2002 12:59 PM
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planningBear in mind that I am a newbie to the PostgreSQL world but have
experience in other RDBMSs when I ask this question:If you are looking for the best performance, why go with
a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution?
Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy
but, from my
experience, RAID5 chokes horribly under heavy disk writing. RAID5
always requires at least two write operations for every block
written;
one to the data and one to the redundancy algorithm.Is this wrong?
(I mean no disrespect)
Tom Lane wrote:
Doug Fields <dfields-pg-general@pexicom.com> writes:
d) How much extra performance does having the log or
indices on a different
disk buy you, esp. in the instance where you are inserting
millions of
records into a table? An indexed table?
Keeping the logs on a separate drive is a big win, I
believe, for heavy
update situations. (For read-only queries, of course the
log doesn't
matter.)
Keeping indexes on a separate drive is also traditional
database advice,
but I don't have any feeling for how much it matters in
Postgres.
a) Run everything on one 7-drive RAID 5 partition (8th
drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drivemirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a5-drive RAID 5
You could probably get away without mirroring the indices,
if you are
willing to incur a little downtime to rebuild them after an
index drive
failure. So another possibility is
2-drive mirror for log, 1 plain old drive for indexes, rest
for data.
If your data will fit on 2 drives then you could mirror
both, still have
your 8th drive as hot spare, and feel pretty secure.
Note that while it is reasonably painless to configure PG
with WAL logs
in a special place (after initdb, move the pg_xlog
subdirectory and make
a symlink to its new location), it's not currently easy
to separate
indexes from data. So the most practical approach in the
short term is
probably your (b).
regards, tom lane
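Tom's pg_xlog recipe can be sketched roughly as follows (a Python stand-in for the manual steps, purely illustrative; the paths are hypothetical, and the postmaster must be stopped before moving anything):

```python
# Rough sketch of relocating WAL after initdb: move the pg_xlog
# subdirectory onto the dedicated log drive and leave a symlink at
# the old location, so postgres keeps opening $PGDATA/pg_xlog.
import os
import shutil

def relocate_pg_xlog(pgdata, target):
    src = os.path.join(pgdata, "pg_xlog")
    shutil.move(src, target)     # move WAL onto the log disk
    os.symlink(target, src)      # symlink back into the data directory

# Hypothetical paths, e.g.:
# relocate_pg_xlog("/var/lib/pgsql/data", "/mnt/logdisk/pg_xlog")
```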
--
Kurt Gunderson
Senior Programmer
Applications Development
Lottery Group
Canadian Bank Note Company, Limited
Email: kgunders@cbnlottery.com
Phone:
613.225.6566 x326
Fax:
613.225.6651
http://www.cbnco.com/

"Entropy isn't what it used to be"
Obtaining any information from this message for the
purpose of sending
unsolicited commercial Email is strictly prohibited.
Receiving this
email does not constitute a request of or consent to send
unsolicited
commercial Email.
This still troubles me.
terry@greatgulfhomes.com wrote:
In striping (disk joining) every byte written requires 1 byte sent to 1
disk. This gives you ZERO redundancy: RAID0 is used purely for making a
large partition from smaller disks.
Not necessarily. RAID0 absolutely shines in 'concurrent' access to
disk. When a hundred people want a hundred different records from disk,
the chance becomes greater that any needle on any disk in the striped
set will be found near the block where the record exists. Small gains
for a small shop but when multiplied with hundreds/thousands of
concurrent users across many Gigs of data the benefits add up.
Likewise, because a data block is written across many disks (depending
on stripe size), the task can 'almost' be done concurrently by the controller.
In RAID1 (mirroring), every 1 byte written requires 1 byte written to EACH of
the 2 mirrored disks, for total disk IO of 2 bytes.
I agree and, to go further: on some (most?) RAID disk controllers, when
reading a block of data, the first disk of the mirrored pair to locate
the data (given the location of the needle on the spindle) will return
the block to the controller.
Likewise, when writing to the mirrored pair (and using 'write-through',
never 'write-back'), the controller will pass along the 'data written'
flag to the CPU when the first disk of the pair writes the data. The
second will sync eventually but the controller need not wait for both.
So in combining these technologies and utilizing RAID1+0, you gain all
the benefits of striping and mirroring.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the CRC. Roughly (depending on implementation and
number of disks), every 3 bytes written requires 4 bytes of disk IO.

RAID5 is the fastest from an algorithmic standpoint. There are some
gotchas: RAID5 implemented by hardware is faster than RAID5 implemented by
the OS, simply because the controller on the SCSI card acts like a
parallel processor. RAID5 also wastes the least amount of disk space.
In addition to the additional writes, RAID5 still requires at least the
computation of the CRC based on the amount of data written even if it
can concurrently write that data across many disks. You can achieve the
same efficiency with RAID0 less the cost of calculating the CRC and gain
the integrity/performance by using RAID1.
What is cheapest is a relative thing; what is certain is that RAID5
requires more disks (at least 3) than mirroring (exactly 2), but RAID5
wastes less space, so the cost analysis begins with a big "it depends...".

Any disk system will choke under heavy load, especially if the disk write
system is inefficient (like IBM's IDE interface). I think if you did a
test, you would find RAID1 would choke more than RAID5, simply because
RAID1 requires MORE disk IO for the same bytes being saved.
Referring to what Tom Lane said, he recommends 7 drive RAID5 for a very good
reason: The more the drives, the faster the performance. Here's why:
Write 7 bytes on a 7-drive RAID5: the first byte goes to drive 1, the 2nd
byte to drive 2, etc., and the CRC to the final drive. For high-performance
SCSI systems, whose bus IO is faster than the drives (and most SCSI IO
chains ARE faster than the drives they are attached to), the drives
actually write in PARALLEL. I can give you a more detailed example, but
suffice it to say that with RAID5, writing 7 bytes to 7 data drives takes
about the same time as writing 3 or 4 bytes to a single non-RAID drive.
That, my friends, is why RAID5 (especially when done by hardware) actually
improves performance.
I fully agree with you that hardware RAID is the way to go.
I would still place log files (or sequential transaction files) on a
RAID1 because movement of the needle on the disk of the mirrored pair is
minimal.
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:
I agree with ALMOST everything you say, but have a few minor nits to pick.
Nothing personal, just my own experience on RAID testing and such.
In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the CRC.
This isn't true for any RAID 5 implementation I'm familiar with. The
parity stripe is exactly the same size as the data stripes it shares space
with. But this isn't a real important point, since most RAID arrays write
8k to 128k at a time. Note that it's not a CRC (Cyclic Redundancy Check)
that gets written, but straight XOR, hence no space savings.
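Scott's point is easy to see in a few lines of Python (a toy byte-wise model, not anything a real controller runs): the parity stripe is a plain XOR of the data stripes, so it is exactly as large as one data stripe, and XOR-ing the survivors with the parity rebuilds any single lost stripe.

```python
# Three toy "data stripes"; real arrays work on 8k-128k chunks.
stripes = [bytes([1, 2, 3]), bytes([16, 32, 48]), bytes([255, 0, 15])]

def xor_stripes(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

parity = xor_stripes(stripes)
assert len(parity) == len(stripes[0])      # same size: no space savings

# Lose stripe 1; XOR the survivors with the parity to reconstruct it.
recovered = xor_stripes([stripes[0], stripes[2], parity])
assert recovered == stripes[1]
```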
Roughly (depending on implementation and number of disks), every 3 bytes
written requires 4 bytes of disk IO. RAID5 is the fastest from an
algorithmic standpoint. There are some gotchas: RAID5 implemented by
hardware is faster than RAID5 implemented by the OS, simply because the
controller on the SCSI card acts like a parallel processor.
This is most definitely not always true, even given equal hardware specs
(i.e. number and type of drives / interfaces all the same).
My old AMI Megaraid card with 3 Ultra Wide SCSI ports can generate 64 Megs
of parity data per second. My Celeron 1.1GHz machine can generate
2584 Megs of parity data per second. The load on the CPU under VERY
heavy reads and writes is about 0.3% cpu, and the max throughput on reads
on a RAID array of 4 VERY old (non-ultra non-wide 7200RPM) 2 Gig drives is
about 33 Megs a second read speed.
Same setup with 7200 RPM 4 Gig Ultra-narrow drives on an AMI RAID can read
at about 14 Megabytes a second.
The most important part of fast RAID is the drives first, interface
second, and hardware versus software raid last. While it was often true
in the dark past of 33 Megahertz CPUs and such that hardware raid was
always faster, it is often much better to spend the extra money a RAID
controller would cost you and just buy more drives or cheaper controllers
or memory or CPUs.
Generally speaking, I've found RAID5 with 4 or fewer drives to be about
even with RAID1, while RAID5 with 6 or more drives quickly starts to
outrun a two drive mirror set. This is especially true under heavy
parallel access.
On a subject no one's mentioned yet, >2 drives in a RAID1 setup.
I've done some testing with >2 drives in a mirror (NOT 1+0 or 0+1, just
RAID1 with >2 drives) under Linux, and found that if you are doing 90%
reads then it's also a good solution, but for most real world database
apps, it doesn't really help a lot.
What is RAID1 with > 2 drives???
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Scott Marlowe
Sent: Thursday, May 30, 2002 5:45 PM
To: terry@greatgulfhomes.com
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Scaling with memory & disk planning
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:
What is RAID1 with > 2 drives???
Just that: a mirror set that writes all changes to all drives.

It costs more for writes but runs faster for reads, so it's not a good
idea in a heavily written environment, but it really flies for heavy-read
setups, like data mining or CRM stuff. Plus, a single drive failure is no
big deal; you just replace it at your leisure, since you're still fully
redundant.
This means you could do crazy things like build 9 disks into a RAID 0+1 by
making three sets of RAID 1 with three drives, and then put all three
together to make the RAID 0. Very good read speeds under heavy loads, and
fairly good write speeds as well.
You could run a fairly large database on a setup of nine 36 Gig Ultra SCSI
drives like this, and get some nice throughput, and have it be VERY
redundant. I.e., ANY two drives could fail and you'd still be up, and if
the two drives that failed were in different RAID1s, you'd still be fully
redundant.
Crazy stuff.
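Scott's nine-drive claim is easy to sanity-check with a toy script (Python purely for illustration; the drive numbering is made up). The striped set stays up as long as every three-way mirror group still has at least one live drive, so any two failures are survivable.

```python
# Nine drives as three three-way mirror groups, striped together (0+1).
from itertools import combinations

groups = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]

def array_up(failed):
    """The stripe survives if each mirror group keeps a live drive."""
    return all(group - failed for group in groups)

# ANY two drives can fail and the array stays up:
assert all(array_up(set(pair)) for pair in combinations(range(9), 2))

# It takes all three drives of one mirror group to lose the array:
assert not array_up({3, 4, 5})
```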
Jean-Luc Lachance writes:
I think your understanding of RAID 5 is wrong also.
For a general N disk RAID 5 the process is:
1) Read sector
2) XOR with data to write
3) Read parity sector
4) XOR with result above
5) Write data
6) Write parity
Yes, generally. There are a couple of tricks you can do to help get
around this, though.
One, which works very nicely when doing sequential writes, is to attempt
to hold off on the write until you collect an entire stripe's worth
of data. Then you can calculate the parity based on what's in memory,
and write the new blocks across all of the disks without worrying
about what was on them before. 3ware's Escalade IDE RAID controllers
(the 3W-7x50 series, anyway) do this. Their explanation is at
http://www.3ware.com/NewFaq/general_operating_and_troubleshooting.htm#R5_Fusion_Explained
Another tactic is just to buffer entire stripes. Baydel does this
with their disk arrays, which are actually RAID-3, not RAID-5.
(Since they do only full-stripe reads and writes, it doesn't really
make any difference which they use.) You want a fair amount of RAM
in your controller for buffering in this case, but it keeps the
computer's "read, modify, write" cycle on one block from turning
into "read, read, modify, write".
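The two write paths can be contrasted in a toy Python model (one-byte "sectors", purely illustrative; real controllers work on whole stripes of kilobytes): the six-step small write costs two reads plus two writes per block, while a buffered full-stripe write computes parity in memory and does no reads at all.

```python
from functools import reduce
from operator import xor

def raid5_small_write(old_data, old_parity, new_data):
    """Steps 1-6 above: two reads already done, two XORs in the
    controller, then two writes (new data, new parity) pending."""
    return old_parity ^ old_data ^ new_data

def raid5_full_stripe_write(new_blocks):
    """Whole stripe buffered: parity from memory, zero disk reads,
    one write per data drive plus one for parity."""
    return reduce(xor, new_blocks)

a, b, c = 0x11, 0x22, 0x44
parity = a ^ b ^ c

# Small write of block b: 2 reads + 2 writes on disk.
parity = raid5_small_write(b, parity, 0x99)
assert parity == a ^ 0x99 ^ c

# Full-stripe write of fresh data: no reads at all.
assert raid5_full_stripe_write([a, b, c]) == 0x77
```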
Terry Fielder writes:
My simplification was intended; anyway, it still equates to the same,
because in a performance machine (lots of memory) reads are (mostly)
pulled from cache (not disk IO). So the real cost is disk writes, and
2 = 2.
Well, it really depends on your workload. If you're selecting stuff
almost completely randomly scattered about a large table (like the
25 GB one I'm dealing with right now), it's going to be a bit pricy
to get hold of a machine with enough memory to cache that effectively.
Kurt Gunderson writes:
] Likewise, when writing to the mirrored pair (and using 'write-through',
] never 'write-back'), the controller will pass along the 'data written'
] flag to the CPU when the first disk of the pair writes the data. The
] second will sync eventually but the controller need not wait for both.
I hope not! I think that controller ought to wait for both to be written,
because otherwise you can have this scenario:
1. Write of block X scheduled for drives A and B.
2. Block written to drive A. Still pending on drive B.
3. Controller returns "block committed to stable storage" to application.
4. Power failure. Pending write to drive B is never written.
Now, how do you know, when the system comes back up, that you have
a good copy of the block on drive A, but not on drive B?
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC