SCSI vs. IDE performance test
http://hardware.devchannel.org/hardwarechannel/03/10/20/1953249.shtml?tid=20&tid=38&tid=49
--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA
I can't make you have an abortion, but you can *make* me pay
child support for 18 years? However, if I want the child (and
all the expenses that entails) for the *rest*of*my*life*, and you
don't want it for 9 months, tough luck???
The SCSI improvement over IDE seems overrated in the test. I would have
expected at most a 30% improvement. Other reviews seem to point out that IDE
performs just as well or better.
See Tom's hardware:
http://www20.tomshardware.com/storage/20020305/index.html
Stephen
-----Original Message-----
From: Stephen [mailto:jleelim@xxxxxx.com]
Sent: Wednesday, October 22, 2003 9:02 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] SCSI vs. IDE performance test
The SCSI improvement over IDE seems overrated in the test. I
would have expected at most a 30% improvement. Other reviews
seem to point out that IDE performs just as well or better.
See Tom's hardware:
http://www20.tomshardware.com/storage/20020305/index.html
My own tests show that 15K RPM Ultra320 SCSI drives are considerably
faster than any IDE storage.
This ATA drive:
http://www.wdc.com/en/products/WD360GD.asp
performs as well as or better than many SCSI drives and is not terribly
expensive. It is therefore a very good choice if price/performance
is more important than absolute performance.
But if you need absolute horsepower, then one of these (or other 15K
Ultra320 equivalent) won't be beaten:
http://www.storagereview.com/articles/200304/200304068C073x0_1.html
Unwrap this link (if your newsreader folds it) and click on it for hard
drive performance:
http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10&testbedID=3&osID=4&raidconfigID=1&numDrives=1&devID_0=232&devID_1=237&devID_2=213&devID_3=221&devID_4=216&devID_5=249&devID_6=250&devCnt=7
The important part for databases is the "Server Suite".
On Wed, 2003-10-22 at 11:01, Stephen wrote:
The SCSI improvement over IDE seems overrated in the test. I would have
expected at most a 30% improvement. Other reviews seem to point out that IDE
performs just as well or better.
See Tom's hardware:
http://www20.tomshardware.com/storage/20020305/index.html
When TCQ becomes a reality in IDE drives, they'll have a fighting
chance, but the slower seek times and rotational speeds will still
do them in.
Also, does an 8MB cache *really* make that much of a difference?
After all, it can only cache 0.0067% of a 120GB drive, and 0.00267%
of the new 300GB disks.
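As a rough check of those percentages, here is a minimal Python sketch. It assumes decimal units (1 GB = 1000 MB), and the helper name cache_fraction_percent is made up for this example.

```python
def cache_fraction_percent(cache_mb: float, drive_gb: float) -> float:
    """Return the on-drive cache size as a percentage of total capacity."""
    drive_mb = drive_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return cache_mb / drive_mb * 100


if __name__ == "__main__":
    for drive_gb in (120, 300):
        pct = cache_fraction_percent(8, drive_gb)
        print(f"8MB cache on a {drive_gb}GB drive: {pct:.5f}%")
```

Under that assumption the results agree with the figures above to the precision quoted.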
Speaking of which, that 300GB HDD sounds like a dream for near-
line storage, and even for nightly backups, if it is ever put in
SBB-type packaging.
http://www20.tomshardware.com/storage/20031008/index.html
Imagine a scheme where you rapidly pg_dump to the 300GB drive,
then, at leisure, tar the dump file to tape. Stripe a few together,
and keep a month of backups on-line for quick recovery, along with
the tape archives, in case the stripeset gets wasted, too.
--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA
"Adventure is a sign of incompetence"
Stephanson, great polar explorer
Dann Corbit wrote:
Unwrap this link (if your newsreader folds it) and click on it for hard
drive performance:
http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10&testbedID=3&osID=4&raidconfigID=1&numDrives=1&devID_0=232&devID_1=237&devID_2=213&devID_3=221&devID_4=216&devID_5=249&devID_6=250&devCnt=7
The important part for database is "Server Suite"
Fairly old data, but it shows AMAZING differences in head seek time. I
didn't know head seeks were below 8ms for anything, even today. Also,
from what I've read, weren't the SATA drives of those days nonexistent?
The earliest SATA drives I've read about were just SATA interfaces on
OLDER IDE hardware - the manufacturers had not really signed up to the
concept enough to put their good hardware underneath the interface.
--
"You are behaving like a man",
is an insult from some women,
a compliment from an good woman.
I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
drive here:
The results vary quite a bit, and it seems the file system you use
can make a huge difference.
SCSI is obviously faster, but a 20% performance gain for 5x the cost is
only worth it for a very small percentage of people, I would think.
--
Best Regards,
Mike Benoit
Mike Benoit wrote:
I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
drive here:
The results vary quite a bit, and it seems the file system you use
can make a huge difference.
SCSI is obviously faster, but a 20% performance gain for 5x the cost is
only worth it for a very small percentage of people, I would think.
Did you turn off the IDE write cache? If not, the SCSI drive is
reliable in case of OS failure, while the IDE is not.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
It seems to me file system journaling should fix the whole problem by giving
you a record of what was actually committed to disk and what was not. I must
not understand journaling correctly. Can anyone explain to me how
journaling works?
----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: <mikeb@netnation.com>
Cc: "Stephen" <jleelim@xxxxxx.com>; <pgsql-general@postgresql.org>
Sent: Monday, October 27, 2003 12:14 PM
Subject: Re: [GENERAL] SCSI vs. IDE performance test
"Rick Gigger" <rick@alpinenetworking.com> writes:
It seems to me file system journaling should fix the whole problem by giving
you a record of what was actually commited to disk and what was not.
Nope, a journaling FS has exactly the same problem Postgres does
(because the underlying "WAL" concept is the same: write the log entries
before you change the files they describe). If the drive lies about
write order, the FS can be screwed just as badly. Now the FS code might
have a low-level way to force write order that Postgres doesn't have
access to ... but simply uttering the magic incantation "journaling file
system" will not make this problem disappear.
regards, tom lane
ahhh. "lies about write order" is the phrase that I was looking for. That
seemed to make sense but I didn't know if I could go directly from "lying
about fsync" to that. Obviously I don't understand exactly what fsync is
doing. I assume this means that if you were to turn fsync off you would get
considerably better performance but introduce the possibility of corrupting
the files in your database.
Thank you. This makes a lot more sense now.
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Rick Gigger" <rick@alpinenetworking.com>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, October 27, 2003 3:39 PM
Subject: Re: [GENERAL] SCSI vs. IDE performance test
Tom, this discussion brings up something that's been bugging me about the
recommendations for getting more performance out of PG.. in particular the
one that suggests you put your WAL files on a different physical drive from
the database.
Consider the following scenario:
Database on drive1
WAL on drive2
1. PG write of some sort occurs.
2. PG writes out the WAL.
3. PG writes out the data.
4. PG updates the WAL to reflect data actually written.
5. System crashes/reboots/whatever.
With the DB and the WAL on different drives, it seems possible to me that
drive2 could've fsync()'d or otherwise properly written all of the data
out, but drive1 could have failed somewhere along the way and not actually
written the data to the DB.
The next time PG is brought up, the WAL would indicate the transaction, as
it were, was a success.. but the data wouldn't actually be there.
In the case of using only one drive, the rollback (from a FS perspective)
couldn't possibly occur in such a way as to leave step 4 as a success, but
step 3 as a failure -- worst case, the data would be written out but the
WAL wouldn't have been updated (rolled back, say, by the FS) and thus PG will
roll back the data itself, or use whatever mechanism it uses to ensure data
integrity is consistent with the WAL.
Am I smoking something here or is this a real, if rare in practice, risk
that occurs when you have the WAL on a different drive than the data is on?
"Rick Gigger" <rick@alpinenetworking.com> writes:
ahhh. "lies about write order" is the phrase that I was looking for. That
seemed to make sense but I didn't know if I could go directly from "lying
about fsync" to that. Obviously I don't understand exactly what fsync is
doing.
What we actually care about is write order: WAL entries have to hit the
platter before the corresponding data-file changes do. Unfortunately we
have no portable means of expressing that exact constraint to the
kernel. We use fsync() (or related constructs) instead: issue the WAL
writes, fsync the WAL file, then issue the data-file writes. This
constrains the write ordering more than is really needed, but it's the
best we can do in a portable Unix application.
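The sequence just described (issue the WAL writes, fsync the WAL file, then issue the data-file writes) can be sketched in Python roughly as follows. This is a minimal illustration and not PostgreSQL's actual code; the function name write_with_wal and the file names are made up for the example.

```python
import os
import tempfile


def write_with_wal(wal_path: str, data_path: str, record: bytes) -> None:
    """Log a record, force the log to disk, and only then touch the data file."""
    # 1. Issue the WAL write.
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(wal_fd, record)
        # 2. fsync the WAL file. This returns once the kernel believes
        #    the drive has completed the write.
        os.fsync(wal_fd)
    finally:
        os.close(wal_fd)

    # 3. Only now issue the data-file write.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(data_fd, record)
    finally:
        os.close(data_fd)


if __name__ == "__main__":
    # Hypothetical file names, used only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        write_with_wal(os.path.join(tmp, "wal.log"),
                       os.path.join(tmp, "table.dat"),
                       b"row 1 = new value\n")
        print("WAL was fsync'd before the data file was written")
```

Note that the fsync forces everything written to the WAL file so far, which is a stronger constraint than "this block before that block".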
The problem is that the kernel thinks fsync is done when the disk drive
reports the writes are complete. When we say a drive lies about this,
we mean it accepts a sector of data into its on-board RAM and then
immediately claims write-complete, when in reality the data hasn't hit
the platter yet and will be lost if power dies before the drive gets
around to writing it.
So we can have a scenario where we think WAL is down to disk and go
ahead with issuing data-file writes. These will also be shoved over to
the drive and stored in its on-board RAM. Now the drive has multiple
sectors pending write in its buffers. If it chooses to write these in
some order other than the order they were given to it, it could write
the data file updates to disk first. If power drops *now*, we lose,
because the data files are inconsistent and there's no WAL entry to tell
us to fix it.
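That failure scenario can be reproduced with a toy model. The sketch below is only an illustration: the LyingDrive class, its methods, and the sector names "wal" and "data" are all invented here and do not correspond to any real drive interface.

```python
class LyingDrive:
    """Toy drive that reports write-complete before data reaches the platter."""

    def __init__(self) -> None:
        self.platter = {}  # sector name -> bytes that are really on disk
        self.cache = []    # (sector, bytes) pairs held in on-board RAM

    def write(self, sector: str, data: bytes) -> bool:
        # Accept the data into on-board RAM and claim completion at once.
        self.cache.append((sector, data))
        return True

    def flush_one(self, index: int) -> None:
        # The drive picks which pending sector to write, in any order it likes.
        sector, data = self.cache.pop(index)
        self.platter[sector] = data

    def power_loss(self) -> None:
        # Anything still in on-board RAM is gone.
        self.cache.clear()


def crash_scenario() -> dict:
    drive = LyingDrive()
    assert drive.write("wal", b"log: update row 1")   # we believe WAL is on disk
    assert drive.write("data", b"row 1 = new value")  # so we write the data file
    drive.flush_one(1)   # drive happens to write the data-file sector first
    drive.power_loss()   # power drops before the WAL sector is written
    return drive.platter


if __name__ == "__main__":
    print(sorted(crash_scenario()))  # only the data-file sector survived
```

After the simulated power loss the platter holds the data-file change but no WAL entry describing it.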
Got it? It's really the combination of "lie about write completion" and
"write pending sectors out of order" that can mess things up.
The reason IDE drives have to do this for reasonable performance is that
the IDE interface is single-threaded: you can only have one read or
write in process at a time, from the point of view of the
kernel-to-drive interface. But in order to schedule reads and writes in
a way that makes sense physically (minimizes seeks), the drive has to
have multiple read and write requests pending that it can pick and
choose from. The only possibility to do that in the IDE world is to
let a write "complete" in interface terms before it's really done ...
that is, lie.
The reason SCSI drives do *not* do this is that the SCSI interface is
logically multi-threaded: you can have multiple reads or writes pending
at once. When you want to write on a SCSI drive, you send over a
command that says "write this data at this sector". Sometime later the
drive sends back a status report "yessir boss, I done did that write".
Similarly, a read consists of a command "read this sector", followed
sometime later by a response that delivers the requested data. But you
can send other commands to read or write other sectors meanwhile, and
the drive is free to reorder them to suit its convenience. So in the
SCSI world, there is no need for the drive to lie in order to do its own
read/write scheduling. The kernel knows the truth about whether a given
sector has hit disk, and so it won't conclude that the WAL file has been
completely fsync'd until it really is all down to the platter.
This is also why SCSI disks shine on the read side when you have lots of
processes doing reads: in an IDE drive, there is no way for the drive to
satisfy read requests in any order but the one they're issued in. If the
kernel guesses wrong about the best ordering for a set of read requests,
then everybody waits for the seeks needed to get the earlier processes'
data. A SCSI drive can fetch the "nearest" data first, and then that
requester is freed to make progress in the CPU while the other guys wait
for their longer seeks. There's no win here with a single active user
process (since it probably wants specific data in a specific order), but
it's a huge win if lots of processes are making unrelated read requests.
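To put rough numbers on that, here is a small Python sketch comparing head travel when requests are served in issue order versus nearest-first. It is a simplification under stated assumptions: track numbers stand in for seek distance, rotational position is ignored, and the function names and the request list are invented for the example.

```python
def nearest_first_order(head: int, requests: list) -> list:
    """Serve pending requests by repeatedly picking the track closest to the head."""
    pending = list(requests)
    order = []
    pos = head
    while pending:
        nxt = min(pending, key=lambda track: abs(track - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


def total_seek(head: int, order: list) -> int:
    """Total head travel (in tracks) when serving requests in the given order."""
    total = 0
    pos = head
    for track in order:
        total += abs(track - pos)
        pos = track
    return total


if __name__ == "__main__":
    head = 50
    issued = [95, 52, 90, 47]  # four unrelated processes, in issue order
    reordered = nearest_first_order(head, issued)
    print("issue order:  ", issued, "travel =", total_seek(head, issued))
    print("nearest first:", reordered, "travel =", total_seek(head, reordered))
```

With this made-up workload, serving in issue order moves the head 169 tracks, while nearest-first moves it 55, and the requests near the head finish almost immediately instead of waiting behind the long seeks.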
Clear now?
(In a previous lifetime I wrote SCSI disk driver code ...)
regards, tom lane
On Mon, 2003-10-27 at 12:44, Mike Benoit wrote:
I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
drive here:
The results vary quite a bit, and it seems the file system you use
can make a huge difference.
SCSI is obviously faster, but a 20% performance gain for 5x the cost is
only worth it for a very small percentage of people, I would think.
Running bonnie++ in 4 or 5 parallel runs would be interesting, to
see how IDE & SCSI compare in a multi-user environment.
--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA
"Why should we not accept all in favor of woman suffrage to our
platform and association even though they be rabid pro-slavery
Democrats."
Susan B. Anthony, _History_of_Woman_Suffrage_
http://www.ifeminists.com/introduction/essays/introduction.html
Thanks! Now it is much, much more clear. It leaves me with a few
additional questions though.
Question 1:
"we have no portable means of expressing that exact constraint to the
kernel"
Does this mean that specific operating systems have a better way of dealing
with this? Which ones and how? I'm guessing that it couldn't make too big
a performance difference or it would probably be implemented already.
Question 2:
Do serial ATA drives suffer from the same issue?
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Rick Gigger" <rick@alpinenetworking.com>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, October 27, 2003 5:05 PM
Subject: Re: [GENERAL] SCSI vs. IDE performance test
On Mon, 2003-10-27 at 17:18, Rick Gigger wrote:
ahhh. "lies about write order" is the phrase that I was looking for. That
seemed to make sense but I didn't know if I could go directly from "lying
about fsync" to that. Obviously I don't understand exactly what fsync is
doing. I assume this means that if you were to turn fsync off you would get
considerably better performance but introduce the possibility of corrupting
the files in your database.
Yes.
There was a recent thread (in -general or -performance) about
putting the WAL files on a different disk and changing
wal_sync_method to open_sync (or open_datasync, I don't remember which).
This allows the device(s) holding the database to
run asynchronously, while the WAL stays synchronous, for safety.
--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA
Some former UNSCOM officials are alarmed, however. Terry Taylor,
a British senior UNSCOM inspector from 1993 to 1997, says the
figure of 95 percent disarmament is "complete nonsense because
inspectors never learned what 100 percent was. UNSCOM found a
great deal and destroyed a great deal, but we knew [Iraq's] work
was continuing while we were there, and I'm sure it continues,"
says Mr. Taylor, now head of the Washington
http://www.csmonitor.com/2002/0829/p01s03-wosc.html
"Rick Gigger" <rick@alpinenetworking.com> writes:
"we have no portable means of expressing that exact constraint to the
kernel"
Does this mean that specific operating systems have a better way of dealing
with this? Which ones and how?
I'm not aware of any that offer a way of expressing "write these
particular blocks before those particular blocks". It doesn't seem like
it would require rocket scientists to devise such an API, but no one's
got round to it yet. Part of the problem is that the issue would have
to be approached at multiple levels: there is no point in offering an
OS-level API for this when the hardware underlying the bus-level API
(IDE) is doing its level best to sabotage the entire semantics.
Do serial ATA drives suffer from the same issue?
Um, not an expert, but I think ATA is the same as IDE except for bus
width and transfer rate. If either one allows for multiple concurrent
read/write transactions I'll be very surprised.
regards, tom lane
On Tue, Oct 28, 2003 at 12:17:59AM -0500, Tom Lane wrote:
"Rick Gigger" <rick@alpinenetworking.com> writes:
Do serial ATA drives suffer from the same issue?
Um, not an expert, but I think ATA is the same as IDE except for bus
width and transfer rate. If either one allows for multiple concurrent
read/write transactions I'll be very surprised.
Well, some googling around seems to indicate that Serial ATA I/ATA-6 has
Tagged Command Queueing (TCQ), which adds this feature specifically.
Whether it is a mandatory part of the spec I don't know.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
"All that is needed for the forces of evil to triumph is for enough good
men to do nothing." - Edmond Burke
"The penalty good people pay for not being interested in politics is to be
governed by people worse than themselves." - Plato
Martijn van Oosterhout <kleptog@svana.org> writes:
Well, some googling around seems to indicate that Serial ATA I/ATA-6 has
Tagged Command Queueing (TCQ) which is adding this feature specifically.
Whether it is a mandatory part of the spec I don't know.
Yeah? If so, and *if fully implemented* on both sides of the interface,
this would eliminate the architectural advantages I was just sketching
for SCSI. I can't claim to be up on what's happening in the IDE/ATA
world though...
regards, tom lane
Allen Landsidel <all@biosys.net> writes:
Tom, this discussion brings up something that's been bugging me about the
recommendations for getting more performance out of PG.. in particular the
one that suggests you put your WAL files on a different physical drive from
the database.
...
With the DB and the WAL on different drives, it seems possible to me that
drive2 could've fsync()'d or otherwise properly written all of the data
out, but drive1 could have failed somewhere along the way and not actually
written the data to the DB.
Drive failure, in terms of losing something the drive claimed it had
written successfully, is not something that we can protect against.
For that, you go to your backup tapes. I don't see that it makes any
difference whether the database is spread across one drive or several;
you could still have a scenario where the claimed-complete write to
a data file failed to happen and then we recorded a checkpoint anyway.
Now, if the data drive fails to write and we can detect that, then we're
OK, because we won't record a checkpoint. We can redo the write based
on the contents of WAL after the problem's been fixed.
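The role of the checkpoint here can be shown with a toy redo loop. This is a heavily simplified sketch and not PostgreSQL's recovery code: the record layout (sequence number, key, value), the dictionary standing in for the data files, and the function name recover are all assumptions made for the example.

```python
def recover(data_file: dict, wal: list, checkpoint_seq: int) -> dict:
    """Re-apply every WAL record logged after the last recorded checkpoint."""
    recovered = dict(data_file)
    for seq, key, value in wal:
        if seq > checkpoint_seq:
            recovered[key] = value  # redo: applying it twice gives the same result
    return recovered


if __name__ == "__main__":
    wal = [(1, "a", "x"), (2, "b", "y")]
    on_disk = {"a": "x"}  # the data-file write for record 2 never happened

    # Failure detected: the checkpoint stays at 1, so record 2 is redone.
    print(recover(on_disk, wal, checkpoint_seq=1))

    # Drive claimed success: the checkpoint advanced to 2, so nothing is redone.
    print(recover(on_disk, wal, checkpoint_seq=2))
```

In the first case the missing write is restored from the log. In the second, the checkpoint was recorded on the strength of a write that never happened, and the log is never consulted for it.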
This is another reason why the IDE lie-about-write-completion behavior
is a Bad Idea: if the drive accepts data and then later has a problem
writing it, there is no way for it to report that fact --- and it's
too late anyhow since we've already taken other actions on the
assumption that the write is done. I'm not at all sure what IDE drives
do when they have a failure writing out cached buffers; anyone have
experience with that?
regards, tom lane