Maximum transaction rate
I'm using postgresql 8.3.6 through JDBC, and trying to measure the maximum
transaction rate on a given Linux box. I wrote a test program that:
- Creates a table with two int columns and no indexes,
- loads the table through a configurable number of threads, with each
transaction writing one row and then committing, (auto commit is
false), and
- reports transactions/sec.
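A back-of-the-envelope check shows why rates in the thousands look wrong for this workload. Assuming the box has a single commodity 7200 RPM drive (an assumption, not stated in the thread), each honestly synced commit has to wait for the platter:

```python
# Sanity check: an honest synchronous commit rate is bounded by disk rotation.
# Assumption (not stated in the thread): a commodity 7200 RPM drive, no
# battery-backed cache.
rpm = 7200
rotations_per_sec = rpm / 60        # 120 rotations per second

# With fsync honored and the drive's write cache off, each commit waits for
# the WAL block to reach the platter, so a single thread gets at most
# roughly one commit per rotation.
max_tps_single_thread = rotations_per_sec

print(f"~{max_tps_single_thread:.0f} commits/sec upper bound per thread")
```

Anything well above that per thread, on such a disk, means some layer is acknowledging writes before they are durable.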
The postgres configuration regarding syncing is standard: fsync = on,
synchronous_commit = on, wal_sync_method = fsync. My linux kernel is
2.6.27.19-78.2.30.fc9.i686.
The transaction rates I'm getting seem way too high: 2800-2900 with one thread,
5000-7000 with ten threads. I'm guessing that writes aren't really reaching the
disk. Can someone suggest how to figure out where, below postgres, someone is
lying about writes reaching the disk?
Jack
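One way to look below Postgres is to take it out of the picture entirely and time raw write+fsync round-trips on the same filesystem. This is only a sketch (the file location, write size, and iteration count are arbitrary), but if it also reports thousands of syncs per second on a single spinning disk, some layer under the OS is absorbing the flush:

```python
import os
import time
import tempfile

def fsync_rate(iterations=500):
    """Measure write+fsync round-trips per second on the current filesystem."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)   # small write, like a WAL record
            os.fsync(fd)               # ask the OS to push it to stable storage
        elapsed = time.time() - start
        return iterations / elapsed
    finally:
        os.close(fd)
        os.unlink(path)

print(f"{fsync_rate():.0f} write+fsync round-trips/sec")
```

On an honest single 7200 RPM disk this should come out on the order of a hundred per second; thousands suggest a cache somewhere is lying about durability.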
Jack Orenstein <jack.orenstein@hds.com> writes:
The transaction rates I'm getting seem way too high: 2800-2900 with
one thread, 5000-7000 with ten threads. I'm guessing that writes
aren't really reaching the disk. Can someone suggest how to figure out
where, below postgres, someone is lying about writes reaching the
disk?
AFAIK there are two trouble sources in recent Linux machines: LVM and
the disk drive itself. LVM is apparently broken by design --- it simply
fails to pass fsync requests. If you're using it you have to stop.
(Which sucks, because it's exactly the kind of thing DBAs tend to want.)
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.
regards, tom lane
On Fri, 6 Mar 2009, Tom Lane wrote:
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.
There's a discussion of this in the docs now,
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html
hdparm -I lets you check whether write caching is on, and hdparm -W lets you
toggle it off. That's for ATA disks; for SCSI disks you can use sdparm
instead, though for those it's usually something you adjust more permanently
in the controller card's configuration or BIOS.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 6 Mar 2009, Greg Smith wrote:
On Fri, 6 Mar 2009, Tom Lane wrote:
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.
There's a discussion of this in the docs now,
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html
How does turning off write caching on the disk stop the problem with LVM?
It still seems like you have to get the data out of the OS buffer, and if
fsync() doesn't do that for you....
On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote:
On Fri, 6 Mar 2009, Greg Smith wrote:
On Fri, 6 Mar 2009, Tom Lane wrote:
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.
There's a discussion of this in the docs now,
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html
How does turning off write caching on the disk stop the problem with LVM? It
still seems like you have to get the data out of the OS buffer, and if
fsync() doesn't do that for you....
I think he was saying otherwise (if you're not using LVM and you still
have this super high transaction rate) you'll need to turn off the
drive's write caches. I kinda wondered at it for a second too.
On Fri, 6 Mar 2009, Ben Chobot wrote:
How does turning off write caching on the disk stop the problem with LVM?
It doesn't. Linux LVM is awful and broken, I was just suggesting more
details on what you still need to check even when it's not involved.
Scott Marlowe wrote:
On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote:
On Fri, 6 Mar 2009, Greg Smith wrote:
On Fri, 6 Mar 2009, Tom Lane wrote:
Otherwise you need to reconfigure your drive to not cache writes.
I forget the incantation for that but it's in the PG list archives.
There's a discussion of this in the docs now,
http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html
How does turning off write caching on the disk stop the problem with LVM? It
still seems like you have to get the data out of the OS buffer, and if
fsync() doesn't do that for you....
I think he was saying otherwise (if you're not using LVM and you still
have this super high transaction rate) you'll need to turn off the
drive's write caches. I kinda wondered at it for a second too.
And I'm still wondering. The problem with LVM, AFAIK, is missing support
for write barriers. Once you disable the write-back cache on the disk,
you no longer need write barriers. So I'm missing something, what else
does LVM do to break fsync()?
It was my understanding that disabling disk caches was enough.
.TM.
Marco Colombo <pgsql@esiway.net> writes:
And I'm still wondering. The problem with LVM, AFAIK, is missing support
for write barriers. Once you disable the write-back cache on the disk,
you no longer need write barriers. So I'm missing something, what else
does LVM do to break fsync()?
I think you're imagining that the disk hardware is the only source of
write reordering, which isn't the case ... various layers in the kernel
can reorder operations before they get sent to the disk.
regards, tom lane
Tom Lane wrote:
Marco Colombo <pgsql@esiway.net> writes:
And I'm still wondering. The problem with LVM, AFAIK, is missing support
for write barriers. Once you disable the write-back cache on the disk,
you no longer need write barriers. So I'm missing something, what else
does LVM do to break fsync()?
I think you're imagining that the disk hardware is the only source of
write reordering, which isn't the case ... various layers in the kernel
can reorder operations before they get sent to the disk.
regards, tom lane
You mean some layer (LVM) is lying about the fsync()?
write(A);
fsync();
...
write(B);
fsync();
...
write(C);
fsync();
you mean that the process may be awakened after the first fsync() while
A is still somewhere in OS buffers and not sent to disk yet, so it's
possible that B gets to the disk BEFORE A. And if the system crashes,
A never hits the platters while B (possibly) does. Is it this you
mean by "write reordering"?
But doesn't this break any application with transactional-like behavior,
such as sendmail? The problem being 3rd parties, if sendmail declares
"ok, I saved the message" (*after* a fsync()) to the SMTP client,
it's actually lying 'cause the message hasn't hit the platters yet.
Same applies to IMAP/POP server, say. Well, it applies to anything
using fsync().
I mean, all this with disk caches in write-thru modes? It's the OS
lying, not the disks?
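The pattern Marco describes, acknowledging only after fsync() returns, is the standard durable-accept idiom for mail servers and anything transactional. A minimal sketch of it (the function and file names are illustrative, not sendmail's actual code):

```python
import os

def accept_message(spool_dir, name, payload):
    """Durably store a message before acknowledging it, sendmail-style.
    If any layer below fsync() lies, the 'OK' can precede the platters."""
    path = os.path.join(spool_dir, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)            # the message contents must be stable...
    finally:
        os.close(fd)
    # ...and so must the directory entry that names the file.
    dfd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
    return "250 OK"             # only now is it safe to tell the client
```

If fsync() silently returns before the data is on stable storage, every application built on this idiom gives the same false acknowledgement.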
Wait, this breaks all journaled FSes as well, a DM device is just
a block device to them, if it's lying about synchronous writes the
whole purpose of the journal is defeated... I find it hard to
believe, I have to say.
.TM.
Marco Colombo <pgsql@esiway.net> writes:
You mean some layer (LVM) is lying about the fsync()?
Got it in one.
regards, tom lane
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:
Marco Colombo <pgsql@esiway.net> writes:
You mean some layer (LVM) is lying about the fsync()?
Got it in one.
I wouldn't think this would be a problem with the proper battery backed
raid controller correct?
Joshua D. Drake
--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
On Fri, 13 Mar 2009, Joshua D. Drake wrote:
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:
Marco Colombo <pgsql@esiway.net> writes:
You mean some layer (LVM) is lying about the fsync()?
Got it in one.
I wouldn't think this would be a problem with the proper battery backed
raid controller correct?
It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.
On Fri, 2009-03-13 at 11:17 -0700, Ben Chobot wrote:
On Fri, 13 Mar 2009, Joshua D. Drake wrote:
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote:
Marco Colombo <pgsql@esiway.net> writes:
You mean some layer (LVM) is lying about the fsync()?
Got it in one.
I wouldn't think this would be a problem with the proper battery backed
raid controller correct?
It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.
A BBU controller is about more than that. It is also supposed to be
about data integrity. The ability to have unexpected outages and have
the drives stay consistent because the controller remembers the state
(if that is a reasonable way to put it).
Joshua D. Drake
On Fri, 13 Mar 2009, Joshua D. Drake wrote:
It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.
A BBU controller is about more than that. It is also supposed to be
about data integrity. The ability to have unexpected outages and have
the drives stay consistent because the controller remembers the state
(if that is a reasonable way to put it).
Of course. But if you can't reliably flush the OS buffers (because, say,
you're using LVM so fsync() doesn't work), then you can't say what
actually has made it to the safety of the raid card.
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote:
On Fri, 13 Mar 2009, Joshua D. Drake wrote:
Of course. But if you can't reliably flush the OS buffers (because, say,
you're using LVM so fsync() doesn't work), then you can't say what
actually has made it to the safety of the raid card.
Good point. So the next question of course is, does EVMS do it right?
This is actually a pretty significant issue.
Joshua D. Drake
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote:
On Fri, 13 Mar 2009, Joshua D. Drake wrote:
It seems to me that all you get with a BBU-enabled card is the ability to
get bursts of writes out of the OS faster. So you still have the problem,
it's just less likely to be encountered.
A BBU controller is about more than that. It is also supposed to be
about data integrity. The ability to have unexpected outages and have
the drives stay consistent because the controller remembers the state
(if that is a reasonable way to put it).
Of course. But if you can't reliably flush the OS buffers (because, say,
you're using LVM so fsync() doesn't work), then you can't say what
actually has made it to the safety of the raid card.
Wait, actually a good BBU RAID controller will disable the cache on the
drives. So everything that is cached is already on the controller rather
than on the drives themselves.
Or am I missing something?
Joshua D. Drake
On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote:
Wait, actually a good BBU RAID controller will disable the cache on the
drives. So everything that is cached is already on the controller rather
than on the drives themselves.
Or am I missing something?
Maybe I'm missing something, but a BBU controller moves the "safe
point" from the platters to the controller, but it doesn't move it all
the way into the OS.
So, if the software calls fsync, but fsync doesn't actually push the
data to the controller, you are still at risk... right?
On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote:
On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote:
Wait, actually a good BBU RAID controller will disable the cache on the
drives. So everything that is cached is already on the controller rather
than on the drives themselves.
Or am I missing something?
Maybe I'm missing something, but a BBU controller moves the "safe point"
from the platters to the controller, but it doesn't move it all the way into
the OS.
So, if the software calls fsync, but fsync doesn't actually push the data to
the controller, you are still at risk... right?
Ding!
Scott Marlowe wrote:
On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote:
So, if the software calls fsync, but fsync doesn't actually push the data to
the controller, you are still at risk... right?
Ding!
I've been doing some googling, and now I'm not sure that not supporting
barriers implies not supporting (or lying about) blkdev_issue_flush(). It
seems that it's pretty common (and well-defined) for block devices to report
-EOPNOTSUPP on BIO_RW_BARRIER requests; device mapper apparently falls into
this category.
See:
http://lkml.org/lkml/2007/5/25/71
this is an interesting discussion on barriers and flushing.
It seems to me that PostgreSQL needs both ordered and synchronous
writes, maybe at different times (not that EVERY write must be both ordered
and synchronous).
You can emulate ordered writes with synchronous ones, although at a price.
You can't emulate synchronous writes with just barriers.
OPTIMAL: write-barrier-write-barrier-write-barrier-flush
SUBOPTIMAL: write-flush-write-flush-write-flush
As I understand it, fsync() should always issue a real flush: it's unrelated
to the barriers issue.
There's no API to issue ordered writes (or barriers) at user level,
AFAIK. (Uhm... O_DIRECT, maybe implies that?)
FS code may internally issue barrier requests to the block device for
its own purposes (e.g. journal updates), but there's no userland API for
that.
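Since userland has only fsync(), ordering between writes can be guaranteed only by the SUBOPTIMAL pattern above: pay a full flush where a barrier would have sufficed. A minimal sketch of that emulation (function name and file layout are illustrative):

```python
import os

def ordered_writes(path, chunks):
    """Force on-disk ordering of chunks using only fsync():
    write-flush-write-flush, paying a full cache flush where a
    write barrier would have been enough."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for chunk in chunks:
            os.write(fd, chunk)
            os.fsync(fd)   # chunk N is durable before chunk N+1 is issued
    finally:
        os.close(fd)
```

This is the journal-style append pattern: record order survives a crash, but every record costs a synchronous flush, which is exactly why barriers are attractive when they work.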
Yet, there's no reference to DM not supporting flush correctly in the
whole thread... actually there are references to the opposite. DM devices
are defined as FLUSHABLE.
Also see:
http://lkml.org/lkml/2008/2/26/41
but it seems to me that all this discussion is under the assumption that
disks have write-back caches.
"The alternative is to disable the disk write cache." says it all.
.TM.
On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote:
Scott Marlowe wrote:
Also see:
http://lkml.org/lkml/2008/2/26/41
but it seems to me that all this discussion is under the assumption that
disks have write-back caches.
"The alternative is to disable the disk write cache." says it all.
If this applies to RAID controller cache as well, then performance is going
to completely tank for users of Linux + PostgreSQL on LVM.
Joshua D. Drake