Solaris Performance (Again)

Started by Mark Kirkwood over 22 years ago, 47 messages, hackers
#1Mark Kirkwood
mark.kirkwood@catalyst.net.nz

This is a well-worn thread title - apologies, but these results seemed
interesting, and hopefully useful in the quest to get better performance
on Solaris:

I was curious to see if the rather uninspiring pgbench performance
obtained from a Sun 280R (see General: ATA Disks and RAID controllers
for database servers) could be improved if more time was spent
tuning.

With the help of a fellow workmate who is a bit of a Solaris guy, we
decided to have a go.

The major performance killer appeared to be mounting the filesystem with
the logging option. The next most significant seemed to be the choice of
sync_method for Pg - the default (open_datasync), which we initially
thought should be the best - appears noticeably slower than fdatasync.

We also tried changing some of the tuneable filesystem options using
tunefs - without any measurable effect.

Are Pg/Solaris folks out there running with logging on and the default
sync_method? Or have most of you been through this already?

Pgbench results (number of clients and transactions per second):

Setup 1: filesystem mounted with logging

Clients   tps
-------------
1         17
2         17
4         22
8         22
16        28
32        32
64        37

Setup 2: filesystem mounted without logging

Clients   tps
-------------
1         48
2         55
4         57
8         62
16        65
32        82
64        95

Setup 3: filesystem mounted without logging, Pg sync_method = fdatasync

Clients   tps
-------------
1         89
2         94
4         95
8         93
16        99
32        115
64        122

Note : The Pgbench runs were conducted using -s 10 and -t 1000 -c 1->64,
2 - 3 runs of each setup were performed (averaged figures shown).

Mark

#2Jeff
threshar@torgo.978.org
In reply to: Mark Kirkwood (#1)
Re: Solaris Performance (Again)

On Wed, 10 Dec 2003 18:56:38 +1300
Mark Kirkwood <markir@paradise.net.nz> wrote:

The major performance killer appeared to be mounting the filesystem
with the logging option. The next most significant seemed to be the
choice of sync_method for Pg - the default (open_datasync), which we
initially thought should be the best - appears noticeably slower than
fdatasync.

Some interesting stuff, I'll have to play with it. Currently I'm pleased
with my solaris performance.

What version of PG?

If it is before 7.4, PG compiles with _NO_ optimization by default, and
that was a huge part of the slowness of PG on Solaris.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/

#3Neil Conway
neilc@samurai.com
In reply to: Mark Kirkwood (#1)
Re: Solaris Performance (Again)

Mark Kirkwood <markir@paradise.net.nz> writes:

Note : The Pgbench runs were conducted using -s 10 and -t 1000 -c
1->64, 2 - 3 runs of each setup were performed (averaged figures
shown).

FYI, the pgbench docs state:

NOTE: scaling factor should be at least as large as the largest
number of clients you intend to test; else you'll mostly be
measuring update contention.

-Neil

#4Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Jeff (#2)
Re: Solaris Performance (Again)

Good point.

It is Pg 7.4beta1, compiled with

CFLAGS += -O2 -funroll-loops -fexpensive-optimizations

Jeff wrote:

What version of PG?

If it is before 7.4 PG compiles with _NO_ optimization by default and
was a huge part of the slowness of PG on solaris.

#5Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Neil Conway (#3)
Re: Solaris Performance (Again)

Yes - originally I was going to stop at 8 clients, but once the bit was
between the teeth... If I get another box to myself I will try -s 50 or
100 and see what that shows up.

cheers

Mark

Neil Conway wrote:

FYI, the pgbench docs state:

NOTE: scaling factor should be at least as large as the largest
number of clients you intend to test; else you'll mostly be
measuring update contention.

-Neil

#6Bruce Momjian
bruce@momjian.us
In reply to: Mark Kirkwood (#1)
fsync method checking

Mark Kirkwood wrote:

This is a well-worn thread title - apologies, but these results seemed
interesting, and hopefully useful in the quest to get better performance
on Solaris:

I was curious to see if the rather uninspiring pgbench performance
obtained from a Sun 280R (see General: ATA Disks and RAID controllers
for database servers) could be improved if more time was spent
tuning.

With the help of a fellow workmate who is a bit of a Solaris guy, we
decided to have a go.

The major performance killer appeared to be mounting the filesystem with
the logging option. The next most significant seemed to be the choice of
sync_method for Pg - the default (open_datasync), which we initially
thought should be the best - appears noticeably slower than fdatasync.

I thought the default was fdatasync, but looking at the code it seems
the default is open_datasync if O_DSYNC is available.

I assume the logic is that we usually do only one write() before
fsync(), so open_datasync should be faster. Why do we not prefer O_FSYNC
over fsync()?

Looking at the code:

#if defined(O_SYNC)
#define OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
#define OPEN_SYNC_FLAG O_FSYNC
#endif
#endif

#if defined(OPEN_SYNC_FLAG)
#if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
#define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#endif

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
#define DEFAULT_SYNC_FLAGBIT OPEN_DATASYNC_FLAG
#else
#if defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD_STR "fdatasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
#define DEFAULT_SYNC_FLAGBIT 0
#else
#define DEFAULT_SYNC_METHOD_STR "fsync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
#define DEFAULT_SYNC_FLAGBIT 0
#endif
#endif

I think the problem is that we prefer O_DSYNC over fdatasync, but do not
prefer O_FSYNC over fsync.

Running the attached test program shows on BSD/OS 4.3:

write 0.000360
write & fsync 0.001391
write, close & fsync 0.001308
open o_fsync, write 0.000924

showing O_FSYNC faster than fsync().

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/wrk/tmp/test_sync.c (text/plain)
#7Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Bruce Momjian (#6)
Re: [HACKERS] fsync method checking

Running the attached test program shows on BSD/OS 4.3:

write 0.000360
write & fsync 0.001391

I think the "write & fsync" pays for the previous "write" test (same filename).

write, close & fsync 0.001308
open o_fsync, write 0.000924

I have modified the program to more closely resemble WAL writes (all
writes to WAL are 8k, and the file is usually already open), and to test
larger (16k) transactions.

zeu@a82101002:~> test_sync1
write 0.000625
write & fsync 0.016748
write & fdatasync 0.006650
write, close & fsync 0.017084
write, close & fdatasync 0.006890
open o_dsync, write 0.015997
open o_dsync, one write 0.007128

For the last line xlog.c would need to be modified, but the measurements
seem to imply that it is only worth it on platforms that have O_DSYNC
but not fdatasync.

Andreas

Attachments:

test_sync1.c (application/octet-stream)
#8Manfred Spraul
manfred@colorfullife.com
In reply to: Bruce Momjian (#6)
Re: [HACKERS] fsync method checking

Bruce Momjian wrote:

write 0.000360
write & fsync 0.001391
write, close & fsync 0.001308
open o_fsync, write 0.000924

That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic -
I guess the hardware write cache is on and the OS doesn't issue cache
flush commands. Realistic values are probably 5 ms vs. 5.3 ms - 6%, not
30%. How large is the syscall latency with BSD/OS 4.3?

One advantage of a separate write and fsync call is better performance
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
sure how often that's necessary, but it's a write while holding both the
WALWriteLock and WALInsertLock. If every write contains an implicit
sync, that call would be much more expensive than necessary.

--
Manfred

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Spraul (#8)
Re: [HACKERS] fsync method checking

Manfred Spraul <manfred@colorfullife.com> writes:

One advantage of a separate write and fsync call is better performance
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
sure how often that's necessary, but it's a write while holding both the
WALWriteLock and WALInsertLock. If every write contains an implicit
sync, that call would be much more expensive than necessary.

Ideally that path isn't taken very often. But I'm currently having a
discussion off-list with a CMU student who seems to be seeing a case
where it happens a lot. (She reports that both WALWriteLock and
WALInsertLock are causes of a lot of process blockages, which seems to
mean that a lot of the WAL I/O is being done with both held, which would
have to mean that AdvanceXLInsertBuffer is doing the I/O. More when we
figure out what's going on exactly...)

regards, tom lane

#10Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#9)
Re: [HACKERS] fsync method checking

Ideally that path isn't taken very often. But I'm currently having a
discussion off-list with a CMU student who seems to be seeing a case
where it happens a lot. (She reports that both WALWriteLock and
WALInsertLock are causes of a lot of process blockages, which seems to
mean that a lot of the WAL I/O is being done with both held, which would
have to mean that AdvanceXLInsertBuffer is doing the I/O.
More when we figure out what's going on exactly...)

I would guess that this happens when a large transaction fills one
XLInsertBuffer while a lot of WAL buffers are not yet written.

Andreas

#11Bruce Momjian
bruce@momjian.us
In reply to: Zeugswetter Andreas SB SD (#7)
Re: [HACKERS] fsync method checking

I have updated my program with your suggested changes and put it in
src/tools/fsync. Please see how you like it.

#12Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#6)
Re: [HACKERS] fsync method checking

I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have
different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:
Simple write timing:
write 0.000055

Compare fsync before and after write's close:
write, fsync, close 0.000707
write, close, fsync 0.000808

Compare one o_sync write to two:
one 16k o_sync write 0.009762
two 8k o_sync writes 0.008799

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000658
(fdatasync unavailable)
write, fsync, 0.000702

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 0.010402
(fdatasync unavailable)
write, fsync, 0.001025

This shows terrible O_SYNC performance for two 8k writes, although
O_SYNC is faster for a single 8k write. Strange.

FreeBSD 4.9:
Simple write timing:
write 0.000083

Compare fsync before and after write's close:
write, fsync, close 0.000412
write, close, fsync 0.000453

Compare one o_sync write to two:
one 16k o_sync write 0.000409
two 8k o_sync writes 0.000993

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000683
(fdatasync unavailable)
write, fsync, 0.000405

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000789
(fdatasync unavailable)
write, fsync, 0.000414

This shows fsync to be fastest in both cases.

Linux 2.4.9:
Simple write timing:
write 0.000061

Compare fsync before and after write's close:
write, fsync, close 0.000398
write, close, fsync 0.000407

Compare one o_sync write to two:
one 16k o_sync write 0.000570
two 8k o_sync writes 0.000340

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000166
write, fdatasync 0.000462
write, fsync, 0.000447

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000334
write, fdatasync 0.000445
write, fsync, 0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.

This unapplied patch:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which
controls the default for those platforms. Platforms with no template
default to fdatasync/fsync.

Would other users run src/tools/fsync and report their findings so I can
update the template files for their OSes? This is a process similar to
our thread testing.

Thanks.

/*
 * test_fsync.c
 * tests if fsync can be done from another process than the original write
 */

#include <sys/types.h>
#include <sys/time.h>           /* gettimeofday(), struct timeval */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>             /* memset() */
#include <time.h>
#include <unistd.h>

/* O_FSYNC is the BSD name; fall back to O_SYNC where only that exists */
#if !defined(O_FSYNC) && defined(O_SYNC)
#define O_FSYNC O_SYNC
#endif

void die(const char *str);
void print_elapse(struct timeval start_t, struct timeval elapse_t);

int main(int argc, char *argv[])
{
    struct timeval start_t;
    struct timeval elapse_t;
    int tmpfile;
    char strout[200];

    /* 200 bytes of filler data to write in each test */
    memset(strout, 'a', sizeof(strout));

    /* write only */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, sizeof(strout));
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, sizeof(strout));
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write & fsync ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write, close & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, sizeof(strout));
    close(tmpfile);
    /* reopen file, then fsync through the new descriptor */
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write, close & fsync ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* open_fsync, write: every write() is synchronous, no explicit fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT | O_FSYNC, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, sizeof(strout));
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("open o_fsync, write ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    return 0;
}

/* print elapse_t - start_t as seconds.microseconds */
void print_elapse(struct timeval start_t, struct timeval elapse_t)
{
    /* borrow one second if the microsecond field would go negative */
    if (elapse_t.tv_usec < start_t.tv_usec)
    {
        elapse_t.tv_sec--;
        elapse_t.tv_usec += 1000000;
    }

    printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
           (long) (elapse_t.tv_usec - start_t.tv_usec));
}

/* report a fatal error and exit */
void die(const char *str)
{
    fprintf(stderr, "%s\n", str);
    exit(1);
}

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#12)
Re: fsync method checking

Bruce Momjian wrote:

I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have
different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

[snip]

Linux 2.4.9:

This is a pretty old kernel (I am writing from a machine running 2.4.22).

Before we do this for Linux, testing on a more modern kernel might be
wise.

cheers

andrew

#14Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#13)
Re: fsync method checking

Andrew Dunstan wrote:

Bruce Momjian wrote:

I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have
different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

[snip]

Linux 2.4.9:

This is a pretty old kernel (I am writing from a machine running 2.4.22)

Maybe before we do this for Linux testing on a more modern kernel might
be wise.

Sure, I am sure someone will post results.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#12)
Re: [HACKERS] fsync method checking

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

As I recall, that was based on testing on some different platforms.
It's not particularly "strange": O_SYNC implies writing at least two
places on the disk (file and inode). O_DSYNC or fdatasync should
theoretically be the fastest alternatives, O_SYNC and fsync the worst.

Compare fsync before and after write's close:
write, fsync, close 0.000707
write, close, fsync 0.000808

What does that mean? You can't fsync a closed file.

This shows terrible O_SYNC performance for 2 8k writes, but is faster
for a single 8k write. Strange.

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

regards, tom lane

#16Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#15)
Re: [HACKERS] fsync method checking

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

As I recall, that was based on testing on some different platforms.
It's not particularly "strange": O_SYNC implies writing at least two
places on the disk (file and inode). O_DSYNC or fdatasync should
theoretically be the fastest alternatives, O_SYNC and fsync the worst.

But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

Compare fsync before and after write's close:
write, fsync, close 0.000707
write, close, fsync 0.000808

What does that mean? You can't fsync a closed file.

You reopen and fsync.

This shows terrible O_SYNC performance for 2 8k writes, but is faster
for a single 8k write. Strange.

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducible on all the platforms I tested.

#17Kurt Roeckx
In reply to: Bruce Momjian (#16)
Re: [HACKERS] fsync method checking

On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducible on all the platforms I tested.

Just because they are reproducible does not mean that they mean
anything in the real world.

Kurt

#18Bruce Momjian
bruce@momjian.us
In reply to: Kurt Roeckx (#17)
Re: [HACKERS] fsync method checking

Kurt Roeckx wrote:

On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducible on all the platforms I tested.

Just because they are reproducible does not mean that they mean
anything in the real world.

OK, what better test do you suggest? Right now, there has been no
testing of these.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#16)
Re: [HACKERS] fsync method checking

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

As I recall, that was based on testing on some different platforms.

But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

It's what tested out as the best bet. I think we were using pgbench
as the test platform, which as you know I have doubts about, but at
least it is testing one actual write/sync pattern Postgres can generate.
The choice between the open flags and fdatasync/fsync depends a whole
lot on your writing patterns (how much data you tend to write between
fsync points), so I don't have a lot of faith in randomly-chosen test
programs as a guide to what to use for Postgres.

What does that mean? You can't fsync a closed file.

You reopen and fsync.

Um. I just looked at that test program, and I think it needs a whole
lot of work yet.

* Some of the test cases count open()/close() overhead, some don't.
This is bad, especially on platforms like Solaris where open() is
notoriously expensive.

* You really cannot put any faith in measuring a single write,
especially on a machine that's not *completely* idle otherwise.
I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
measured the time for that. (And you have to think about how far
apart the fsyncs are in that sequence; you probably want to repeat the
measurement with several different fsync spacings.) It would also be
a good idea to compare writing 1000 successive blocks with rewriting
the same block 1000 times --- if the latter does not happen roughly
at the disk RPM rate, then we know the drive is lying and all the
numbers should be discarded as meaningless.

* The program is claimed to test whether you can write from one process
and fsync from another, but it does no such thing AFAICS.

BTW, rather than hard-wiring the test file name, why don't you let it be
specified on the command line? That would make it lots easier for
people to compare the performance of several disk drives, if they have
'em.

regards, tom lane

#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
Re: [HACKERS] fsync method checking

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

As I recall, that was based on testing on some different platforms.

But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

It's what tested out as the best bet. I think we were using pgbench
as the test platform, which as you know I have doubts about, but at
least it is testing one actual write/sync pattern Postgres can generate.
The choice between the open flags and fdatasync/fsync depends a whole
lot on your writing patterns (how much data you tend to write between
fsync points), so I don't have a lot of faith in randomly-chosen test
programs as a guide to what to use for Postgres.

I assume pgbench has so much variance that trying to see fsync changes
in there would be hopeless.

What does that mean? You can't fsync a closed file.

You reopen and fsync.

Um. I just looked at that test program, and I think it needs a whole
lot of work yet.

* Some of the test cases count open()/close() overhead, some don't.
This is bad, especially on platforms like Solaris where open() is
notoriously expensive.

The only one I saw that had an extra open() was the fsync after close
test. I added a do-nothing open/close to the previous test so they are
the same.

* You really cannot put any faith in measuring a single write,
especially on a machine that's not *completely* idle otherwise.
I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
measured the time for that. (And you have to think about how far

OK, it now measures a loop of 1000.

apart the fsyncs are in that sequence; you probably want to repeat the
measurement with several different fsync spacings.) It would also be
a good idea to compare writing 1000 successive blocks with rewriting
the same block 1000 times --- if the latter does not happen roughly
at the disk RPM rate, then we know the drive is lying and all the
numbers should be discarded as meaningless.

* The program is claimed to test whether you can write from one process
and fsync from another, but it does no such thing AFAICS.

It really just shows whether the fsync after the close has similar
timing to the one before the close. That was the best way I could think
to test it.

BTW, rather than hard-wiring the test file name, why don't you let it be
specified on the command line? That would make it lots easier for
people to compare the performance of several disk drives, if they have
'em.

I have updated the test program in CVS.

New BSD/OS results:

Simple write timing:
write 0.034801

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.868831
write, close, fsync 0.717281

Compare one o_sync write to two:
one 16k o_sync write 10.121422
two 8k o_sync writes 4.405151

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 1.542213
(fdatasync unavailable)
write, fsync, 1.703689

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 4.498607
(fdatasync unavailable)
write, fsync, 2.473842

#21Kurt Roeckx
In reply to: Bruce Momjian (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#23Bruce Momjian
bruce@momjian.us
In reply to: Kurt Roeckx (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kurt Roeckx (#21)
#25Kurt Roeckx
In reply to: Bruce Momjian (#23)
#26Bruce Momjian
bruce@momjian.us
In reply to: Kurt Roeckx (#25)
#27Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#24)
#28Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#28)
In reply to: Bruce Momjian (#26)
#32Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#30)
#33Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#22)
#34Kevin Brown
kevin@sysexperts.com
In reply to: Kevin Brown (#32)
#35Mark Wong
markw@osdl.org
In reply to: Tom Lane (#29)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Wong (#35)
#37Bruce Momjian
bruce@momjian.us
In reply to: Mark Wong (#35)
#38Manfred Spraul
manfred@colorfullife.com
In reply to: Tom Lane (#36)
#39Mark Wong
markw@osdl.org
In reply to: Manfred Spraul (#38)
#40Bruce Momjian
bruce@momjian.us
In reply to: Mark Wong (#39)
#41Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#40)
#42Mark Wong
markw@osdl.org
In reply to: Tom Lane (#36)
#43Manfred Spraul
manfred@colorfullife.com
In reply to: Mark Wong (#42)
#44Mark Wong
markw@osdl.org
In reply to: Manfred Spraul (#43)
#45Bruce Momjian
bruce@momjian.us
In reply to: Mark Wong (#44)
#46Mark Wong
markw@osdl.org
In reply to: Bruce Momjian (#45)
#47Steve Atkins
steve@blighty.com
In reply to: Manfred Spraul (#43)