Simplifying wal_sync_method
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
I don't understand why we support so many values. It seems 'fsync'
should be fdatasync(), and if that is not available, fsync(). Same with
open_sync and open_datasync.
In fact, 8.1 uses O_DIRECT if available, and I don't see why we don't
just use the "data" options automatically if available too, rather than
have users guess which options their OS supports. We might need an
option to print the actual features used, but I am not sure.
Is this something for 8.1 or 8.2?
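(For illustration, a minimal sketch of the automatic fallback being proposed here, assuming configure defines HAVE_FDATASYNC on platforms that have fdatasync(); a hypothetical helper, not actual PostgreSQL source.)

    /*
     * Hypothetical sketch of the proposed behavior: prefer fdatasync(),
     * which skips flushing unneeded file metadata, and fall back to fsync().
     */
    #include <unistd.h>

    static int
    wal_flush(int fd)
    {
    #ifdef HAVE_FDATASYNC
        return fdatasync(fd);
    #else
        return fsync(fd);
    #endif
    }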
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
--
marko
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
I don't understand why we support so many values.
Because there are so many platforms with different subsets of these APIs
and different performance characteristics for the ones they do have.
It seems 'fsync' should be fdatasync(), and if that is not available,
fsync().
I have yet to see anyone do any systematic testing of the different
options on different platforms. In the absence of hard data, proposing
that we don't need some of the options is highly premature.
In fact, 8.1 uses O_DIRECT if available,
That's a decision that hasn't got a shred of evidence to justify
imposing it on every platform.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
I don't understand why we support so many values.
Because there are so many platforms with different subsets of these APIs
and different performance characteristics for the ones they do have.
Right, and our current behavior makes it harder for people to even know
the supported options.
It seems 'fsync' should be fdatasync(), and if that is not available,
fsync().
I have yet to see anyone do any systematic testing of the different
options on different platforms. In the absence of hard data, proposing
that we don't need some of the options is highly premature.
No one is ever going to do it, so we might as well make the best guess
we have. I think any platform where the *data* options are slower than
the non-*data* options is broken, and if that logic holds, we might as
well just use *data* by default if we can, which is my proposal.
In fact, 8.1 uses O_DIRECT if available,
That's a decision that hasn't got a shred of evidence to justify
imposing it on every platform.
Right, and there is no evidence it hurts, so we do our best until
someone comes up with data to suggest we are wrong. The same should be
done with *data*.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
In summary, we added all those wal_sync_method values in hopes of
getting some data on which is best on which platform, but having gone
several years with few reports, I am thinking we should just choose the
best ones we can and move on, rather than expose a confusing API to the
users.
Can anyone show a platform where the *data* options are slower than the
non-*data* ones?
On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasyncOn same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
For some reason I don't see "corrupted database after crash"
reports on Unixen. Why?
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
--
marko
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
For some reason I don't see "corrupted database after crash"
reports on Unixen. Why?
They use SCSI or battery-backed RAID cards more often?
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
If it uses SCSI, it is secure, just like Unix.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Alvaro Herrera wrote:
On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Marko,
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
Read the referenced message again. It's only an issue if you want to use
open_datasync. fsync_writethrough should be safe.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote:
Alvaro Herrera wrote:
On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
Marko Kreen wrote:
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
Yes, because that's the only place where that option works, right?
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"I dream about dreams about dreams", sang the nightingale
under the pale moon (Sandman)
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
Yes, because that's the only place where that option works, right?
fsync_writethrough only works on Win32; the postgresql.conf should
reflect that.
On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote:
Alvaro Herrera wrote:
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
We should do what's possible with what's given to us.
On Win32:
1. We can write through cache.
2. We have unreliable OS with unreliable filesystem.
3. The probability of mediocre hardware is higher.
Regular POSIX:
1. We can't write through cache.
2. We have good OS with good filesystem (probably even
journaled).
3. The probability of mediocre hardware is lower.
Why shouldn't we offer reliable option to win32?
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
- Win32 guy complains he lost database.
We tell him he didn't RTFM.
Which way do you make more friends?
--
marko
PS. Yeah, I was the guy who helped him to restore what's left.
I'd say he wasn't exactly happy.
On Mon, Aug 08, 2005 at 03:10:54PM -0700, Josh Berkus wrote:
Marko,
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
Read the referenced message again. It's only an issue if you want to use
open_datasync. fsync_writethrough should be safe.
But that's the point. Why isn't fsync_writethrough the default?
--
marko
Bruce,
No one is ever going to do it, so we might as well make the best guess
we have. I think any platform where the *data* options are slower than
the non-*data* options is broken, and if that logic holds, we might as
well just use *data* by default if we can, which is my proposal.
Changing the defaults is fine with me. I just don't think that we can
afford to prune options without more testing. And we will be getting
more testing (from companies) in the future, so I don't think this is
completely out of the question.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote:
In summary, we added all those wal_sync_method values in hopes of
getting some data on which is best on which platform, but having gone
several years with few reports, I am thinking we should just choose the
best ones we can and move on, rather than expose a confusing API to the
users.
I agree this should be attempted over the 8.1 beta period.
This is a good case for having a Port Coordinator assigned for each
port, so we could ask them to hunt out the solution for their platform.
Maybe this is something that we can broadcast to the BuildFarm team, so
each person can reflect on the appropriate settings?
Best Regards, Simon Riggs
Bruce Momjian <pgman@candle.pha.pa.us> writes:
No one is ever going to do it, so we might as well make the best guess
we have. I think any platform where the *data* options are slower than
the non-*data* options is broken, and if that logic holds, we might as
well just use *data* by default if we can, which is my proposal.
Adjusting the default settings I don't have a problem with. Removing
options I have a problem with --- and that appeared to be what you
were proposing.
regards, tom lane
Simon Riggs wrote:
On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote:
In summary, we added all those wal_sync_method values in hopes of
getting some data on which is best on which platform, but having gone
several years with few reports, I am thinking we should just choose the
best ones we can and move on, rather than expose a confusing API to the
users.
I agree this should be attempted over the 8.1 beta period.
This is a good case for having a Port Coordinator assigned for each
port, so we could ask them to hunt out the solution for their platform.
Maybe this is something that we can broadcast to the BuildFarm team, so
each person can reflect on the appropriate settings?
It might be possible to build a new set of tests that we could perform.
That would have to be built into the buildfarm script, as the PL tests
were, but they were picked up pretty quickly by the community.
Unfortunately it doesn't sound like these would fit into the pg_regress
setup, so we'll have to devise a different test harness - probably not a
bad idea for automated performance testing anyway.
So the short answer is possibly "You build the tests and we'll run 'em."
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
So the short answer is possibly "You build the tests and we'll run 'em."
The availability of the buildfarm certainly makes it a lot more feasible
to do performance tests on a variety of platforms. So, who wants to
knock something together?
I suppose we would usually be interested in one-time tests, rather than
something repeated every time CVS is touched. How might that sort of
requirement fit into the buildfarm software design?
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Marko Kreen wrote:
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
I thought we had changed the default for Windows to be fsync_writethrough
in 8.1? We didn't have that code in 8.0, but now that we do, it surely
seems like the sanest default.
regards, tom lane
Tom Lane said:
Kris Jurka <books@ejurka.com> writes:
Automated performance testing seems like a bad idea for the buildfarm.
Consider in my particular case I've got three members that all
happen to be running in virtual machines on the same host. What
virtualization does for performance and what happens when all three
members are running at the same time renders any results beyond
useless.
Certainly a good point --- but as I noted to Andrew, we'd probably be
more interested in one-off tests than repetitive testing anyway. So
possibly this could be handled with a different protocol, and buildfarm
machine owners could be careful to schedule slots for such tests at
times when their machine is otherwise idle.
Anyway it all needs some thought ...
Well, of course running tests would be optional.
But it's also possible that we would create a similar but separate setup to
run performance tests. Creating it would be lots easier this time around ;-)
Let's come up with something we can run by hand, decide the parameters, and
set about automating and distributing it.
cheers
andrew
On Mon, 8 Aug 2005, Andrew Dunstan wrote:
So the short answer is possibly "You build the tests and we'll run 'em."
Automated performance testing seems like a bad idea for the buildfarm.
Consider in my particular case I've got three members that all happen to
be running in virtual machines on the same host. What virtualization does
for performance and what happens when all three members are running at the
same time renders any results beyond useless. Certainly soliciting the
pgbuildfarm-members@pgfoundry.org list is good idea, but I don't think
automating this testing is a good idea without more knowledge of the
machines and their other workloads.
Kris Jurka
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
So the short answer is possibly "You build the tests and we'll run 'em."
The availability of the buildfarm certainly makes it a lot more feasible
to do performance tests on a variety of platforms. So, who wants to
knock something together?
I suppose we would usually be interested in one-time tests, rather than
something repeated every time CVS is touched. How might that sort of
requirement fit into the buildfarm software design?
I'll give it some thought. Maybe a unique name would do the trick.
cheers
andrew
Kris Jurka <books@ejurka.com> writes:
Automated performance testing seems like a bad idea for the buildfarm.
Consider in my particular case I've got three members that all happen to
be running in virtual machines on the same host. What virtualization does
for performance and what happens when all three members are running at the
same time renders any results beyond useless.
Certainly a good point --- but as I noted to Andrew, we'd probably be
more interested in one-off tests than repetitive testing anyway. So
possibly this could be handled with a different protocol, and buildfarm
machine owners could be careful to schedule slots for such tests at
times when their machine is otherwise idle.
Anyway it all needs some thought ...
regards, tom lane
Joshua D. Drake wrote:
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
Yes, because that's the only place where that option works, right?
fsync_writethrough only works on Win32; the postgresql.conf should
reflect that.
Right now it isn't clear at all which values wal_sync_method supports.
If your platform has fdatasync() or O_DSYNC (with a value different from
O_SYNC/O_FSYNC), those settings are available; if not, choosing one of
them gives an error. For example, my system doesn't have fdatasync(), so
if I try to use that value I get this in my server logs:
FATAL: invalid value for parameter "wal_sync_method": "fdatasync"
and the server does not start. Also, writethrough is supported in 8.1
by both Win32 and OS X.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
fsync_writethrough only works on Win32 the postgresql.conf should
reflect that.
Right now what wal_sync_method supports isn't clear at all.
Yeah. I think we had a TODO to figure out a way for the assign_hook to
report back exactly which values *are* allowed on the current platform.
Constructing the message for this doesn't seem very difficult, but the
rules about when assign_hooks can issue their own elog message seem
to constrain the usefulness...
regards, tom lane
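(As a rough illustration of that TODO item, the supported set could be assembled from the same kinds of #ifdef tests the code already relies on and shown in the error hint. A hypothetical helper only, not the actual GUC assign_hook machinery; the exact macro tests are assumptions.)

    /* Build a list of the wal_sync_method values this build supports.
     * Hypothetical sketch; the real tests live in the xlog code. */
    #include <fcntl.h>

    static const char *
    supported_sync_methods(void)
    {
        return "fsync"
    #ifdef HAVE_FDATASYNC
               ", fdatasync"
    #endif
    #ifdef O_SYNC
               ", open_sync"
    #endif
    #if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
               ", open_datasync"
    #endif
    #if defined(WIN32) || defined(__darwin__)
               ", fsync_writethrough"
    #endif
               ;
    }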
On Mon, 2005-08-08 at 17:03 -0400, Tom Lane wrote:
That's a decision that hasn't got a shred of evidence to justify
imposing it on every platform.
This option has its uses on Linux, however. In my testing it's good for
a large speedup (20%) on a 10-client pgbench, and a minor improvement
with 100 clients. See my mail of July 14th "O_DIRECT for WAL writes".
-jwb
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
No one is ever going to do it, so we might as well make the best guess
we have. I think any platform where the *data* options are slower than
the non-*data* options is broken, and if that logic holds, we might as
well just use *data* by default if we can, which is my proposal.
Adjusting the default settings I don't have a problem with. Removing
options I have a problem with --- and that appeared to be what you
were proposing.
Well, right now we support:
* open_datasync (write WAL files with open() option O_DSYNC)
* fdatasync (call fdatasync() at each commit),
* fsync (call fsync() at each commit)
* fsync_writethrough (force write-through of any disk write cache)
* open_sync (write WAL files with open() option O_SYNC)
and we pick the first supported item as the default. I have updated our
documentation to clarify this.
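(Roughly, the "first supported item" rule amounts to a compile-time choice along these lines; a simplified sketch, not the actual xlog definitions, and the macro names here are illustrative.)

    #include <fcntl.h>

    /* Default wal_sync_method: the first entry in the list above that
     * this platform supports.  Simplified illustration only. */
    #if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
    #define DEFAULT_WAL_SYNC_METHOD "open_datasync"
    #elif defined(HAVE_FDATASYNC)
    #define DEFAULT_WAL_SYNC_METHOD "fdatasync"
    #else
    #define DEFAULT_WAL_SYNC_METHOD "fsync"
    #endif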
My proposal is to remove fdatasync and open_datasync, and have
fsync _prefer_ fdatasync and open_sync prefer open_datasync, but fall
back to fsync and open_sync if the *data* versions are not supported.
More options give us flexibility, but they also add the complexity of
options that have never proven useful in the years we have had them,
namely using fsync when fdatasync is supported.
If we remove the *data* spellings, we can probably support both
open_sync and fsync on all platforms because the *data* varieties are
the ones that are not always supported.
One problem is that by removing the *data* versions, you would never
know if you were calling fsync or fdatasync internally.
We also need to re-test these defaults because we now have O_DIRECT and
group writes of WAL.
If we test using the build farm, alternating between two options, and
one is always faster than the other, I think we can conclude that it
really is faster, even if there are other loads on the system.
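(Something hand-runnable for that kind of re-test could look like the sketch below: time a series of small writes each followed by fsync() versus fdatasync(), on the same filesystem that holds pg_xlog. It assumes a POSIX system that has fdatasync(); crude, but enough to compare two methods on one box.)

    /* Crude comparison of fsync() vs. fdatasync() latency on 8kB writes. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define LOOPS 1000

    static double
    time_sync(int fd, const char *buf, int use_fdatasync)
    {
        struct timeval start, stop;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < LOOPS; i++)
        {
            if (lseek(fd, 0, SEEK_SET) < 0 || write(fd, buf, 8192) != 8192)
            {
                perror("write");
                exit(1);
            }
            if (use_fdatasync ? fdatasync(fd) : fsync(fd))
            {
                perror("sync");
                exit(1);
            }
        }
        gettimeofday(&stop, NULL);
        return (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1000000.0;
    }

    int
    main(void)
    {
        char buf[8192];
        int fd = open("sync_test.out", O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        memset(buf, 'x', sizeof(buf));
        printf("fsync:     %.3f s for %d writes\n", time_sync(fd, buf, 0), LOOPS);
        printf("fdatasync: %.3f s for %d writes\n", time_sync(fd, buf, 1), LOOPS);
        close(fd);
        unlink("sync_test.out");
        return 0;
    }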
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Marko Kreen wrote:
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
I thought we had changed the default for Windows to be fsync_writethrough
in 8.1? We didn't have that code in 8.0, but now that we do, it surely
seems like the sanest default.
Well, 8.0 shipped with _commit() for fsync(), which in fact is
writethrough, but we decided that that wasn't a good default because:
o it didn't match Unix
o Oracle doesn't use that method for fsync
o we would be slower than Oracle on Win32
o it is a loss for battery backed RAID
so we moved _commit() to fsync_writethrough, and found a way to do real
fdatasync as the default on Win32 in 8.0.2. This is clearly mentioned
in the release notes:
* Enable the wal_sync_method setting of "open_datasync" on Windows, and
make it the default for that platform (Magnus, Bruce) Because the
default is no longer "fsync_writethrough", data loss is possible during
a power failure if the disk drive has write caching enabled. To turn off
the write cache on Windows, from the Device Manager, choose the drive
properties, then Policies.
This was discussed on the lists extensively.
One problem with writethrough is that drives that don't do writethrough
by default are often the ones with the worst performance for this,
namely IDE drives.
Also, in FreeBSD, if you add "hw.ata.wc=0" to /boot/loader.conf, you get
write-through, but for all ATA drives. Should we recommend that?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
My proposal is to remove fdatasync and open_datasync, and have
fsync _prefer_ fdatasync and open_sync prefer open_datasync, but fall
back to fsync and open_sync if the *data* versions are not supported.
And this will buy us what, other than lack of flexibility?
The "data" options already are the default when available, I think
(if not, I have no objection to making them so). That does not
equate to saying we should remove access to the other options.
Your argument that they are useless only holds up in a perfect
world where there are no hardware bugs and no kernel bugs ...
and last I checked, we do not live in such a world.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
My proposal is to remove fdatasync and open_datasync, and have
fsync _prefer_ fdatasync and open_sync prefer open_datasync, but fall
back to fsync and open_sync if the *data* versions are not supported.
And this will buy us what, other than lack of flexibility?
Clarity in testing options.
The "data" options already are the default when available, I think
(if not, I have no objection to making them so). That does not
They are.
equate to saying we should remove access to the other options.
Your argument that they are useless only holds up in a perfect
world where there are no hardware bugs and no kernel bugs ...
and last I checked, we do not live in such a world.
Is it useful to have the option of using non-*data* options when *data*
options are available? I have never heard of anyone wanting to do that,
nor do I imagine anyone doing that. Is there a real use case?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Aug 08, 2005 at 08:04:44PM -0400, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Marko Kreen wrote:
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
am going to write a section in the manual for 8.1 about these
reliability issues.
I thought we had changed the default for Windows to be fsync_writethrough
in 8.1? We didn't have that code in 8.0, but now that we do, it surely
seems like the sanest default.
Seems it _was_ the default in 8.0 and 8.0.1 (called fsync) but was
renamed to fsync_writethrough in 8.0.2 and is no longer the default.
Now, 8.0.2 was released on 2005-04-07 and the first destruction
happened on 2005-07-20. If this says anything about the future,
I don't think PostgreSQL will stay known as a 'reliable' database.
--
marko
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, fsync_writethrough,
# open_sync, open_datasync
On same topic:
http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
Why does win32 PostgreSQL allow data corruption by default?
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
Correction, if you have bbwc, you *should not* have writethrough. Not
only do you not need it, enabling it will drastically lower performance.
am going to write a section in the manual for 8.1 about these
reliability issues.
For some reason I don't see "corrupted database after crash"
reports on Unixen. Why?
Because you don't read the lists often enough? I see it happen quite
often.
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
It can, you just need to learn how to configure your system. There are
two different options to make it safe on win32 without battery backed
cache:
1) Use the postgresql option for fsync write through
2) Configure Windows to disable write caching. If you do this (which I
hope you of course already do on all your Windows servers without a
battery-backed write cache, since it affects all Windows operations,
including the filesystem itself), you are safe with the default settings
in postgresql.
I think what a lot of people don't realise is how easy option 2 is. It's
in traditional windows style *a single checkbox* in the harddisk
configuration.
(Granted, you need a modern windows for that. On older windows it's a
registry key)
I have some code floating in my tree to issue a WARNING on startup if
write cache is enabled and postgresql is not using writethrough. It's
not quite ready yet, but if such a thing would be accepted post
feature-freeze I can have it finished in good time before 8.1. It would
be quite simple (looking at just the main data directory for example,
ignoring tablespaces), but if you're dealing with complex installations
you'd better have a clue about how windows works anyway...
//Magnus
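(For reference, a check along the lines of the startup WARNING Magnus describes can be done with IOCTL_DISK_GET_CACHE_INFORMATION. The sketch below is an independent illustration, not Magnus's code, and it hard-codes PhysicalDrive0 instead of mapping the data directory to its drive; opening the physical drive may require administrator rights.)

    /* Report whether the write cache is enabled on PhysicalDrive0. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        DISK_CACHE_INFORMATION info;
        DWORD len;
        HANDLE h = CreateFileA("\\\\.\\PhysicalDrive0", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);

        if (h == INVALID_HANDLE_VALUE)
        {
            fprintf(stderr, "could not open drive: error %lu\n", GetLastError());
            return 1;
        }
        if (DeviceIoControl(h, IOCTL_DISK_GET_CACHE_INFORMATION,
                            NULL, 0, &info, sizeof(info), &len, NULL))
            printf("write cache is %s\n",
                   info.WriteCacheEnabled ? "ENABLED (unsafe without battery)"
                                          : "disabled");
        else
            fprintf(stderr, "cache query failed: error %lu\n", GetLastError());
        CloseHandle(h);
        return 0;
    }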
I think we should offer the reliable option by default, and mention the
fast option for those who have battery-backed cache in the manual.
But only on Win32?
We should do what's possible with what's given to us.
On Win32:
1. We can write through cache.
Yes.
2. We have unreliable OS with unreliable filesystem.
That can definitely be debated. Properly maintained on proper hardware,
it's quite reliable these days.
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in the
admin. Sure, there are lots of things that could be better with ntfs,
but I would definitely not call it unreliable.
3. The probability of mediocre hardware is higher.
I would say it's actually *lower*. If you look in the average
datacenter, I bet you'll find a lot more linux boxes running on
built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes
will run on HP or IBM or whatever real server-grade hardware.
I don't know anybody who claims to run a professional business who uses
IDE drives in a Windows server, for example. I know several who run
linux or freebsd on it.
Regular POSIX:
1. We can't write through cache.
2. We have good OS with good filesystem (probably even
journaled).
NTFS is journaled, BTW. And I've seen a lot more corruption on ext2,
ext3 or reiser than I've seen on NTFS in my datacenter - and I have
about 5 times more Windows servers than linux...
Granted other unixen might be more stable, I don't run any of those..
3. The probability of mediocre hardware is lower.
See above.
Why shouldn't we offer reliable option to win32?
*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.
PS. Yeah, I was the guy who helped him to restore what's left.
I'd say he wasn't exactly happy.
I bet. Has he looked over all his other windows servers that are
improperly configured with regards to write cache?
//Magnus
On Tue, Aug 09, 2005 at 10:02:44AM +0200, Magnus Hagander wrote:
It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default. I
Correction, if you have bbwc, you *should not* have writethrough. Not
only do you not need it, enabling it will drastically lower performance.
So what? The user should read the docs to learn how to get good performance.
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
It can, you just need to learn how to configure your system. There are
two different options to make it safe on win32 without battery backed
cache:
I personally do not use PostgreSQL in win32 (yet - this may
change). I just felt the pain of a guy who tried...
in traditional windows style *a single checkbox* in the harddisk
configuration.
(Granted, you need a modern windows for that. On older windows it's a
registry key)
I think PostgreSQL should be reliable by default.
Now with the Windows port there are a lot of people who just try it out
on a regular desktop machine.
With point-n-click installer there's no need to read docs and
after experiencing the unreliability they won't take it as
serious database.
I have some code floating in my tree to issue a WARNING on startup if
write cache is enabled and postgresql is not using writethrough. It's
not quite ready yet, but if such a thing would be accepted post
feature-freeze I can have it finished in good time before 8.1. It would
be quite simple (looking at just the main data directory for example,
ignoring tablespaces), but if you're dealing with complex installations
you'd better have a clue about how windows works anyway...
Hey, that's a good idea, irrespective of whether the default changes or not.
I think if it's just a couple of checks and then a printf, it should
not meet much resistance.
--
marko
Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.
It can, you just need to learn how to configure your system. There are
two different options to make it safe on win32 without battery backed
cache:
I personally do not use PostgreSQL in win32 (yet - this may
change). I just felt the pain of a guy who tried...
Didn't mean "you" as in you personally, meant "you" as in the user.
Sorry.
in traditional windows style *a single checkbox* in the harddisk
configuration.
(Granted, you need a modern windows for that. On older windows it's a
registry key)
I think PostgreSQL should be reliable by default.
For that I think we need to set it to fsync() on all platforms. It's the
least unsafe one on POSIX and it's the safe one on Win32.
Now with the Windows port there are a lot of people who just
try it out on a regular desktop machine.
Sure, but if you're just trying it out, it's not going to kill you if
you lose the data...
With point-n-click installer there's no need to read docs and
after experiencing the unreliability they won't take it as
serious database.
Well the same reasoning applies to the fact that they won't take it as a
serious database because it's too slow.
Perhaps we need to provide an option in the installer to control what
goes in the initialized database. With an explanation ("don't enable
this if you use IDE disks and care about your data").
I have some code floating in my tree to issue a WARNING on startup if
write cache is enabled and postgresql is not using writethrough. It's
not quite ready yet, but if such a thing would be accepted post
feature-freeze I can have it finished in good time before 8.1. It
would be quite simple (looking at just the main data directory for
example, ignoring tablespaces), but if you're dealing with complex
installations you'd better have a clue about how windows works anyway...
Hey, that's a good idea, irrespective of whether the default
changes or not.
I think if it's just a couple of checks and then a printf, it
should not meet much resistance.
That's the general idea - I'm hoping it will be that simple at least :-)
//Magnus
On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote:
That can definitely be debated. Properly maintained on proper hardware,
it's quite reliable these days.
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in the
admin. Sure, there are lots of things that could be better with ntfs,
but I would definitely not call it unreliable.
People enable? Isn't it the default?
3. The probability of mediocre hardware is higher.
I would say it's actually *lower*. If you look in the average
datacenter, I bet you'll find a lot more linux boxes running on
built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes
will run on HP or IBM or whatever real server-grade hardware.
I don't know anybody who claims to run a professional business who uses
IDE drives in a Windows server, for example. I know several who run
linux or freebsd on it.
The professional probably tests it on his own desktop. I don't
think PostgreSQL reaches the data center before passing the run
on desktop.
Regular POSIX:
1. We can't write through cache.
2. We have good OS with good filesystem (probably even
journaled).
NTFS is journaled, BTW. And I've seen a lot more corruption on ext2,
ext3 or reiser than I've seen on NTFS in my datacenter - and I have
about 5 times more Windows servers than linux...
Granted other unixen might be more stable, I don't run any of those..
3. The probability of mediocre hardware is lower.
See above.
Ok, comparing impressions is not productive.
Why shouldn't we offer reliable option to win32?
*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.
What do you mean here? AFAIK we try to be reliable on POSIX too.
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.
But lost database is no problem?
--
marko
That can definitely be debated. Properly maintained on proper hardware,
it's quite reliable these days.
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in
the admin. Sure, there are lots of things that could be better with
ntfs, but I would definitely not call it unreliable.
People enable? Isn't it the default?
I dunno about workstation OS, but on the server OSes it certainly isn't
default.
3. The probability of mediocre hardware is higher.
I would say it's actually *lower*. If you look in the average
datacenter, I bet you'll find a lot more linux boxes running on
built-at-home-with-the-cheapest-parts boxes. Whereas your windows
boxes will run on HP or IBM or whatever real server-grade hardware.
I don't know anybody who claims to run a professional business who
uses IDE drives in a Windows server, for example. I know several who
run linux or freebsd on it.
The professional probably tests it on his own desktop. I
don't think PostgreSQL reaches the data center before passing
the run on desktop.
I can't speak for others, but I would always test a server product on a
server OS on server hardware. Certainly not as beefy as eventual
production server, but the same level. Otherwise the test is not fully
relevant.
Why shouldn't we offer reliable option to win32?
*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.
What do you mean here? AFAIK we try to be reliable on POSIX too.
AFAIK fsync is slightly safer than open_sync, because it also flushes
the metadata. We don't default to that.
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.
But lost database is no problem?
It certainly is. That's not what I'm arguing. What I'm saying is that
you shouldn't expect server grade reliabilty on desktop hardware and
desktop OS. Regardless of platform.
//Magnus
On Tue, Aug 09, 2005 at 12:14:09PM +0200, Magnus Hagander wrote:
That can definitely be debated. Properly maintained on proper hardware,
it's quite reliable these days.
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in
the admin. Sure, there are lots of things that could be better with
ntfs, but I would definitely not call it unreliable.
People enable? Isn't it the default?
I dunno about workstation OS, but on the server OSes it certainly isn't
default.
At least on XP Pro it is default.
The professional probably tests it on his own desktop. I
don't think PostgreSQL reaches the data center before passing
the run on desktop.
I can't speak for others, but I would always test a server product on a
server OS on server hardware. Certainly not as beefy as eventual
production server, but the same level. Otherwise the test is not fully
relevant.
You are right, but it does not always happen that way. Also think of
developers who run a dev-server on a desktop.
Why shouldn't we offer reliable option to win32?
*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.
What do you mean here? AFAIK we try to be reliable on POSIX too.
AFAIK fsync is slightly safer than open_sync, because it also flushes
the metadata. We don't default to that.
At least for WAL, the metadata does not change so it should not matter.
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method' affect
it or not?
Of course, postgres could get corrupt data from WAL and put it
into a table. (AFAIK NTFS does not log data, so we are back on
wal_sync_method.)
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.
But lost database is no problem?
It certainly is. That's not what I'm arguing. What I'm saying is that
you shouldn't expect server-grade reliability on desktop hardware and
desktop OS. Regardless of platform.
But we should expect server-grade speed? ;)
--
marko
I dunno about workstation OS, but on the server OSes it certainly
isn't default.
At least on XP Pro it is default.
Yuck.
The professional probably tests it on his own desktop. I don't
think PostgreSQL reaches the data center before passing the run on
desktop.
I can't speak for others, but I would always test a server product on
a server OS on server hardware. Certainly not as beefy as eventual
production server, but the same level. Otherwise the test is not fully
relevant.
You are right, but it does not always happen that way. Also think
of developers who run a dev-server on a desktop.
Well, with developers, losing your data really isn't all that bad. It's a lot easier to deal with than losing a server :-)
Why shouldn't we offer reliable option to win32?
*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.
What do you mean here? AFAIK we try to be reliable on POSIX too.
AFAIK fsync is slightly safer than open_sync, because it also flushes
the metadata. We don't default to that.
At least for WAL, the metadata does not change so it should
not matter.
In most cases, right. In some cases it does (create a new WAL log segment for example). It's not a very common scenario, but I've seen error reports saying that an entire WAL segment is missing which is probably from metadata not being on disk at crash time.
(This is one thing that's "better" with the dbs that stuff everything in a single precreated file (for example mssql) - the only metadata in the filesystem there is the "latest write time", which is completely irrelevant to the data)
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
Of course, postgres could get corrupt data from WAL and put it
into a table. (AFAIK NTFS does not log data, so we are back on
wal_sync_method.)
Correct, and I believe that's true for most Unix journaling fs:s as well - they only journal metadata.
Also, once a checkpoint has occurred, postgresql will discard the WAL log. If the sync came through for the checkpoint record in the WAL file but not in the contents of the datafile, the recovery process will think that the file is ok even though it isn't.
It certainly is. That's not what I'm arguing. What I'm saying is that
you shouldn't expect server-grade reliability on desktop hardware and
desktop OS. Regardless of platform.
But we should expect server-grade speed? ;)
Touché :-)
//Magnus
On Tue, Aug 09, 2005 at 12:58:31PM +0200, Magnus Hagander wrote:
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
No. If fsync is off, then no fsync is done to the data files on
checkpoint either. (See mdsync() on src/backend/storage/smgr/md.c)
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
A male gynecologist is like an auto mechanic who never owned a car.
(Carrie Snow)
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm
not 100% sure.
No. If fsync is off, then no fsync is done to the data files
on checkpoint either. (See mdsync() on src/backend/storage/smgr/md.c)
Right, but we're not talking fsync=off, we're talking when you are using
fdatasync, O_SYNC etc.
If you turn off fsync you're on your own, no matter the OS or other
settings...
//Magnus
On Tue, Aug 09, 2005 at 04:05:28PM +0200, Magnus Hagander wrote:
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm
not 100% sure.
No. If fsync is off, then no fsync is done to the data files
on checkpoint either. (See mdsync() on src/backend/storage/smgr/md.c)
Right, but we're not talking fsync=off, we're talking when you are using
fdatasync, O_SYNC etc.
Oh, sorry :-) At that point, pg_fsync is called, which can invoke
_commit() or fsync() depending on whether you have writethrough enabled.
pg_fsync() on storage/file/fd.c
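(In outline, that dispatch looks something like the sketch below: do nothing when fsync is disabled, force the write through the drive cache for fsync_writethrough, otherwise do a plain fsync(). A simplified illustration, not the fd.c source; the two variables stand in for the real GUC-backed state, and the Win32/OS X branches are the write-through mechanisms mentioned elsewhere in this thread.)

    #include <fcntl.h>
    #ifdef WIN32
    #include <io.h>
    #define fsync(fd) _commit(fd)     /* Win32 has no fsync(); _commit() stands in (sketch) */
    #else
    #include <unistd.h>
    #endif

    extern int enableFsync;           /* the "fsync" GUC (assumed name) */
    extern int use_writethrough;      /* wal_sync_method = fsync_writethrough (assumed name) */

    static int
    pg_fsync_sketch(int fd)
    {
        if (!enableFsync)
            return 0;                  /* fsync = off: no flushing at all */

        if (use_writethrough)
        {
    #if defined(WIN32)
            return _commit(fd);        /* forces the write through the disk cache */
    #elif defined(F_FULLFSYNC)
            return fcntl(fd, F_FULLFSYNC, 0);   /* OS X write-through */
    #endif
        }
        return fsync(fd);              /* kernel flush; the drive cache may still lie */
    }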
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
FOO MANE PADME HUM
On Tue, Aug 09, 2005 at 12:25:36PM +0300, Marko Kreen wrote:
On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote:
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in the
admin. Sure, there are lots of things that could be better with ntfs,
but I would definitely not call it unreliable.
People enable? Isn't it the default?
I think there is a little too much speculation in this thread, and not
enough real data... :-)
I only have Windows notebooks, and pre-configured systems by the company
I work for to judge. The notebooks of course have it 'on' (battery packed,
and if it wasn't on, I would have enabled it myself). I won't bother to
check the corporate systems, as whatever they are, they may not be the
Windows system default. Who knows for real?
In any case - I disagreed with the conclusions presented that
suggested that Windows had a poor file system, or should be linked
with poor hardware. Seems like FUD to me, and doesn't match my
experiences. I agree with the other poster that Windows hardware is
usually better in actual professional server environments. It might be
because people feel Windows requires better hardware to be stable, or
it might be that Windows applications tend to use more memory and disk
space, therefore the recommended entry level system is of higher
quality. It doesn't matter why people do it - or even if their reasons
are valid - what does matter, is that it isn't a fair conclusion that
Windows boxes will use poorer hardware. The opposite may be true, or
neither may be true.
I don't know anybody who claims to run a professional business who uses
IDE drives in a Windows server, for example. I know several who run
linux or freebsd on it.
The professional probably tests it on his own desktop. I don't
think PostgreSQL reaches the data center before passing the run
on desktop.
I don't know why this would be relevant. The 'professional' may do
some sort of local testing, but this doesn't negate the requirement
for server testing, as it should be well known that the environment is
sufficiently different, and therefore the expectations should be
sufficiently different. The 'professional' may choose to enable write
caching, because they don't care about reliability on their local
system. If it crashes, they re-clone their system, and re-populate the
database. In any case, this is more speculation, and not productive.
Options:
- Win32 guy complains that PG is bit slow.
We tell him to RTFM.
What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.
But lost database is no problem?
Personally, my only complaint regarding either choice is the
assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is
deficient. As long as the default is well documented, I don't have a
problem with either 'faster but less reliable on systems configured
for speed over reliability at the operating system level (write
caching enabled)' or 'slower, but reliable, just in case the system is
configured for speed over reliability at the operating system level
(write caching enabled)'. As long as it is well documented, either is
fine. I'm not convinced that Linux is really that much safer anyways,
and when it comes to a standard WIN32 configuration option, I assume
that the WIN32 administrator is somewhat competent.
You guys are too deep-rooted in UNIX-land. I can't entirely blame you
- but the world is bigger than UNIX. :-)
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
Magnus Hagander wrote:
I dunno about workstation OS, but on the server OSes it certainly
isn't default.
At least on XP Pro it is default.
Yuck.
I see "enable write caching" as enabled by default on my XP Pro laptop,
though laptops can be said to already have battery-backed disks.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Magnus Hagander wrote:
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
wal_sync_method is also used to flush pages during a checkpoint, so it
could lead to table corruption too, not just WAL corruption.
However, on Unix, 99% of corruption is caused by bad disk or RAM.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm
not 100% sure.
wal_sync_method is also used to flush pages during a
checkpoint, so it could lead to table corruption too, not
just WAL corruption.
However, on Unix, 99% of corruption is caused by bad disk or RAM.
... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...
//Magnus
Magnus Hagander wrote:
Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched? Does the 'wal_sync_method'
affect it or not?
I *think* it always fsyncs() there as it is now, but I'm
not 100% sure.
wal_sync_method is also used to flush pages during a
checkpoint, so it could lead to table corruption too, not
just WAL corruption.
However, on Unix, 99% of corruption is caused by bad disk or RAM.
... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...
Personally, I can't remember a case that was caused by something other
than bad RAM or bad disk.
Let me write up a section in the manual on this for 8.1, and link it to
the wal_sync_method documentation section, and see how it looks. Even
re-ordering the items in the docs and making bullets has made it clearer
to me what is happening, and what is the default.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...
Every SCSI disk I've looked at recently has had write cache enabled by
default, fwiw.
Turning it off isn't quite the performance killer that it is on IDE, of
course, but it is there.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes:
On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...
Every SCSI disk I've looked at recently has had write cache enabled by
default, fwiw.
On SCSI, write caching is the default because the protocol is actually
designed to support it: the drive can take the data, and then take some
more, without giving the impression that the write has been done.
If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
regards, tom lane
On 2005-08-09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...
Every SCSI disk I've looked at recently has had write cache enabled by
default, fwiw.
On SCSI, write caching is the default because the protocol is actually
designed to support it: the drive can take the data, and then take some
more, without giving the impression that the write has been done.
Wrong. Write caching as controlled by the WCE parameter on mode page 8
for direct-access devices does in fact report the write operation as
complete before the bits are on the disk. The protocol supplies a number
of additional commands to flush the cache, etc., for which you'll have
to consult the specs.
The reason it's not so much of a performance killer to turn it off is that
tag-queueing (which is what you are referring to) provides for some
optimization of concurrent requests even with the cache off.
If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
I guess you haven't read the spec much, then.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes:
If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
I guess you haven't read the spec much, then.
[ shrug... ] I have seen that spec before: I was making a living by
implementing SCSI device drivers in the mid-80's. I think that anyone
who uses WCE in place of tagged command queueing is not someone whose
code I would care to rely on for mission-critical applications. TCQ
is a design that just works; WCE is someone's attempt to emulate all
the worst features of IDE.
regards, tom lane
On Tue, Aug 09, 2005 at 11:01:36PM -0400, Tom Lane wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
I guess you haven't read the spec much, then.
[ shrug... ] I have seen that spec before: I was making a living by
implementing SCSI device drivers in the mid-80's. I think that anyone
who uses WCE in place of tagged command queueing is not someone whose
code I would care to rely on for mission-critical applications. TCQ
is a design that just works; WCE is someone's attempt to emulate all
the worst features of IDE.
They're relying on you, not you on them.
Is their reliance founded upon reasonable logic, or are they unreasonably
putting the fault in your court? Depends on the issue...
Many people would rather not have to know about these 'under the hood'
issues. That doesn't mean they deserve to have their databases
corrupted to teach them the hard way why those details are useful to
know... :-)
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
On 8/9/05, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
Personally, my only complaint regarding either choice is the
assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is
deficient. As long as the default is well documented, I don't have a
problem with either 'faster but less reliable on systems configured
for speed over reliability at the operating system level (write
caching enabled)' or 'slower, but reliable, just in case the system is
configured for speed over reliability at the operating system level
(write caching enabled)'. As long as it is well documented, either is
fine. I'm not convinced that Linux is really that much safer anyways,
and when it comes to a standard WIN32 configuration option, I assume
that the WIN32 administrator is somewhat competent.
Hello guys,
There seem to be arguments for both possible default configurations
"faster but less reliable" and "slower but reliable". I personally think
that the safer configuration is better.
Anyway, I have an idea:
What do you think about letting the person who installs PostgreSQL
on Win32 decide? For Windows, we have the graphical installer
that can be improved so that the user is asked to choose between
the two possible configurations.
This way the user will be aware of this choice even if he/she does not
read the docs.
If we let this choice be made at installation time, it would be less
important which value is the default, because I think that the users
who install PostgreSQL from sources on Win32 are fewer.
And we can expect that, after bothering to install mingw and compile
PostgreSQL, they will also bother to configure it according to
their needs.
Cheers,
Adrian Maier
I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
it was apparently demonstrated that fsync was the fastest option
among the 7.4.x wal_sync_method options.
If there's a way to make this information more useful by providing
more data, please let me know, and I'll see what I can do.
--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
Strategic Open Source: Open Your i™
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)
On Aug 8, 2005, at 4:44 PM, Bruce Momjian wrote:
In summary, we added all those wal_sync_method values in hopes of
getting some data on which is best on which platform, but having gone
several years with few reports, I am thinking we should just choose the
best ones we can and move on, rather than expose a confusing API to the
users.
Does anyone show a platform where the *data* options are slower than the
non-*data* ones?
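One way to get the kind of data being asked for above is a small timing loop along the following lines; this is only a sketch (the file name, write size, and iteration count are arbitrary), not a replacement for a proper harness such as the test_fsync program carried in the source tree, if memory serves.

/*
 * Rough timing sketch for comparing fsync() and fdatasync() on one
 * machine.  Parameters are arbitrary; only the relative numbers on
 * the same box mean anything.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERS 1000

static double seconds(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

static void run(const char *label, int use_fdatasync)
{
    char block[8192];
    struct timeval start, stop;
    int i;
    int fd = open("sync.scratch", O_WRONLY | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return; }
    memset(block, 'x', sizeof(block));

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERS; i++)
    {
        /* rewrite the same block and force it out, like a WAL flush */
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            perror("write");
        if ((use_fdatasync ? fdatasync(fd) : fsync(fd)) != 0)
            perror("sync");
    }
    gettimeofday(&stop, NULL);
    close(fd);

    printf("%-10s %7.2f seconds for %d flushed writes\n",
           label, seconds(start, stop), ITERS);
}

int main(void)
{
    run("fsync", 0);
    run("fdatasync", 1);
    return 0;
}

Running the same loop with the file opened O_SYNC or O_DSYNC would cover the open_* settings as well.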
On 2005-08-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
I guess you haven't read the spec much, then.
[ shrug... ] I have seen that spec before: I was making a living by
implementing SCSI device drivers in the mid-80's. I think that anyone
who uses WCE in place of tagged command queueing is not someone whose
code I would care to rely on for mission-critical applications. TCQ
is a design that just works; WCE is someone's attempt to emulate all
the worst features of IDE.
1) Tag queueing and WCE are orthogonal concepts. It's not a question of
using one "in place of" the other. My comment was that my recent
observation of actual SCSI drives is that WCE is enabled by default and
as such _will_ be used unless either you disable it manually, or the host
OS does so.
2) What OSes in common use adapt to the WCE setting, either by turning it
off, or using FUA or issuing SYNCHRONIZE CACHE commands? Since it is
entirely transparent to the host OS, I do not believe any are, though it
looks like very recent Linux development is moving in this direction.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
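For anyone who wants to poke at this directly, the following is a rough, Linux-only sketch of issuing the SCSI SYNCHRONIZE CACHE(10) command through the SG_IO ioctl. The device path is an assumption, root privileges are usually needed, and nothing like this is done by PostgreSQL itself.

/*
 * Linux-only sketch: send SCSI SYNCHRONIZE CACHE(10) through SG_IO.
 * This is the kind of command a host OS would need to issue to flush
 * a drive's write cache; PostgreSQL does not do this itself.
 * "/dev/sda" is an assumed device path; run as root.
 */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    unsigned char cdb[10];
    unsigned char sense[32];
    struct sg_io_hdr io;
    int fd = open("/dev/sda", O_RDWR);

    if (fd < 0) { perror("open"); return 1; }

    memset(cdb, 0, sizeof(cdb));
    cdb[0] = 0x35;                       /* SYNCHRONIZE CACHE(10) opcode */

    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_NONE;  /* command only, no data phase */
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 20000;                  /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");
    else if (io.status != 0)
        fprintf(stderr, "SCSI status 0x%x\n", io.status);
    else
        printf("drive reports cache flushed\n");

    close(fd);
    return 0;
}

Whether an OS actually does something like this (or sets FUA) on fsync is exactly the question raised above.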
On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
it was apparently demonstrated that fsync was the fastest option
among the 7.4.x wal_sync_method options.
If there's a way to make this information more useful by providing
more data, please let me know, and I'll see what I can do.
What would be really interesting to me to know is what Sun did
between 8 and 9 to make that so. We don't use Solaris for databases
any more, but fsync was a lot slower than whatever we ended up using
on 8. I wouldn't be surprised if they'd wired fsync directly to
something else; but I can hardly believe it'd be faster than any
other option. (Mind, we were using the Veritas filesystem with this, as
well, which was at least half the headache.)
A
--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun
UFS was the filesystem on the Solaris 9 box.
--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
Strategic Open Source: Open Your i™
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)
On Aug 11, 2005, at 4:18 PM, Andrew Sullivan wrote:
On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
it was apparently demonstrated that fsync was the fastest option
among the 7.4.x wal_sync_method options.
If there's a way to make this information more useful by providing
more data, please let me know, and I'll see what I can do.
What would be really interesting to me to know is what Sun did
between 8 and 9 to make that so. We don't use Solaris for databases
any more, but fsync was a lot slower than whatever we ended up using
on 8. I wouldn't be surprised if they'd wired fsync directly to
something else; but I can hardly believe it'd be faster than any
other option. (Mind, we were using the Veritas filesystem with this, as
well, which was at least half the headache.)
A
On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
So the short answer is possibly "You build the tests and we'll run 'em."
Would some version of dbt2/3 work for this?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461
On Sun, 21 Aug 2005 19:27:35 -0500
"Jim C. Nasby" <jnasby@pervasive.com> wrote:
On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
So the short answer is possibly "You build the tests and we'll run 'em."
Would some version of dbt2/3 work for this?
Yeah, trying... On the larger system I'm using I'm not seeing much of a
performance difference but I'm looking for a way to see if we can
identify any benefit to bypassing the kernel cache. I've been
re-arranging disks due to failures and trying to tweak a couple of
profiling things, but I'll try to get some data to share within a few
days.
Mark
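As a rough illustration of what bypassing the kernel cache involves, a minimal O_DIRECT write on Linux looks something like the sketch below; the buffer alignment requirement is the part that usually trips people up. The file name, alignment, and block size are just examples.

/*
 * Minimal sketch of an O_DIRECT write on Linux, bypassing the kernel
 * page cache.  Buffer address, file offset, and transfer length
 * generally have to be aligned (512 bytes, or the filesystem block
 * size); the names and sizes here are arbitrary examples.
 */
#define _GNU_SOURCE              /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("direct.scratch", O_WRONLY | O_CREAT | O_DIRECT, 0600);

    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires an aligned buffer; posix_memalign provides one */
    if (posix_memalign(&buf, 4096, 8192) != 0)
    {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', 8192);

    if (write(fd, buf, 8192) != 8192)
        perror("write");             /* EINVAL usually means bad alignment */

    free(buf);
    close(fd);
    return 0;
}

Note that O_DIRECT only bypasses the kernel's cache; it does not by itself force anything past a drive's write cache, which is presumably why it is used together with a sync method rather than instead of one.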