RE: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Started by Mikheev, Vadim about 25 years ago, 18 messages
#1 Mikheev, Vadim
vmikheev@SECTORBASE.COM

New CHECKPOINT command.
Auto removing of offline log files and creating new file
at checkpoint time.

Can you tell me how to use CHECKPOINT please?

You shouldn't normally use it - postmaster will start backend
each 3-5 minutes to do this automatically.

Is this the same as a SAVEPOINT?

No. Checkpoints are to speedup after crash recovery and
to remove/archive log files. With WAL server doesn't write
any datafiles on commit, only commit record goes to log
(and log fsync-ed). Dirty buffers remains in memory long

Is log fsynced even I turn of -F?

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Nevertheless, when bufmgr replaces dirty buffer it must
ensure first that log record of last buffer update is
on disk already and so bufmgr forces log fsync if required.
This cannot be changed - rule is simple: log before applying
changes to permanent storage.

Vadim
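
The rule stated above (log before applying changes to permanent storage) can be illustrated with a small sketch. This is not the bufmgr code: `Log`, `Buffer`, `update_page` and `evict` are hypothetical names, log positions are plain record counts, and fsync is simulated by a counter. It only shows that writing out a dirty buffer first forces the log up to that buffer's last change.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Toy write-ahead log (hypothetical): records live in memory until flushed."""
    records: list = field(default_factory=list)
    flushed_upto: int = 0   # number of records assumed to be on disk
    fsync_calls: int = 0    # counts simulated fsync() calls

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records)    # position just past this record

    def flush(self, upto: int) -> None:
        if upto <= self.flushed_upto:
            return                  # already durable, nothing to do
        self.fsync_calls += 1       # stand-in for write() + fsync()
        self.flushed_upto = upto


@dataclass
class Buffer:
    page_id: int
    data: str = ""
    dirty: bool = False
    last_lsn: int = 0   # log position of the latest change to this page


def update_page(log: Log, buf: Buffer, new_data: str) -> None:
    """Log the change first, then modify the in-memory page."""
    buf.last_lsn = log.append((buf.page_id, new_data))
    buf.data = new_data
    buf.dirty = True


def evict(log: Log, buf: Buffer, datafile: dict) -> None:
    """Write a dirty buffer out, forcing the log first if required."""
    if buf.dirty:
        log.flush(buf.last_lsn)     # log record must reach disk before the page
        datafile[buf.page_id] = buf.data
        buf.dirty = False
```

Updating a page leaves `flushed_upto` untouched; only `evict` (or a commit) forces the log, and a second `evict` of a clean buffer does nothing.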

#2 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#1)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

New CHECKPOINT command.
Auto removing of offline log files and creating new file
at checkpoint time.

Can you tell me how to use CHECKPOINT please?

You shouldn't normally use it - postmaster will start backend
each 3-5 minutes to do this automatically.

Is this the same as a SAVEPOINT?

No. Checkpoints are to speedup after crash recovery and
to remove/archive log files. With WAL server doesn't write
any datafiles on commit, only commit record goes to log
(and log fsync-ed). Dirty buffers remains in memory long

Is log fsynced even I turn of -F?

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Great. I think this middle ground is something we could never address
before.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Mikheev, Vadim (#1)

Can you tell me how to use CHECKPOINT please?

You shouldn't normally use it - postmaster will start backend
each 3-5 minutes to do this automatically.

Oh, I see.

Is this the same as a SAVEPOINT?

No. Checkpoints are to speedup after crash recovery and
to remove/archive log files. With WAL server doesn't write
any datafiles on commit, only commit record goes to log
(and log fsync-ed). Dirty buffers remains in memory long

Ok, so with CHECKPOINTS, we could move the offline log files to
somewhere else so that we could archive them, in my
understanding. Now the question is how we could recover from a disaster
like losing every table file except the log files. Can we do this with
WAL? If so, how can we do it?

Is log fsynced even I turn of -F?

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Sounds great.
--
Tatsuo Ishii
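
The idea of moving offline log files away at checkpoint time can be sketched as below. This is only an illustration under stated assumptions: `SEGMENT_SIZE`, `offline_segments` and `checkpoint` are hypothetical names, log files are modelled as numbered segments of equal size, and a segment counts as offline once it lies wholly before the position from which crash recovery would start after the checkpoint.

```python
SEGMENT_SIZE = 16  # hypothetical: log positions covered by one segment file


def offline_segments(segments, redo_position):
    """Return segments lying wholly before the checkpoint's redo position."""
    return [s for s in segments if (s + 1) * SEGMENT_SIZE <= redo_position]


def checkpoint(segments, redo_position, archive):
    """Move offline segments into `archive` and keep the rest online."""
    for seg in offline_segments(segments, redo_position):
        archive.append(seg)     # an installation could delete instead of archive
        segments.remove(seg)
    return segments
```

With segments 0 to 3 and a redo position of 40, segments 0 and 1 are archived, while segment 2 (which still contains position 40) and segment 3 stay online.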

#4 Alfred Perlstein
bright@wintelcom.net
In reply to: Tatsuo Ishii (#3)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

* Tatsuo Ishii <t-ishii@sra.co.jp> [001110 18:42] wrote:

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Sounds great.

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no? Or are you only talking about when someone is running with
async Postgresql?

Although this doesn't have an effect on my current application,
when running Postgresql with sync commits and WAL can one expect
the old behavior, ie. success only after data and meta data (log)
are written?

Another question I had was what effect a mid-fsync crash would
have on a system using WAL. Let's say someone yanks the
power while the OS is in the midst of fsync; will all be ok?

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

#5 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alfred Perlstein (#4)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

* Tatsuo Ishii <t-ishii@sra.co.jp> [001110 18:42] wrote:

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Sounds great.

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no? Or are you only talking about when someone is running with
async Postgresql?

The default is to sync on commit, but we need to give people options of
several seconds delay for performance reasons. Informix calls it
buffered logging, and it is used by most of the sites I know because it
has much better performance than sync on commit.

If the machine crashes five seconds after commit, many people don't have
a problem with just re-entering the data.

#6 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Alfred Perlstein (#4)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

* Tatsuo Ishii <t-ishii@sra.co.jp> [001110 18:42] wrote:

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Sounds great.

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no? Or are you only talking about when someone is running with
async Postgresql?

Although this doesn't have an effect on my current application,
when running Postgresql with sync commits and WAL can one expect
the old behavior, ie. success only after data and meta data (log)
are written?

Probably you misunderstand what Bruce expected to have. He wished to
have not-everytime-fsync as an *option*. I believe we will do strict
fsync by default.
--
Tatsuo Ishii

#7 Alfred Perlstein
bright@wintelcom.net
In reply to: Bruce Momjian (#5)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

* Bruce Momjian <pgman@candle.pha.pa.us> [001111 00:16] wrote:

* Tatsuo Ishii <t-ishii@sra.co.jp> [001110 18:42] wrote:

Yes, though we can change this. We also can implement now
feature that Bruce wanted so long and so much -:) -
fsync log not on each commit but each ~ 5sec, if
losing some recent commits is acceptable.

Sounds great.

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no? Or are you only talking about when someone is running with
async Postgresql?

The default is to sync on commit, but we need to give people options of
several seconds delay for performance reasons. Informix calls it
buffered logging, and it is used by most of the sites I know because it
has much better performance than sync on commit.

If the machine crashes five seconds after commit, many people don't have
a problem with just re-entering the data.

We have several critical tables and running certain updates/deletes/inserts
on them in async mode worries me. Would it be possible to add a
'set' command to force a backend into fsync mode and perhaps back
into non-fsync mode as well?

What about setting an attribute on a table that could mean
a) anyone updating me better fsync me.
b) anyone updating me better fsync me as well as fsyncing
anything else they touch.

I swear one of these days I'm going to get more familiar with the
codebase and actually submit some useful patches for the backend.
:(

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#5)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no?

The default is to sync on commit, but we need to give people options of
several seconds delay for performance reasons. Informix calls it
buffered logging, and it is used by most of the sites I know because it
has much better performance than sync on commit.

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...

regards, tom lane
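
The shared-fsync scheme described above (write the commit record, sleep briefly, fsync only if nobody else already did) can be sketched with threads standing in for backends. This is a toy model, not the xlog.c code: `SharedLog`, `commit` and `run_clients` are hypothetical names, the fsync is simulated by a counter, and the lock is held across the simulated fsync only for simplicity.

```python
import threading
import time


class SharedLog:
    """Toy shared log (hypothetical): tracks written and flushed positions."""

    def __init__(self):
        self._lock = threading.Lock()
        self.end = 0            # position just past the last record written
        self.flushed_upto = 0   # position assumed to be on stable storage
        self.fsync_calls = 0    # counts simulated fsync() calls

    def append_commit(self) -> int:
        with self._lock:
            self.end += 1
            return self.end

    def flush(self, upto: int) -> None:
        with self._lock:
            if self.flushed_upto >= upto:
                return                  # someone else's fsync covered this record
            self.fsync_calls += 1       # stand-in for the real fsync()
            self.flushed_upto = self.end


def commit(log: SharedLog, delay_s: float) -> None:
    my_position = log.append_commit()
    time.sleep(delay_s)                 # the tunable pre-fsync sleep
    log.flush(my_position)              # returns only once this record is durable


def run_clients(n: int, delay_s: float) -> SharedLog:
    log = SharedLog()
    threads = [threading.Thread(target=commit, args=(log, delay_s)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every `commit` still returns only after its own record is covered by a flush, so no guarantee is given up; the number of fsync calls depends on timing and can be anywhere from one to the number of clients.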

#9 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#8)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Not really, I thought an ack on a commit would mean that the data
is actually in stable storage, breaking that would be pretty bad
no?

The default is to sync on commit, but we need to give people options of
several seconds delay for performance reasons. Informix calls it
buffered logging, and it is used by most of the sites I know because it
has much better performance than sync on commit.

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

It does not give up consistency. The db is still consistent, it is just
consistent from a few seconds ago, rather than commit time. This is
standard Informix practice at most law firms I work with.

Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...

That would work.

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

It does not give up consistency. The db is still consistent, it is just
consistent from a few seconds ago, rather than commit time.

No, it isn't consistent. Without the fsync you don't know what order
the kernel will choose to plop down WAL log blocks in; you could end up
with a corrupt log. (Actually, perhaps that could be worked around if
the log blocks are suitably marked so that you can tell where the last
sequentially valid one is. I haven't looked at the log structure in
any detail...)

regards, tom lane
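
The workaround hinted at in the parenthesis (mark log blocks so the last sequentially valid one can be found) might look like the following sketch. The block layout is an assumption made for illustration, not the actual WAL format: each fixed-size block carries a sequence number and a CRC-32 of its payload, and `make_block` and `last_valid_block` are hypothetical helpers.

```python
import struct
import zlib

BLOCK_SIZE = 32                     # hypothetical block size in bytes
HEADER = struct.Struct("<II")       # (sequence number, CRC-32 of payload)
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


def make_block(seq: int, payload: bytes) -> bytes:
    """Build one log block: header followed by a zero-padded payload."""
    body = payload.ljust(PAYLOAD_SIZE, b"\0")[:PAYLOAD_SIZE]
    return HEADER.pack(seq, zlib.crc32(body)) + body


def last_valid_block(log: bytes) -> int:
    """Return the sequence number ending the unbroken valid run (0 if none)."""
    expected = 1
    for off in range(0, len(log) - BLOCK_SIZE + 1, BLOCK_SIZE):
        seq, crc = HEADER.unpack_from(log, off)
        body = log[off + HEADER.size: off + BLOCK_SIZE]
        if seq != expected or zlib.crc32(body) != crc:
            break                   # gap or damaged block: stop replaying here
        expected += 1
    return expected - 1
```

If the kernel happened to write block 4 before block 3 and the machine then crashed, the scan stops at block 2, so recovery would ignore the later block rather than replay across a hole.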

#11 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#10)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

It does not give up consistency. The db is still consistent, it is just
consistent from a few seconds ago, rather than commit time.

No, it isn't consistent. Without the fsync you don't know what order
the kernel will choose to plop down WAL log blocks in; you could end up
with a corrupt log. (Actually, perhaps that could be worked around if
the log blocks are suitably marked so that you can tell where the last
sequentially valid one is. I haven't looked at the log structure in
any detail...)

I am just suggesting that instead of flushing the log on every
transaction end, just do it every X seconds.

#12 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#10)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

It does not give up consistency. The db is still consistent, it is just
consistent from a few seconds ago, rather than commit time.

No, it isn't consistent. Without the fsync you don't know what order
the kernel will choose to plop down WAL log blocks in; you could end up
with a corrupt log. (Actually, perhaps that could be worked around if
the log blocks are suitably marked so that you can tell where the last
sequentially valid one is. I haven't looked at the log structure in
any detail...)

Well, WAL already has to be careful in the order it plops down the log
blocks because a single transaction can span multiple log blocks.

#13 Alfred Perlstein
bright@wintelcom.net
In reply to: Tom Lane (#10)
Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

* Tom Lane <tgl@sss.pgh.pa.us> [001111 12:06] wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.

It does not give up consistency. The db is still consistent, it is just
consistent from a few seconds ago, rather than commit time.

No, it isn't consistent. Without the fsync you don't know what order
the kernel will choose to plop down WAL log blocks in; you could end up
with a corrupt log. (Actually, perhaps that could be worked around if
the log blocks are suitably marked so that you can tell where the last
sequentially valid one is. I haven't looked at the log structure in
any detail...)

This could be fixed by using O_FSYNC on the open call for the WAL
data files on *BSD; I'm not sure of the SysV equivalent, but I know
it exists.
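
A minimal sketch of opening a log file for synchronous writes, assuming a POSIX-like system: POSIX names the flag O_SYNC, which Python exposes as `os.O_SYNC` where the platform has it. The function name `append_synchronously` is hypothetical, and the `getattr` fallback means the flag is silently skipped on platforms that lack it.

```python
import os
import tempfile

# O_SYNC asks that each write() return only after the data has reached the
# device. It is absent on some platforms, hence the getattr fallback to 0.
SYNC_FLAG = getattr(os, "O_SYNC", 0)


def append_synchronously(path: str, record: bytes) -> int:
    """Append one record to the file at `path`, opened with the sync flag."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | SYNC_FLAG
    fd = os.open(path, flags, 0o600)
    try:
        return os.write(fd, record)     # number of bytes written
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "wal_demo")
    append_synchronously(demo_path, b"commit 1\n")
```

The trade-off is that every write then pays the synchronous cost, not just the one at commit.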

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

#14 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Alfred Perlstein (#13)
AW: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

I am just suggesting that instead of flushing the log on every
transaction end, just do it every X seconds.

Or maybe more practical is, when the log buffer fills.
And of course during checkpoints.

Andreas

#15 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#14)
Re: AW: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

I am just suggesting that instead of flushing the log on every
transaction end, just do it every X seconds.

Or maybe more practical is, when the log buffer fills.
And of course during checkpoints.

Log filling is too arbitrary. If I commit something, and nothing
happens for 2 hours, we should commit that transaction.

#16 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Zeugswetter Andreas SB (#14)

I am just suggesting that instead of flushing the log on every
transaction end, just do it every X seconds.

Or maybe more practical is, when the log buffer fills.
And of course during checkpoints.

Also before backend's going to write dirty buffer from pool
to system cache - changes must be logged before reflected
in data files.

Vadim

#17 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#15)
RE: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

You are going to kernel call/yield anyway to fsync, so why
not try and if someone does the fsync, we don't need to do it.
I am suggesting re-checking the need for fsync after the return
from sleep(0).

It might make more sense to keep a private copy of the last time
the file was modified per-backend by that particular backend and
a timestamp of the last fsync shared globally so one can forgo the
fsync if "it hasn't been dirtied by me since the last fsync"

This would provide a rendezvous point for the fsync call, although
it would cost more, as one would need to periodically call gettimeofday
to set the modified-by-me timestamp as well as the post-fsync shared
timestamp.

Already made, but without timestamps. WAL maintains last byte of log
written/fsynced in shmem, so XLogFlush(_last_byte_to_be_flushed_)
will do nothing if data are already on disk.

Vadim

#18 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Mikheev, Vadim (#17)
AW: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

Ewe, so we have this 1/200 second delay for every transaction. Seems
bad to me.

I think as long as it becomes a tunable this isn't a bad idea at
all. Fixing it at 1/200 isn't so great because people not wrapping
large amounts of inserts/updates with transaction blocks will
suffer.

I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).

I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.

I don't think anything that simply yields the processor works on
multiprocessor machines.

The point is that fsync is so expensive that a wait time in the
milliseconds is needed, and not microseconds, to really improve
tx throughput for many clients.

I support the default to not delay point, since only a very heavily loaded
database will see a lot of fsyncs in the same millisecond timeslice.
A dba coping with a very heavily loaded database will need to tune
anyway, so for him one additional config is no problem.

Andreas