AW: AW: AW: AW: AW: WAL-based allocation of XIDs is insecure
I do not however see how the current solution fixes the original problem,
that we don't have a rollback for index modifications.
The index would potentially point to an empty heaptuple slot.
How? There will be an XLOG entry inserting the heap tuple before the
XLOG entry that updates the index. Rollforward will redo both. The
heap tuple might not get committed, but it'll be there.
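The ordering argument above can be illustrated with a toy replay loop. This is only a sketch under stated assumptions: `LogRecord`, `State`, `redo` and `dangling_index_entries` are hypothetical names invented for illustration, not PostgreSQL structures. It shows that when the heap insert is logged before the index insert and records are replayed in log order, any surviving prefix of the log leaves no index entry pointing at a missing heap tuple.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    lsn: int          # position in the log; lower means logged earlier
    kind: str         # "heap_insert" or "index_insert" (hypothetical kinds)
    tid: tuple        # (page, slot) of the heap tuple
    key: str = ""     # index key, used only by index_insert


@dataclass
class State:
    heap: set = field(default_factory=set)      # tuple ids present in the heap
    index: dict = field(default_factory=dict)   # key -> tuple id


def redo(log):
    """Replay records in LSN order, as a rollforward would."""
    state = State()
    for rec in sorted(log, key=lambda r: r.lsn):
        if rec.kind == "heap_insert":
            state.heap.add(rec.tid)
        elif rec.kind == "index_insert":
            state.index[rec.key] = rec.tid
    return state


def dangling_index_entries(state):
    """Index keys whose target heap tuple does not exist."""
    return [key for key, tid in state.index.items() if tid not in state.heap]


# The heap insert is logged first, then the index insert.
log = [
    LogRecord(lsn=1, kind="heap_insert", tid=(0, 1)),
    LogRecord(lsn=2, kind="index_insert", tid=(0, 1), key="foo"),
]

# A crash can only cut the log at some prefix; check every possible prefix.
for surviving in range(len(log) + 1):
    print(surviving, dangling_index_entries(redo(log[:surviving])))
```

The sketch says nothing about commit status: the replayed heap tuple is present but may belong to an uncommitted transaction, which is the case described above.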
Before commit or rollback the xlog is not flushed to disk, thus you can lose
those xlog entries, but the index page might already be on disk because of
LRU buffer reuse, no?
Another example would be a btree reorg, like adding a level, that is partway
through before a crash.
Additionally I do not see how this all works for userland index types.
None of it works for index types that don't do XLOG entries (which I
think may currently be true for everything except btree :-( ...). I
don't see how that changes if we alter the way this bit is done.
I really think that xlog entries should be done by a layer below the userland
functions. I would not like to risk WAL integrity by allowing userland to
write a messed up log record. The record would be something like:
called userland index insert for "key" and "ctid". With that info you can
easily redo, but undo would probably be hard. Thus the physical log.
Actually I am not sure index changes need to be (or are currently) logged at all.
You can deduce all necessary info from the heap xlog record
(plus maybe the original record from disk).
Andreas
Before commit or rollback the xlog is not flushed to disk, thus you can lose
those xlog entries, but the index page might already be on disk because of
LRU buffer reuse, no?
No. Buffer page is written to disk *only after corresponding records are flushed
to log* (WAL means Write-Ahead-Log - write log before modifying data pages).
Another example would be a btree reorg, like adding a level, that is partway
through before a crash.
And this is what I hopefully fixed recently with btree runtime recovery.
Vadim
Before commit or rollback the xlog is not flushed to disk, thus you can lose
those xlog entries, but the index page might already be on disk because of
LRU buffer reuse, no?
No. Buffer page is written to disk *only after corresponding records are flushed
to log* (WAL means Write-Ahead-Log - write log before modifying data pages).
You mean that for each dirty buffer that is reused, the reusing backend fsyncs
the xlog before writing the buffer to disk?
Andreas
Before commit or rollback the xlog is not flushed to disk, thus you can lose
those xlog entries, but the index page might already be on disk because of
LRU buffer reuse, no?
No. Buffer page is written to disk *only after corresponding records are flushed
to log* (WAL means Write-Ahead-Log - write log before modifying data pages).
You mean that for each dirty buffer that is reused, the reusing backend fsyncs
the xlog before writing the buffer to disk?
In short - yes.
To be accurate - XLogFlush is called to ensure that the records reflecting the buffer's
modifications are on disk. That's how it works everywhere. And that's why LRU is not a good
policy for bufmgr anymore (we discussed this already).
Vadim
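The flush-before-write rule described above can be sketched as a toy buffer manager. Everything here is an assumption made for illustration: `Wal`, `Page`, `BufferManager` and their methods are hypothetical and do not mirror the real bufmgr or the actual XLogFlush signature. Each page remembers the LSN of the last log record that touched it, and eviction forces the log out to at least that LSN before the page itself is written.

```python
class Wal:
    """Toy write-ahead log: records sit in memory until flushed."""

    def __init__(self):
        self.records = []       # all records appended so far
        self.flushed_lsn = 0    # highest LSN known to be durable

    def append(self, payload):
        self.records.append(payload)
        return len(self.records)    # LSN of the record just appended

    def flush(self, upto_lsn):
        # Make every record up to upto_lsn durable; never move backwards.
        target = min(upto_lsn, len(self.records))
        self.flushed_lsn = max(self.flushed_lsn, target)


class Page:
    def __init__(self):
        self.data = {}
        self.lsn = 0        # LSN of the last record that modified this page
        self.dirty = False


class BufferManager:
    def __init__(self, wal):
        self.wal = wal
        self.buffers = {}   # page_id -> Page held in memory
        self.disk = {}      # page_id -> (lsn, data) as written out

    def modify(self, page_id, key, value):
        page = self.buffers.setdefault(page_id, Page())
        # Log the change first, then apply it and stamp the page with the LSN.
        page.lsn = self.wal.append((page_id, key, value))
        page.data[key] = value
        page.dirty = True

    def evict(self, page_id):
        page = self.buffers.pop(page_id)
        if page.dirty:
            # WAL rule: the log must be durable up to the page's LSN
            # before the page itself reaches disk.
            self.wal.flush(page.lsn)
            self.disk[page_id] = (page.lsn, dict(page.data))


wal = Wal()
bufmgr = BufferManager(wal)
bufmgr.modify("index_page", "foo", (0, 1))
print("before evict:", wal.flushed_lsn)
bufmgr.evict("index_page")
print("after evict:", wal.flushed_lsn, bufmgr.disk["index_page"][0])
```

In this model a reused dirty buffer can cost a log flush, which is the cost behind the remark that plain LRU is no longer a good replacement policy.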
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
I really think that xlog entries should be done by a layer below the
userland functions.
That seems somewhere between impractical and impossible: how will you
tie the functional xlog entries ("insert foo into index bar") to the
resulting page modifications, unless the entries are made from code
that knows all about which pages contain what index entries? Don't
forget these things need to go into the xlog atomically.
I would not like to risk WAL integrity by allowing
userland to write a messed up log record.
Index access method code is just as critical a part of the system as
anything else. The above makes no more sense than saying that you don't
want to trust heapam.c to generate correct WAL records.
Actually I am not sure index changes need to be (or are currently)
logged at all. You can deduce all necessary info from the heap xlog
record (plus maybe the original record from disk).
This assumes that pg_index, pg_am and friends are (a) not corrupt; (b)
in the same state that they were in when the portion of the XLOG being
replayed was made. Neither of these assumptions is acceptable for WAL
recovery.
I do think there's something to your notion that XLOG should be logging
the pre-modification pages rather than post-modification, but that's
something we will have to come back to in 7.2 or later. For 7.1's
purposes there is nothing wrong with the current scheme, and I have no
desire to postpone release another few months to change it.
regards, tom lane