WAL-based allocation of XIDs is insecure
Consider the following scenario:
1. A new transaction inserts a tuple. The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made. Neither one of these are on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other
disk page.
3. Assume we now crash. Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.
4. Upon restart, WAL will initialize the XID counter to the first XID
not seen in the WAL log. Guess which one that is.
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
After thinking about this for a little, it seems to me that XID
assignment should be handled more like OID assignment: rather than
handing out XIDs one-at-a-time, varsup.c should allocate them in blocks,
and should write an XLOG record to reflect the allocation of each block
of XIDs. Furthermore, the above example demonstrates that *we must
flush that XLOG entry to disk* before we can start to actually hand out
the XIDs. This ensures that the next system cycle won't re-use any XIDs
that may have been in use at the time of a crash.
OID assignment is not quite so critical. Consider again the scenario
above: we don't really care if after restart we reuse the OID that was
assigned to the crashed transaction's inserted tuple. As long as the
tuple itself is not considered committed, it doesn't matter what OID it
contains. So, it's not necessary to force XLOG flush for OID-assignment
XLOG entries.
In short then: make the XID allocation machinery just like the OID
allocation machinery presently is, plus an XLogFlush() after writing
the NEXTXID XLOG record.
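Roughly, a minimal sketch of what I have in mind (XLogPutNextXid and
XID_ALLOC_BLOCK are invented names here, not what varsup.c would
actually use):

    #define XID_ALLOC_BLOCK 1024

    static TransactionId nextXid = 0;      /* next XID to hand out */
    static TransactionId xidLogLimit = 0;  /* first XID not covered by a
                                            * flushed NEXTXID record */

    TransactionId
    GetNewTransactionId(void)
    {
        if (nextXid >= xidLogLimit)
        {
            /* Log that the next block of XIDs is now spoken for ... */
            XLogRecPtr  recptr = XLogPutNextXid(nextXid + XID_ALLOC_BLOCK);

            /*
             * ... and flush that record before handing any of them out,
             * so a crash cannot lead to re-use of an XID that was already
             * stamped into a heap page.
             */
            XLogFlush(recptr);
            xidLogLimit = nextXid + XID_ALLOC_BLOCK;
        }
        return nextXid++;
    }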
Comments?
regards, tom lane
PS: oh, another thing: redo of a checkpoint record ought to advance the
XID and OID counters to be at least what the checkpoint record shows.
Tom Lane <tgl@sss.pgh.pa.us> writes:
After thinking about this for a little, it seems to me that XID
assignment should be handled more like OID assignment: rather than
handing out XIDs one-at-a-time, varsup.c should allocate them in blocks,
and should write an XLOG record to reflect the allocation of each block
of XIDs. Furthermore, the above example demonstrates that *we must
flush that XLOG entry to disk* before we can start to actually hand out
the XIDs. This ensures that the next system cycle won't re-use any XIDs
that may have been in use at the time of a crash.
I think your example demonstrates something slightly different. I
think it demonstrates that Postgres must flush the XLOG entry to disk
before it flushes any buffer to disk which uses an XID which was just
allocated.
For each buffer, heap_update could record the last XID stored into
that buffer. When a buffer is forced out to disk, Postgres could make
sure that the XLOG entry which uses the XID is previously forced out
to disk.
A simpler and less accurate approach: when any dirty buffer is forced
to disk in order to allocate a buffer, make sure that any XLOG entry
which allocates new XIDs is flushed to disk first.
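A rough illustration of the bookkeeping I have in mind (field and
function names are invented, not real bufmgr code):

    typedef struct BufferXidBookkeeping
    {
        /* ... the existing buffer header fields ... */
        XLogRecPtr  xidAllocRecPtr;   /* XLOG position of the record that
                                       * allocated the newest XID stored
                                       * into this page */
    } BufferXidBookkeeping;

    /*
     * heap_insert/heap_update would set xidAllocRecPtr whenever they stamp
     * a freshly allocated XID into the page; the flush path would then do:
     */
    static void
    FlushBufferWithXidCheck(BufferXidBookkeeping *buf)
    {
        /* Make sure the XID-allocation record is on disk first ... */
        XLogFlush(buf->xidAllocRecPtr);
        /* ... and only then write the data page itself (smgrwrite). */
    }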
I don't know if these are better. I raise them because you are
suggesting putting an occasional fsync at transaction start to avoid
an unlikely scenario. A bit of bookkeeping can be used instead to
notice the unlikely scenario when it occurs.
Ian
Ian Lance Taylor <ian@airs.com> writes:
I think your example demonstrates something slightly different. I
think it demonstrates that Postgres must flush the XLOG entry to disk
before it flushes any buffer to disk which uses an XID which was just
allocated.
That would be an alternative solution, but it's considerably more
complex to implement and I'm not convinced it is more efficient.
The above could result, worst case, in double the normal number of
fsyncs --- each new transaction might need an fsync to dump its first
few XLOG records (in addition to the fsync for its commit), if the
shmem disk buffer traffic is not in your favor. This worst case is
not even difficult to produce: consider a series of standalone
transactions that each touch more than -B pages (-B = # of buffers).
In contrast, syncing NEXTXID records will require exactly one extra
fsync every few thousand transactions. That seems quite acceptable
to me, and better than an fsync load that we can't predict. Perhaps
the average case of fsync-on-buffer-flush would be better than that,
or perhaps not, but the worst case is definitely far worse.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Ian Lance Taylor <ian@airs.com> writes:
I think your example demonstrates something slightly different. I
think it demonstrates that Postgres must flush the XLOG entry to disk
before it flushes any buffer to disk which uses an XID which was just
allocated.
That would be an alternative solution, but it's considerably more
complex to implement and I'm not convinced it is more efficient.
The above could result, worst case, in double the normal number of
fsyncs --- each new transaction might need an fsync to dump its first
few XLOG records (in addition to the fsync for its commit), if the
shmem disk buffer traffic is not in your favor. This worst case is
not even difficult to produce: consider a series of standalone
transactions that each touch more than -B pages (-B = # of buffers).
In contrast, syncing NEXTXID records will require exactly one extra
fsync every few thousand transactions. That seems quite acceptable
to me, and better than an fsync load that we can't predict. Perhaps
the average case of fsync-on-buffer-flush would be better than that,
or perhaps not, but the worst case is definitely far worse.
I described myself unclearly. I was suggesting an addition to what
you are suggesting. The worst case can not be worse.
If you are going to allocate a few thousand XIDs each time, then I
agree that my suggested addition is not worth it. But how do you deal
with XID wraparound on an unstable system?
Ian
Ian Lance Taylor <ian@airs.com> writes:
I described myself unclearly. I was suggesting an addition to what
you are suggesting. The worst case can not be worse.
Then I didn't (and still don't) understand your suggestion. Want to
try again?
If you are going to allocate a few thousand XIDs each time, then I
agree that my suggested addition is not worth it. But how do you deal
with XID wraparound on an unstable system?
About the same as we do now: not very well. But if your system is that
unstable, XID wrap is the least of your worries, I think.
Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
only did the not-yet-used XIDs get lost in a crash, they'd get lost in
a normal shutdown too. What I propose will waste XIDs in a crash but
not in a normal shutdown, so it's still an improvement over prior
versions as far as XID consumption goes.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Ian Lance Taylor <ian@airs.com> writes:
I described myself unclearly. I was suggesting an addition to what
you are suggesting. The worst case can not be worse.
Then I didn't (and still don't) understand your suggestion. Want to
try again?
Your suggestion requires an obligatory fsync at an occasional
transaction start.
I was observing that in most cases, that fsync is not needed. It can
be avoided with a bit of additional bookkeeping.
I was assuming, incorrectly, that you would not want to allocate that
many XIDs at once. If you allocate 1000s of XIDs at once, the
obligatory fsync is not that bad, and my suggestion should be ignored.
If you are going to allocate a few thousand XIDs each time, then I
agree that my suggested addition is not worth it. But how do you deal
with XID wraparound on an unstable system?
About the same as we do now: not very well. But if your system is that
unstable, XID wrap is the least of your worries, I think.
Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
only did the not-yet-used XIDs get lost in a crash, they'd get lost in
a normal shutdown too. What I propose will waste XIDs in a crash but
not in a normal shutdown, so it's still an improvement over prior
versions as far as XID consumption goes.
I find this somewhat troubling, since I like to think in terms of
long-running systems--like, decades. But I guess it's OK (for me) if
it is fixed in the next couple of years.
Ian
Ian Lance Taylor <ian@airs.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
only did the not-yet-used XIDs get lost in a crash, they'd get lost in
a normal shutdown too. What I propose will waste XIDs in a crash but
not in a normal shutdown, so it's still an improvement over prior
versions as far as XID consumption goes.
I find this somewhat troubling, since I like to think in terms of
long-running systems--like, decades. But I guess it's OK (for me) if
it is fixed in the next couple of years.
Agreed, we need to do something about the XID-wrap problem pretty soon.
But we're not solving it for 7.1, and in the meantime I don't think
these changes make much difference either way.
regards, tom lane
1. A new transaction inserts a tuple. The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made. Neither one of these are on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other
disk page.
3. Assume we now crash. Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.
4. Upon restart, WAL will initialize the XID counter to the first XID
not seen in the WAL log. Guess which one that is.
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
I do not think this is true. Before any modification to a page the original page will be
written to the log (aka physical log).
On startup rollforward, this original page, which does not contain the inserted
tuple with the stale XID, is rewritten over the modified page.
Andreas
PS: I thus object to your proposed XID allocation change
Zeugswetter Andreas SB wrote:
1. A new transaction inserts a tuple. The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made. Neither one of these are on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other
disk page.
3. Assume we now crash. Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.
4. Upon restart, WAL will initialize the XID counter to the first XID
not seen in the WAL log. Guess which one that is.
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
I do not think this is true. Before any modification to a page the original page will be
written to the log (aka physical log).
Yes there must be XLogFlush() before writing buffers.
BTW how do we get the next XID if WAL files are corrupted ?
Regards,
Hiroshi Inoue
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
I do not think this is true. Before any modification to a page the original page will be
written to the log (aka physical log).
Yes there must be XLogFlush() before writing buffers.
BTW how do we get the next XID if WAL files are corrupted ?
Normally:
1. pg_control checkpoint info
2. checkpoint record in WAL ?
3. then rollforward of WAL
If WAL is corrupt the only way to get a consistent state is to bring the
db into a state as it was during last good checkpoint. But this is only possible
if you can at least read all "physical log" records from WAL.
Failing that, the only way would probably be to scan all heap files for XID's that are
greater than the XID from checkpoint.
I think the utility Tom has in mind, that resets WAL, will allow you to dump the db
so you can initdb and reload. I don't think it is intended that you can immediately
resume operation, (unless of course for the mentioned case of an upgrade with
a good checkpoint as last WAL record (== proper shutdown)).
Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
I do not think this is true. Before any modification to a page the
original page will be written to the log (aka physical log).
Hmm. Actually, what is written to the log is the *modified* page not
its original contents. However, on studying the buffer manager I see
that it tries to fsync the log entry describing the last mod to a data
page before it writes out the page itself. So perhaps that can be
relied on to ensure all XIDs known in the heap are known in the log.
However, I'd just as soon have the NEXTXID log records too to be doubly
sure. I do now agree that we needn't fsync the NEXTXID records,
however.
regards, tom lane
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Yes there must be XLogFlush() before writing buffers.
BTW how do we get the next XID if WAL files are corrupted ?
My not-yet-committed changes include storing the latest CheckPoint
record in pg_control (as well as in the WAL files). Recovery from
XLOG disaster will consist of generating a new XLOG that's empty
except for a CheckPoint record based on the one cached in pg_control.
In particular we can extract the nextOid and nextXid fields.
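Very roughly, pg_control would then carry something like this (a sketch
only; the actual field names and layout may differ):

    typedef struct CheckPointSketch
    {
        XLogRecPtr      redo;      /* point in XLOG where redo must begin */
        TransactionId   nextXid;   /* next free XID as of the checkpoint */
        Oid             nextOid;   /* next free OID as of the checkpoint */
    } CheckPointSketch;

    typedef struct ControlFileSketch
    {
        /* ... version, state, timestamps, etc ... */
        XLogRecPtr       checkPoint;       /* location of the latest
                                            * checkpoint record in XLOG */
        CheckPointSketch checkPointCopy;   /* cached copy of its contents,
                                            * so an empty XLOG can be
                                            * rebuilt from nextXid/nextOid */
    } ControlFileSketch;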
It might be that writing NEXTXID or NEXTOID log records should update
pg_control too with new nextXid/nextOid values --- what do you think?
Otherwise there's a possibility that the stored checkpoint is too far
back to cover all the values used since then. OTOH, we are not going
to be able to guarantee absolute consistency in this disaster recovery
scenario anyway; duplicate XIDs may be the least of one's worries.
Of course, if you lose both XLOG and pg_control, you're still in big
trouble. So it seems we should minimize the number of writes to
pg_control, which is an argument not to update it more than we must.
regards, tom lane
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
I do not think this is true. Before any modification to a page the
original page will be written to the log (aka physical log).
Hmm. Actually, what is written to the log is the *modified* page not
its original contents.
Well, that sure is not what was discussed on the list for implementation !!
The physical log page should be the page as it was during the last checkpoint.
Anything else would also not have the benefit of fixing the index page problem
this solution was intended to fix in the first place. I thus really doubt above statement.
However, on studying the buffer manager I see
that it tries to fsync the log entry describing the last mod to a data
page before it writes out the page itself. So perhaps that can be
relied on to ensure all XIDs known in the heap are known in the log.
Each page about to be modified should be written to the txlog once,
and only once before the first modification after each checkpoint.
During rollforward the pages are written back to the heap, thus no open
XIDs can be in heap pages.
However, I'd just as soon have the NEXTXID log records too to be doubly
sure. I do now agree that we needn't fsync the NEXTXID records,
however.
I do not really see an additional benefit. If the WAL is busted those records are
likely busted too.
Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
Hmm. Actually, what is written to the log is the *modified* page not
its original contents.
I thus really doubt above statement.
Read the code.
Each page about to be modified should be written to the txlog once,
and only once before the first modification after each checkpoint.
Yes, there's only one page dump per page per checkpoint. But the
sequence is (1) make the modification in shmem buffers then (2) make
the XLOG entry.
I believe this is OK since the XLOG entry is flushed before any of
the pages it affects are written out from shmem. Since we have not
changed the storage management policy, it's OK if heap pages contain
changes from uncommitted transactions --- all we must avoid is
inconsistencies (eg not all three pages of a btree split written out),
and redo of the XLOG entry will ensure that for us.
However, I'd just as soon have the NEXTXID log records too to be doubly
sure. I do now agree that we needn't fsync the NEXTXID records,
however.
I do not really see an additional benefit. If the WAL is busted those
records are likely busted too.
The point is to make the allocation of XIDs and OIDs work the same way.
In particular, if we are forced to reset the XLOG using what's stored in
pg_control, it would be good if what's stored in pg_control is a value
beyond the last-used XID/OID, not a value less than the last-used ones.
regards, tom lane
Hmm. Actually, what is written to the log is the *modified* page not
its original contents.
Well, that sure is not what was discussed on the list for implementation !!
I thus really doubt above statement.
Read the code.
Ok, sad.
Each page about to be modified should be written to the txlog once,
and only once before the first modification after each checkpoint.
Yes, there's only one page dump per page per checkpoint. But the
sequence is (1) make the modification in shmem buffers then (2) make
the XLOG entry.
I believe this is OK since the XLOG entry is flushed before any of
the pages it affects are written out from shmem. Since we have not
changed the storage management policy, it's OK if heap pages contain
changes from uncommitted transactions
Sure, but the other way would be a lot less complex.
--- all we must avoid is inconsistencies (eg not all three pages of a btree split written out), and redo of the XLOG entry will ensure that for us.
Is it so hard to swap ? First write page to log then modify in shmem.
Then those pages would have additional value, because
then utilities could do all sorts of things with those pages.
1. Create a consistent state of the db by only applying "physical log" pages
after checkpoint (in case a complete WAL rollforward bails out)
2. Create a consistent online backup snapshot, by first doing something like an
ordinary tar, and after that save all "physical log" pages.
Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
Is it so hard to swap ? First write page to log then modify in shmem.
Then those pages would have additional value, because
then utilities could do all sorts of things with those pages.
After thinking about this a little, I believe I see why Vadim did it
the way he did. Suppose we tried to make the code sequence be
obtain write lock on buffer;
XLogOriginalPage(buffer); // copy page to xlog if first since ckpt
modify buffer;
XLogInsert(xlog entry for modification);
mark buffer dirty and release write lock;
so that the saving of the original page is a separate xlog entry from
the modification data. Looks easy, and it'd sure simplify XLogInsert
a lot. The only problem is it's wrong. What if a checkpoint occurs
between the two XLOG records?
The decision whether to log the whole buffer has to be atomic with the
actual entry of the xlog record. Unless we want to hold the xlog insert
lock for the entire time that we're (eg) splitting a btree page, that
means we log the buffer after the modification work is done, not before.
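So the sequence effectively has to be (schematically, with the
page-dump decision folded into XLogInsert itself):
obtain write lock on buffer;
modify buffer;
XLogInsert(xlog entry for modification);
// inside XLogInsert, while holding the xlog insert lock: if the page
// has not been dumped since the current checkpoint's REDO point,
// attach the full page image to this same record --- the decision
// and the record entry are atomic
mark buffer dirty and release write lock;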
regards, tom lane
I wrote:
The decision whether to log the whole buffer has to be atomic with the
actual entry of the xlog record. Unless we want to hold the xlog insert
lock for the entire time that we're (eg) splitting a btree page, that
means we log the buffer after the modification work is done, not before.
On third thought --- we could still log the original page contents and
the modification log record atomically, if what were logged in the xlog
record were (essentially) the parameters to the operation being logged,
not its results. That is, make the log entry before you start doing the
mod work, not after. This might also simplify redo, since redo would be
no different from the normal case. I'm not sure why Vadim didn't choose
to do it that way; maybe there's some other fine point I'm missing.
In any case, it'd be a big code change and not something I'd want to
undertake at this point in the release cycle ... maybe we can revisit
this issue for 7.2.
regards, tom lane
Consider the following scenario:
1. A new transaction inserts a tuple. The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made. Neither one of these are on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other
disk page.
3. Assume we now crash. Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.
Impossible (with fsync ON -:)).
Seems my description of core WAL rule was bad, I'm sorry -:(
WAL = Write-*Ahead*-Log = Write data pages *only after* log records
reflecting data page modifications are *flushed* to disk =
If a modification was not logged then it's not in the data pages either.
No matter when bufmgr writes a data buffer (at commit time or to re-use
it), bufmgr first ensures that the buffer's modifications are logged.
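Schematically (not the literal bufmgr code):

    void
    WriteBufferSketch(Buffer buffer)
    {
        Page        page = BufferGetPage(buffer);

        /* LSN of the last XLOG record that modified this page */
        XLogRecPtr  pageLSN = PageGetLSN(page);

        /* Write-ahead rule: log must be on disk at least this far ... */
        XLogFlush(pageLSN);

        /* ... only then may the data page itself be written (smgrwrite). */
    }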
Vadim
The point is to make the allocation of XIDs and OIDs work the same way.
In particular, if we are forced to reset the XLOG using what's stored in
pg_control, it would be good if what's stored in pg_control is a value
beyond the last-used XID/OID, not a value less than the last-used ones.
If we're forced to reset the log (ie it's corrupted/lost) then we're forced
to dump, and only dump, data *because they are not consistent*.
So, I wouldn't worry about XID/OID/anything - we can only provide the user
with a way to restore data ... *manually*.
If user really cares about his data he must
U1. Buy good disks for WAL (data may be on not so good disks).
U2. Set up distributed DB if U1. is not enough.
To help user with above we must
D1. Avoid bugs in WAL
D2. Implement WAL-based BAR (so U1 will make sense).
D3. Implement distributed DB.
There will be no D2 & D3 in 7.1, and who knows about D1.
So, manual restoring data is the best we can do for 7.1.
And actually, "manual restoring" is what we had before,
anyway.
Vadim
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> wrote:
In short I do not think that the current implementation of
"physical log" does what it was intended to do :-(
Hm, wasn't it handling non-atomic disk writes, Andreas?
And for what else "physical log" could be used?
The point was - copy entire page content on first after
checkpoint modification, so on recovery first restore page
to consistent state, so all subsequent logged modifications
could be applied without fear about page inconsistency.
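I.e., roughly (a schematic of the check made at XLogInsert time;
doFullPageImage is an invented name):

    static bool
    doFullPageImage(Page page)
    {
        /*
         * If this page has not been logged since the last checkpoint's
         * REDO point, attach the whole (already modified) page image to
         * the record, so redo can rebuild a consistent page even if the
         * on-disk copy was torn by a partial write.  Otherwise the
         * incremental record alone is enough.
         */
        return XLByteLE(PageGetLSN(page), RedoRecPtr);
    }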
Now, why should we log page as it was *before* modification?
We would log modification anyway (yet another log record!) and
would apply it to page, so result would be the same as now when
we log page after modification - consistent *modified* page.
?
Vadim
On third thought --- we could still log the original page contents and
the modification log record atomically, if what were logged in the xlog
record were (essentially) the parameters to the operation being logged,
not its results. That is, make the log entry before you start doing the
mod work, not after. This might also simplify redo, since redo would be
no different from the normal case. I'm not sure why Vadim didn't choose
to do it that way; maybe there's some other fine point I'm missing.
There is one - indices over user defined data types: catalog is not
available at the time of recovery, so, eg, we can't know how to order
keys of "non-standard" types. (This is also why we have to recover
aborted index split ops at runtime, when catalog is already available.)
Also, there is no point why should we log original page content and
the next modification record separately.
Vadim
In short I do not think that the current implementation of
"physical log" does what it was intended to do :-(Hm, wasn't it handling non-atomic disk writes, Andreas?
Yes, but for me, that was only one (for me rather minor) issue.
I still think that the layout of PostgreSQL pages was designed to
reduce the risk of a (heap) page being inconsistent because it is
only partly written to an acceptable minimum. If your hw and os can
guarantee that it does not overwrite one [OS] block with data that was
not supplied (== junk data), the risk is zero.
And for what else "physical log" could be used?
1. create a consistent state if rollforward bails out for some reason
but log is still readable
2. have an easy way to handle index rollforward/abort
(might need to block some index modifications during checkpoint though)
3. ease the conversion to overwrite smgr
4. ease the creation of BAR to create consistent snapshot without
need for log rollforward
Now, why should we log page as it was *before* modification?
We would log modification anyway (yet another log record!) and
Oh, so currently you only do either? I would at least add the
info which slot was inserted/modified (maybe that is already there (XID)).
would apply it to page, so result would be the same as now when
we log page after modification - consistent *modified* page.
Maybe I am too focused on the implementation of one particular db,
that I am not able to see this without prejudice, and all is well as is :-)
Andreas
Hm, wasn't it handling non-atomic disk writes, Andreas?
Yes, but for me, that was only one (for me rather minor) issue.
I still think that the layout of PostgreSQL pages was designed to
reduce the risk of a (heap) page being inconsistent because it is
only partly written to an acceptable minimum. If your hw and os can
I believe that I explained why it's not a minor issue (and never was).
Eg - PageRepairFragmentation "compacts" a page exactly like other,
overwriting, DBMSes do and partial write of modified page means
lost page content.
And for what else "physical log" could be used?
1. create a consistent state if rollforward bails out for some reason
but log is still readable
What is the difference between the consistent state as it was before the checkpoint and
after it? Why should we log old page images? New (after modification) page
images are also consistent and can be used to create a consistent state.
2. have an easy way to handle index rollforward/abort
(might need to block some index modifications during checkpoint though)
There are no problems now. Page is either split (new page created/
properly initialized, right sibling updated) or not.
3. ease the conversion to overwrite smgr
?
4. ease the creation of BAR to create consistent snapshot without
need for log rollforward
Isn't it the same as 1. with "snapshot" == "state"?
Now, why should we log page as it was *before* modification?
We would log modification anyway (yet another log record!) and
Oh, so currently you only do either? I would at least add the
info which slot was inserted/modified (maybe that is already there (XID)).
Relfilenode + TID are saved, as well as anything else that would be required
to UNDO the operation, in future.
would apply it to page, so result would be the same as now when
we log page after modification - consistent *modified* page.
Maybe I am too focused on the implementation of one particular db,
that I am not able to see this without prejudice,
and all is well as is :-)
^^^^^^^^^^^^^^^^^^^^^
I hope so -:)
Vadim
Hm, wasn't it handling non-atomic disk writes, Andreas?
Yes, but for me, that was only one (for me rather minor) issue.
I still think that the layout of PostgreSQL pages was designed to
reduce the risk of a (heap) page being inconsistent because it is
only partly written to an acceptable minimum. If your hw and os can
I believe that I explained why it's not a minor issue (and never was).
Eg - PageRepairFragmentation "compacts" a page exactly like other,
But this is currently only done during vacuum and as such a special case, no ?
overwriting, DBMSes do and partial write of modified page means
lost page content.
Yes, if contents move around. Not with the original Postgres 4 heap page design
in combination with non overwrite smgr. Maybe this has changed because someone
overlooked the consequences?
This certainly changes when converting to overwrite smgr, because
then you reuse a slot that might not be the correct size and contents need to be
shifted around. For this case your "physical log" is also good, of course :-)
Andreas