WAL shortcoming causes missing-pg_clog-segment problem
I think I've identified a primary cause for the "no such pg_clog file"
problem that we've seen reported several times.
A look at htup.h shows that the WAL only stores the low 8 bits of a
tuple's t_infomask (see xl_heap_header struct). There is some fooling
around in heapam.c's WAL redo routines to try to reconstitute some of
the high-order bits, for example this:
htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;
But this is implicitly assuming that we can reconstruct the
XMIN_COMMITTED bit at will. That was true when the WAL code was
written, but with 7.2's ability to recycle allegedly-no-longer-needed
pg_clog data, we cannot simply drop commit status bits.
The only scenario I've been able to identify in which this actually
causes a failure is when VACUUM FULL moves an old tuple and then shortly
afterwards (before the next checkpoint) there is a crash. Post-crash,
the tuple move will be redone from WAL, and the moved tuple will be
inserted with zeroed-out commit status bits. When we next examine the
tuple, we have to try to retrieve its commit status from pg_clog ... but
it's not there anymore.
As far as I can see, the only realistic solution is to store the full 16
bits of t_infomask in the WAL. We could do this without increasing WAL
size by dropping the t_hoff field from xl_heap_header --- t_hoff is
computable given the number of attributes and the HASNULL/HASOID bits,
both of which are available. (Actually, we could save some space now
by getting rid of t_oid in xl_heap_header; it's not necessary given that
OID isn't in the fixed tuple header anymore.)
This will require a WAL format change of course. Fortunately we can do
that without forcing a complete initdb (people will have to run
pg_resetxlog if they want to update a 7.3beta2 database without initdb).
I see no way to fix the problem in the context of 7.2. Perhaps we
should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
or at least to do CHECKPOINT as soon as possible after one.
regards, tom lane
[ Subject changed.]
Marc, please hold on announcing beta2 until we get this resolved. Thanks.
---------------------------------------------------------------------------
Tom Lane wrote:
I think I've identified a primary cause for the "no such pg_clog file"
problem that we've seen reported several times.A look at htup.h shows that the WAL only stores the low 8 bits of a
tuple's t_infomask (see xl_heap_header struct). There is some fooling
around in heapam.c's WAL redo routines to try to reconstitute some of
the high-order bits, for example this:htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;
But this is implicitly assuming that we can reconstruct the
XMIN_COMMITTED bit at will. That was true when the WAL code was
written, but with 7.2's ability to recycle allegedly-no-longer-needed
pg_clog data, we cannot simply drop commit status bits.The only scenario I've been able to identify in which this actually
causes a failure is when VACUUM FULL moves an old tuple and then shortly
afterwards (before the next checkpoint) there is a crash. Post-crash,
the tuple move will be redone from WAL, and the moved tuple will be
inserted with zeroed-out commit status bits. When we next examine the
tuple, we have to try to retrieve its commit status from pg_clog ... but
it's not there anymore.As far as I can see, the only realistic solution is to store the full 16
bits of t_infomask in the WAL. We could do this without increasing WAL
size by dropping the t_hoff field from xl_heap_header --- t_hoff is
computable given the number of attributes and the HASNULL/HASOID bits,
both of which are available. (Actually, we could save some space now
by getting rid of t_oid in xl_heap_header; it's not necessary given that
OID isn't in the fixed tuple header anymore.)This will require a WAL format change of course. Fortunately we can do
that without forcing a complete initdb (people will have to run
pg_resetxlog if they want to update a 7.3beta2 database without initdb).I see no way to fix the problem in the context of 7.2. Perhaps we
should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
or at least to do CHECKPOINT as soon as possible after one.regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Oops, I see beta2's on the web site with yesterday's date. Have they
been announced?
-rw-r--r-- 1 70 70 1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz
---------------------------------------------------------------------------
Tom Lane wrote:
I think I've identified a primary cause for the "no such pg_clog file"
problem that we've seen reported several times.A look at htup.h shows that the WAL only stores the low 8 bits of a
tuple's t_infomask (see xl_heap_header struct). There is some fooling
around in heapam.c's WAL redo routines to try to reconstitute some of
the high-order bits, for example this:htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;
But this is implicitly assuming that we can reconstruct the
XMIN_COMMITTED bit at will. That was true when the WAL code was
written, but with 7.2's ability to recycle allegedly-no-longer-needed
pg_clog data, we cannot simply drop commit status bits.The only scenario I've been able to identify in which this actually
causes a failure is when VACUUM FULL moves an old tuple and then shortly
afterwards (before the next checkpoint) there is a crash. Post-crash,
the tuple move will be redone from WAL, and the moved tuple will be
inserted with zeroed-out commit status bits. When we next examine the
tuple, we have to try to retrieve its commit status from pg_clog ... but
it's not there anymore.As far as I can see, the only realistic solution is to store the full 16
bits of t_infomask in the WAL. We could do this without increasing WAL
size by dropping the t_hoff field from xl_heap_header --- t_hoff is
computable given the number of attributes and the HASNULL/HASOID bits,
both of which are available. (Actually, we could save some space now
by getting rid of t_oid in xl_heap_header; it's not necessary given that
OID isn't in the fixed tuple header anymore.)This will require a WAL format change of course. Fortunately we can do
that without forcing a complete initdb (people will have to run
pg_resetxlog if they want to update a 7.3beta2 database without initdb).I see no way to fix the problem in the context of 7.2. Perhaps we
should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
or at least to do CHECKPOINT as soon as possible after one.regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Oops, I see beta2's on the web site with yesterday's date. Have they
been announced?
-rw-r--r-- 1 70 70 1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz
No, but they've been out there for more than a day. I think it's too
late to retract beta2. We could fix this problem and then do a quick
beta3, though...
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Oops, I see beta2's on the web site with yesterday's date. Have they
been announced?-rw-r--r-- 1 70 70 1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz
No, but they've been out there for more than a day. I think it's too
late to retract beta2. We could fix this problem and then do a quick
beta3, though...
Yes, that's the only solution. Let's not announce beta2 and make things
worse.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073