Bug #613: Sequence values fall back to previously checkpointed value after crash
Ben Grimm (bgrimm@zaeon.com) reports a bug with a severity of 1
The lower the number, the more severe it is.
Short Description
Sequence values fall back to previously checkpointed value after crash
Long Description
It's hard to decide if it's devastating or not, since the bug is only apparent when a backend crashes. But when a backend does crash the result is pretty awful. When a backend crashes, and subsequently all others are killed off by the postmaster to avoid shared memory corruption, sequences fall back to whatever value they had the last time the db checkpointed. I say checkpoint because this happens independently of commits, so you could have a table with a serial column do 10 committed inserts, crash a backend, and further inserts will fail with duplicate keys. I've tested this with 7.2rc2 and 7.2 STABLE using a stock postgresql.conf (all defaults).
It seems impossible to me that this is happening. I've looked at the code and seen the comment about how sequences are allocated in advance. So I figured I'd report it...
Steps to reproduce the bug:
- Create a sequence, assign it a value
- Checkpoint (optional)
- Connect to one or more backends
- select nextval (on any/all of the connections opened above) from that sequence several times, noting the first and last value returned
- kill -9 (or -11) any of the backend processes; the database will automatically kill off all the other backends.
- reconnect and select nextval from the sequence; it will return the first value (from above).
Sample Code
No file was uploaded with this report
Yikes! I have reproduced this bug. My server logs are:
LOG: database system was shut down at 2002-03-08 17:30:03 CET
LOG: checkpoint record is at 0/46D018
LOG: redo record is at 0/46D018; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 146; next oid: 16561
LOG: database system is ready
ERROR: DefineIndex: operator class "int" not supported by access method "btree"
ERROR: Relation 'test' already exists
LOG: server process (pid 21627) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and semaphores
LOG: database system was interrupted at 2002-03-11 23:22:50 CET
LOG: checkpoint record is at 0/490AB8
LOG: redo record is at 0/490AB8; undo record is at 0/0; shutdown FALSE
LOG: next transaction id: 172; next oid: 24753
LOG: database system was not properly shut down; automatic recovery in progress
LOG: ReadRecord: record with zero length at 0/490AF8
LOG: redo is not required
LOG: database system is ready
I find on reconnection after 'kill -9' that the sequence is indeed set
at 1. I did a checkpoint after the sequence creation.
---------------------------------------------------------------------------
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Yikes! I have reproduced this bug.
I believe I see the problem: MyLastRecPtr is being used in incompatible
ways.
The issue is that sequence operations are logged as "outside transaction
control", which I believe is intended to mark XLOG records that should
be redone whether or not the generating transaction commits. (Or if we
ever do xlog UNDO, records that should not be undone at xact abort.)
This classification is clearly right as far as it goes. Now
MyLastRecPtr is used to chain together the XLOG records that are
*within* xact control, so it doesn't get updated when an
outside-the-xact record is written. (At each record insert,
MyLastRecPtr is used to fill the previous-record-of-xact backlink.)
This is also fine.
The trouble is that at xact commit, we test to see if the current xact
made any loggable changes by checking MyLastRecPtr != 0. Therefore,
if we do an xact consisting ONLY of "select nextval()", this test will
mistakenly think that no xlog records were written. It will not
generate a commit record --- which is no big problem --- and will not
write or flush the xlog --- which is a big problem. An immediately
following crash will leave the sequence un-advanced.
The "no commit record" part of the logic seems okay to me, but we need
an independent test to decide whether to write/flush XLog. If we have
reported a nextval() value to the client then it seems to me we'd better
be certain that XLOG record is flushed to XLog before we report commit
to the client.
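The distinction Tom draws can be sketched in a toy model (Python, with invented names; this is not the actual backend code, just an illustration of the control flow): only records written under transaction control update the MyLastRecPtr-style backlink, so the commit-time `!= 0` test misses a transaction whose only XLOG record was a sequence advance.

```python
# Toy model (invented names, not real PostgreSQL code) of the
# commit-time decision described above. Sequence advances are logged
# "outside transaction control" and so never update MyLastRecPtr,
# which only chains records written *within* the transaction.

def run_xact(actions):
    """actions: a list of 'in-xact' / 'outside-xact' XLOG inserts.
    Returns (buggy_flushes, fixed_flushes)."""
    records = []
    my_last_rec_ptr = 0          # backlink to last within-xact record
    for kind in actions:
        records.append(kind)     # stand-in for XLogInsert
        if kind == "in-xact":
            my_last_rec_ptr = len(records)
    # Buggy commit test: flush only if a within-xact record exists.
    buggy_flushes = my_last_rec_ptr != 0
    # The proposed independent test: flush if *anything* was written.
    fixed_flushes = len(records) > 0
    return buggy_flushes, fixed_flushes

# A transaction that is only "SELECT nextval()": one outside-xact record.
# The buggy test skips the flush, so a crash just after commit loses a
# value already reported to the client; the independent test flushes.
buggy, fixed = run_xact(["outside-xact"])
```

In this sketch, `buggy` comes out False and `fixed` comes out True for the nextval-only transaction, while both tests agree for any transaction that also wrote a normal record.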
This is certainly fixable. However, here's the kicker: essentially what
this means is that we are not treating *reporting a nextval() value to
the client* as a commit-worthy event. I do not think this bug explains
the past reports that claim a nextval() value *inserted into the
database* has been rolled back. Seems to me that a subsequent tuple
insertion would create a normal XLog record which we'd flush before
commit, and thereby also flush the sequence-update XLog record.
Can anyone see a way that this mechanism explains the prior reports?
regards, tom lane
Hi Tom,
On Tuesday 12 March 2002 16:17, Tom Lane wrote:
<snip>
Can anyone see a way that this mechanism explains the prior reports?
Not sure about that, but I really feel the fix for this should go into 7.2.1,
just in case the list of patches for that is still being assembled.
Regards and best wishes,
Justin Clift
On Tue, 12 Mar 2002, Tom Lane wrote:
The "no commit record" part of the logic seems okay to me, but we need
an independent test to decide whether to write/flush XLog. If we have
reported a nextval() value to the client then it seems to me we'd better
be certain that XLOG record is flushed to XLog before we report commit
to the client.
I think the part I don't understand is why WAL is being used to update
sequence values in the first place when sequences exist independently of
transactions. In previous releases a sequence basically just existed
on disk in a specific location and updates to it updated the on disk
copy directly since there are no concurrency issues. I do realize that
running everything through WAL gives other benefits, so it's not likely
to revert back to the old way. But it would seem that the only way to
fix it is to flush the XLOG record immediately after the XLogInsert is
called, just as if the operation took place within its own transaction.
This is certainly fixable. However, here's the kicker: essentially what
this means is that we are not treating *reporting a nextval() value to
the client* as a commit-worthy event. I do not think this bug explains
the past reports that claim a nextval() value *inserted into the
database* has been rolled back. Seems to me that a subsequent tuple
insertion would create a normal XLog record which we'd flush before
commit, and thereby also flush the sequence-update XLog record.
Can anyone see a way that this mechanism explains the prior reports?
Actually, that doesn't appear to be the case either because in some of
my tests I used a serial column type and I was just inserting data into
a table. It would appear that if the sequence is in the same tuple as
the data you modified then it won't get logged. What I did was create
a table with a serial column and a varchar(255). Inserted 100 rows
filled with data, committed. Ran a checkpoint. Checked my sequence
values, inserted 10 more rows of data, committed, checked the value of
the sequence again. Kill -9 the postmaster. Tried to insert into the
table, but to no avail... duplicate key. I checked currval of the sequence
and it matched the value right after the checkpoint. I've been able to
duplicate that scenario several times.
-- Ben
bgrimm@zaeon.com wrote:
On Tue, 12 Mar 2002, Tom Lane wrote:
The "no commit record" part of the logic seems okay to me, but we need
an independent test to decide whether to write/flush XLog. If we have
reported a nextval() value to the client then it seems to me we'd better
be certain that XLOG record is flushed to XLog before we report commit
to the client.
I think the part I don't understand is why WAL is being used to update
sequence values in the first place when sequences exist independantly of
transactions. In previous releases a sequence basically just existed
on disk in a specific location and updates to it updated the on disk
copy directly since there are no concurrency issues. I do realize that
running everything through WAL gives other benefits, so it's not likely
to revert back to the old way. But it would seem that the only way to
fix it is to flush the XLOG record immediately after the XLogInsert is
called, just as if the operation took place within its own transaction.
Initially, I agreed with you, but after reading Tom's email, I
understand now. The point is that if no data-modifying transactions are
committed to the database, do we care if a crash rolls back the
sequence counter? I don't think we do because the new sequence
values couldn't have been used for anything in the database, so rollback
is OK. If there is a data modification, the commit does flush the
sequence change to disk, meaning I think we are OK.
So, actually, this now seems like a very low-severity bug in the sense that
any data modification done would have preserved the sequence change. The
only reason we see the bug is because we are calling 'nextval()' with no
intention of inserting it into the database.
While clearly we should fsync during a transaction that just changes a
sequence counter, it is not clear there is a huge problem with not
preserving it _until_ a data-modifying transaction actually commits.
This is certainly fixable. However, here's the kicker: essentially what
this means is that we are not treating *reporting a nextval() value to
the client* as a commit-worthy event. I do not think this bug explains
the past reports that claim a nextval() value *inserted into the
database* has been rolled back. Seems to me that a subsequent tuple
insertion would create a normal XLog record which we'd flush before
commit, and thereby also flush the sequence-update XLog record.
Can anyone see a way that this mechanism explains the prior reports?
Actually, that doesn't appear to be the case either because in some of
my tests I used a serial column type and I was just inserting data into
a table. It would appear that if the sequence is in the same tuple as
the data you modified then it won't get logged. What I did was create
a table with a serial column and a varchar(255). Inserted 100 rows
filled with data, committed. Ran a checkpoint. Checked my sequence
values, inserted 10 more rows of data, committed, checked the value of
the sequence again. Kill -9 the postmaster. Tried to insert into the
table, but to no avail... duplicate key. currval of the sequence and
it matched the value right after the checkpoint. I've been able to
duplicate that scenario several times.
Now, this seems to be a major report because you are saying you have
done data modifications (INSERT) after nextval() and the counter is not
being preserved. What PostgreSQL version are you running?
I just tested it here by doing a similar test of several nextval()
calls, but then doing an INSERT and kill, and on restart, the sequence
counter did have the proper value.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
bgrimm@zaeon.com wrote:
the sequence again. Kill -9 the postmaster. Tried to insert into the
table, but to no avail... duplicate key. currval of the sequence and
it matched the value right after the checkpoint. I've been able to
duplicate that scenario several times.
I just tested it here by doing a similar test of several nextval()
calls, but then doing an INSERT and kill, and on restart, the sequence
counter did have the proper value.
There have been prior reports of similar problems --- all quite
unrepeatable in my testing, and despite considerable study of the source
code I can't see how it could happen. A reproducible test case would be
a tremendous help.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
bgrimm@zaeon.com wrote:
the sequence again. Kill -9 the postmaster. Tried to insert into the
table, but to no avail... duplicate key. currval of the sequence and
it matched the value right after the checkpoint. I've been able to
duplicate that scenario several times.
I just tested it here by doing a similar test of several nextval()
calls, but then doing an INSERT and kill, and on restart, the sequence
counter did have the proper value.
There have been prior reports of similar problems --- all quite
unrepeatable in my testing, and despite considerable study of the source
code I can't see how it could happen. A reproducible test case would be
a tremendous help.
I can confirm repeatable case!
---------------------------------------------------------------------------
test=> create table test (x serial, y varchar(255));
NOTICE: CREATE TABLE will create implicit sequence 'test_x_seq' for SERIAL column 'test.x'
NOTICE: CREATE TABLE / UNIQUE will create implicit index 'test_x_key' for table 'test'
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32951 1
...
test=> select nextval('test_x_seq');
nextval
---------
41
(1 row)
test=> checkpoint;
CHECKPOINT
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32991 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
INSERT 32992 1
test=> insert into test (y) values ('lkjasdflkja sdfl;kj asdfl;kjasdf');
[ kill -9 backend ]
$ sql test
Welcome to psql, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help on internal slash commands
\g or terminate with semicolon to execute query
\q to quit
test=> select nextval('test_x_seq');
nextval
---------
42
(1 row)
Justin wrote:
Hi Tom,
On Tuesday 12 March 2002 16:17, Tom Lane wrote:
<snip>
Can anyone see a way that this mechanism explains the prior reports?
Not sure about that, but I really feel the fix for this should go into 7.2.1,
just in case the list of patches for that is still being assembled.
We are holding 7.2.1 for this.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I can confirm repeatable case!
Ah-hah, now I understand the problem. I think sequences are suffering
from premature optimization.
The crux of the matter: sequences try to avoid generating a WAL record
for every single nextval(). The idea is to generate a WAL record
every SEQ_LOG_VALS (32) nextval operations. If you crash and replay
the WAL, then when you see the WAL record you don't know exactly how
many nextval operations got done, so for safety you set the sequence
value to where-it-was plus 32. This may leave a gap in the sequence
number assignment, but it's no worse than aborting a few transactions.
The problem in the scenario Bruce exhibits is that the CHECKPOINT
forces out both the latest sequence WAL record and the current state
of the sequence relation itself. The subsequent nextval()'s advance
the sequence relation in-memory but generate no disk writes and no
WAL records. On restart, you lose: the sequence relation is back
to where it was checkpointed, and the latest WAL record for the
sequence is before the checkpoint *so it won't get rescanned*.
Thus, the sequence doesn't get pushed forward like it's supposed to.
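The failure sequence described here can be simulated in a few lines. This is a hedged toy model with invented names, not the real sequence.c logic: one WAL record covers a batch of SEQ_LOG_VALS prefetched values, and a checkpoint both persists the sequence page and moves the redo start point past that record.

```python
# Toy replay model (invented names, not real PostgreSQL code) of the
# checkpoint interaction: nextval() writes a WAL record only every
# SEQ_LOG_VALS calls, logging current-value + SEQ_LOG_VALS so that
# replay always overshoots. A checkpoint persists the page and moves
# the redo point past all existing WAL records.

SEQ_LOG_VALS = 32

class Seq:
    def __init__(self):
        self.value = 0       # in-memory last_value
        self.log_cnt = 0     # prefetched values still covered by WAL
        self.wal = []        # logged values (one per WAL record)

    def nextval(self):
        if self.log_cnt == 0:
            # Log ahead: replay will set the sequence to this value.
            self.wal.append(self.value + SEQ_LOG_VALS)
            self.log_cnt = SEQ_LOG_VALS
        self.value += 1
        self.log_cnt -= 1
        return self.value

seq = Seq()
for _ in range(40):
    seq.nextval()            # value = 40; WAL records cover up to 64

# CHECKPOINT: persist the current page and move the redo start point.
disk_value = seq.value       # 40
redo_start = len(seq.wal)    # records before this index are not rescanned

for _ in range(10):
    seq.nextval()            # advances memory only: cached range, no WAL

# Crash + recovery: start from the checkpointed page, replay WAL
# records at or after the redo point (there are none).
recovered = disk_value
for logged in seq.wal[redo_start:]:
    recovered = max(recovered, logged)

# recovered falls back to 40 although values 41..50 were handed out.
```

In the sketch, `recovered` ends up at 40 while clients were given values up to 50, which is exactly the reported symptom.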
AFAICS the only way that we could make the one-WAL-record-every-32-
nextvals idea really work would be if CHECKPOINT could nullify the
logged-in-advance state of each sequence (so that the first nextval
after a checkpoint would always generate a fresh WAL record, but
subsequent ones wouldn't have to). But I don't see any practical
way for CHECKPOINT to do that, especially not for sequences whose
disk block isn't even in memory at the instant of the CHECKPOINT.
Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
functionality and force one-WAL-record-per-nextval operation.
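Why one record per call closes the window can be shown with a simplified sketch (invented names, not the real code): if every nextval() generates a WAL record, then any value handed to a client is covered by a record at or after the checkpoint's redo point.

```python
# Simplified sketch (invented names, not PostgreSQL code): with one WAL
# record per nextval(), replay after a crash can never fall behind the
# last value reported to a client.

wal = []                     # one logged value per nextval() call
value = 0

def nextval():
    global value
    value += 1
    wal.append(value)        # SEQ_LOG_VALS removed: log every advance
    return value

for _ in range(40):
    nextval()

disk_value = value           # CHECKPOINT persists the page...
redo_start = len(wal)        # ...and moves the redo point past old WAL

for _ in range(10):
    nextval()                # each call still generates a WAL record

# Crash + recovery: replay records at or after the redo point.
recovered = max([disk_value] + wal[redo_start:])
# recovered matches the last value given out; nothing is lost.
```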
Vadim, do you see another way? This was a cool idea and I hate
to throw it away...
regards, tom lane
On Wed, 13 Mar 2002, Tom Lane wrote:
Accordingly, I'm thinking that we must remove the SEQ_LOG_VALS
functionality and force one-WAL-record-per-nextval operation.
Vadim, do you see another way? This was a cool idea and I hate
to throw it away...
It seems like you just need to ensure that when the sequence is loaded
from disk that log_cnt gets updated and written back to disk before the
sequence is used. I'm not sure of the impact, but I can't reproduce the
bugs after making these changes. I just added a flag to the SeqTableData
struct to say whether it's been logged yet - this seems like overkill..
but it works for me :-) (at least in the scenarios I've tried)
-- Ben
*** src/backend/commands/sequence.c Wed Mar 13 11:14:42 2002
--- src/backend/commands/sequence.c.orig Tue Mar 12 18:58:55 2002
***************
*** 62,68 ****
int64 cached;
int64 last;
int64 increment;
- bool islogged;
struct SeqTableData *next;
} SeqTableData;
--- 62,67 ----
***************
*** 315,321 ****
PG_RETURN_INT64(elm->last);
}
! seq = read_info("nextval", elm, &buf); /* lock page' buffer and read tuple */
last = next = result = seq->last_value;
incby = seq->increment_by;
--- 314,321 ----
PG_RETURN_INT64(elm->last);
}
! seq = read_info("nextval", elm, &buf); /* lock page' buffer and
! * read tuple */
last = next = result = seq->last_value;
incby = seq->increment_by;
***************
*** 331,337 ****
log--;
}
! if (log < fetch || !elm->islogged)
{
fetch = log = fetch - log + SEQ_LOG_VALS;
logit = true;
--- 331,337 ----
log--;
}
! if (log < fetch)
{
fetch = log = fetch - log + SEQ_LOG_VALS;
logit = true;
***************
*** 405,411 ****
rdata[0].next = &(rdata[1]);
seq->last_value = next;
! elm->islogged = seq->is_called = true;
seq->log_cnt = 0;
rdata[1].buffer = InvalidBuffer;
rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
--- 405,411 ----
rdata[0].next = &(rdata[1]);
seq->last_value = next;
! seq->is_called = true;
seq->log_cnt = 0;
rdata[1].buffer = InvalidBuffer;
rdata[1].data = (char *) page + ((PageHeader) page)->pd_upper;
***************
*** 417,424 ****
PageSetLSN(page, recptr);
PageSetSUI(page, ThisStartUpID);
! XLogFlush(recptr);
!
if (fetch) /* not all numbers were fetched */
log -= fetch;
}
--- 417,423 ----
PageSetLSN(page, recptr);
PageSetSUI(page, ThisStartUpID);
!
if (fetch) /* not all numbers were fetched */
log -= fetch;
}
***************
*** 729,735 ****
prev->next = elm;
}
- elm->islogged = false;
return elm;
}
--- 728,733 ----
Ben Grimm <bgrimm@zaeon.com> writes:
It seems like you just need to ensure that when the sequence is loaded
from disk that log_cnt gets updated and written back to disk before the
sequence is used. I'm not sure of the impact, but I can't reproduce the
bugs after making these changes. I just added a flag to the SeqTableData
struct to say whether its been logged yet - this seems like overkill..
but it works for me :-) (at least in the scenarios I've tried)
I don't think that can work. AFAICT what your patch does is to ensure
a WAL record is written by the first nextval() in any given backend
session. But what we need is to ensure a WAL record from the first
nextval() after a checkpoint. The failure cases for your patch would
involve backends that have been running for longer than one checkpoint
cycle ...
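The failure case Tom points out can be simulated in a hedged toy model (invented names, not the patched sequence.c): a per-session flag guarantees one WAL record at the first nextval() of a session, but a checkpoint during the session's lifetime moves the redo point past that record while the cache still has unlogged headroom.

```python
# Toy model (invented names, not real code) of why a per-session
# "islogged" flag is not enough: it only forces a WAL record on the
# *first* nextval() of a session, not on the first one after each
# checkpoint.

SEQ_LOG_VALS = 32

wal = []
value = 0
log_cnt = 0
session_logged = False       # the proposed per-session flag

def nextval():
    global value, log_cnt, session_logged
    if log_cnt == 0 or not session_logged:
        wal.append(value + SEQ_LOG_VALS)
        log_cnt = SEQ_LOG_VALS
        session_logged = True
    value += 1
    log_cnt -= 1
    return value

nextval()                    # first call of the session logs (covers 32)

# CHECKPOINT while the same backend stays connected:
disk_value = value           # 1
redo_start = len(wal)        # the record covering 32 is behind the redo point

for _ in range(5):
    nextval()                # cached range, session already "logged": no WAL

# Crash + recovery: only WAL at or after the redo point is replayed.
recovered = max([disk_value] + wal[redo_start:])
# recovered falls back to 1, though clients were handed values up to 6.
```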
regards, tom lane
On Wed, 13 Mar 2002, Tom Lane wrote:
I don't think that can work. AFAICT what your patch does is to ensure
a WAL record is written by the first nextval() in any given backend
session.
That's exactly what it does, yes. It forces the WAL record to be
written at least once. I think the reason this works is because the
WAL record that's written seems to be one behind what should be on
disk. So doing it once gets it ahead of the game. I'm sure it's
a very naive approach, but before yesterday I had never looked at
the source for postgresql. All I can say for my patch it is that if
it does not indeed fix the problem it masks it well enough that I
can't reproduce it.
But what we need is to ensure a WAL record from the first
nextval() after a checkpoint.
The problem in the scenario Bruce exhibits is that the CHECKPOINT
forces out both the latest sequence WAL record and the current state
of the sequence relation itself. The subsequent nextval()'s advance
the sequence relation in-memory but generate no disk writes and no
WAL records. On restart, you lose: the sequence relation is back
to where it was checkpointed, and the latest WAL record for the
sequence is before the checkpoint *so it won't get rescanned*.
Thus, the sequence doesn't get pushed forward like it's supposed to.
This isn't quite true... because until you select enough values to
get to log < fetch, it won't even have inserted a WAL record for the
CHECKPOINT to flush, so it falls back to the unmodified state. That
means the 'last_value' on disk never moves forward; in theory the
value on disk should *always* be equal to or greater than (by up to 32)
the value being returned to the client, but when you load it off disk
it isn't.
More (possibly redundant) examples:
select * from your_seq and note the values. Then select nextval a few
times, kill -9 your backend, and reconnect. select * from your_seq
again and you should see that it's identical to the previous values.
Now, try the same thing again, but force a checkpoint before killing
your backend, then select again... same values as initially.
Now, select nextval the number of times needed to get log_cnt to
loop past 0 and back up to 32, then select * from your_seq again,
note the values and checkpoint. crash the backend, reconnect and
select again... it saved it this time because it got through enough
code to do the XLogInsert.
The failure cases for your patch would
involve backends that have been running for longer than one checkpoint
cycle ...
I haven't been able to reproduce that, even checkpointing multiple
times on several open backends. But I also found a couple of mistakes
in my patch that make it a little better. I can forward the new
patch if you'd like to see it.
-- Ben