9.4.1 -> 9.4.2 problem: could not access status of transaction 1
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
2015-05-27 13:13:00 PDT [27341]: [3-1] DETAIL: Could not open file
"pg_multixact/offsets/0000": No such file or directory.
2015-05-27 13:13:00 PDT [27340]: [1-1] LOG: startup process (PID 27341)
exited with exit code 1
2015-05-27 13:13:00 PDT [27340]: [2-1] LOG: aborting startup due to
startup process failure
stopped waiting
I poked around in pg_multixact/offsets and there are lots of other files in
there, just not 0000.
I tried launching postgres in standalone mode to try some debugging but it
yields the same error.
AFAICT there's nothing wrong with the hardware; it's a VM, and it's been
working okay. About a week ago we upgraded it from 9.3.5 to 9.4.1 via
pg_upgrade --link, which worked amazingly well.
I found [this report from a couple days ago](
https://bugs.archlinux.org/task/45071) from someone else that looks like
the same problem.
I put this box's [postgresql.conf up on a gist](
https://gist.github.com/skehlet/3589b0d83f2cafe19624).
What can I try next? Thanks!
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
(I was nervous about removing the check to omit reading pg_multixact
files while in recovery. Looks like my hunch was right, though the
actual problem is not what I was fearing.)
I think the fix to this is to verify whether the file exists on disk
before reading it; if it doesn't, assume the truncation has already
happened and that it's not necessary to remove it.
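The approach can be sketched in isolation like this (a minimal sketch; the existence check and function names are illustrative stand-ins for the server's SLRU code, not the actual implementation):

```c
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for SimpleLruDoesPhysicalPageExist: check
 * whether the SLRU segment file that would hold this page is on disk.
 * (The real server function also checks the file's size against the
 * page's offset within the segment.) */
static int
slru_segment_exists(const char *dir, unsigned segno)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/%04X", dir, segno);
    return access(path, F_OK) == 0;
}

/* During WAL replay, skip reading a multixact page whose segment has
 * already been truncated away; outside recovery, read unconditionally. */
static int
replay_should_read_page(const char *dir, unsigned segno, int in_recovery)
{
    if (in_recovery && !slru_segment_exists(dir, segno))
        return 0;               /* assume truncation already happened */
    return 1;                   /* proceed with the normal read path */
}
```

The actual patch under discussion gates the read on SimpleLruDoesPhysicalPageExist(), as seen later in the thread, rather than a bare access() check.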
I found [this report from a couple days ago](
https://bugs.archlinux.org/task/45071) from someone else that looks like
the same problem.
Right :-(
I think a patch like this should be able to fix it ... not tested yet.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
mxact-trunc-in-recovery.patch (text/x-diff; charset=us-ascii, +22/-8)
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I think a patch like this should be able to fix it ... not tested yet.
Thanks Alvaro. I got a compile error, so looked for other uses of
SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look
right?
+ (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl,
pageno)))
It compiled then, but it didn't seem to change anything: postgres gave the
same errors and still would not start up.
I'm recompiling it now just to be sure I didn't make any mistakes. I have
to leave for the day in a few minutes so I thought I'd check on the above
argument with you. Thanks again!
Steve Kehlet wrote:
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I think a patch like this should be able to fix it ... not tested yet.
Thanks Alvaro. I got a compile error, so looked for other uses of
SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look
right?
+ (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl,
pageno)))
Meh, I sent you the wrong version of the patch. Yeah, that's obviously
better.
It compiled then, but it didn't seem to change anything: postgres gave the
same errors and still would not start up.
I'm setting up a reproducer for the problem to verify that the patch
fixes it (evidently not). Maybe I've messed up the conditional or
something else ...
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
Hmm, so what exactly is the sequence of events here? It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
However, I can see that there could be a problem if we try to replay
an older checkpoint after having already replayed a new one - for
example, if a standby replays checkpoint A truncating the members
multixact and performs a restart point, and then replays checkpoint B
truncating the members multixact again but without performing a
restartpoint, and then is shut down, it will resume replay from
checkpoint A, and trouble will ensue. Is that the scenario, or is
there something else?
I think the fix to this is to verify whether the file exists on disk
before reading it; if it doesn't, assume the truncation has already
happened and that it's not necessary to remove it.
That might be an OK fix, but this implementation doesn't seem very
clean. If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid. Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.
I think we should modify the API for find_multixact_start. Let's have
it return a Boolean and return oldestOffset via an out parameter. If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter. Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
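A minimal standalone sketch of that proposed shape (the typedefs, the have_segment parameter, and the placeholder lookup are illustrative assumptions, not the server's actual definitions):

```c
#include <stdbool.h>

typedef unsigned int MultiXactId;
typedef unsigned int MultiXactOffset;

/* Illustrative stand-in for the relevant MultiXactState fields. */
struct
{
    MultiXactOffset oldestOffset;
    bool            oldestOffsetKnown;   /* proposed validity flag */
} MultiXactState;

/* Sketch of the proposed signature: report success via the return
 * value and deliver the offset through an out parameter.  In recovery
 * a missing segment (modeled by have_segment) is a soft failure. */
static bool
find_multixact_start(MultiXactId oldestMXact, bool in_recovery,
                     bool have_segment, MultiXactOffset *result)
{
    if (!have_segment && in_recovery)
        return false;            /* caller leaves oldestOffset unknown */
    *result = (MultiXactOffset) oldestMXact;    /* placeholder lookup */
    return true;
}

/* Caller stores both the value and the validity flag, so later logic
 * can disregard oldestOffset whenever the flag is false. */
static void
set_oldest_offset(MultiXactId oldestMXact, bool in_recovery, bool have_segment)
{
    MultiXactOffset off;

    MultiXactState.oldestOffsetKnown =
        find_multixact_start(oldestMXact, in_recovery, have_segment, &off);
    if (MultiXactState.oldestOffsetKnown)
        MultiXactState.oldestOffset = off;
}
```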
This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset? If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
Hmm, so what exactly is the sequence of events here? It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
Well, I'm not very clear on what the problematic case is. The scenario
in which I first saw this reported was a pg_basebackup taken of a very
large database, so the master could have truncated multixact and the
standby receives a truncated directory but actually tries to apply a
checkpoint that is much older than what the master currently has
transmitted as pg_multixact contents.
I think the fix to this is to verify whether the file exists on disk
before reading it; if it doesn't, assume the truncation has already
happened and that it's not necessary to remove it.
That might be an OK fix, but this implementation doesn't seem very
clean. If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid. Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.
I think we should modify the API for find_multixact_start. Let's have
it return a Boolean and return oldestOffset via an out parameter. If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter. Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
Sounds good. I think I prefer that multixact creation is rejected
altogether if the new flag is false. Is that what you mean when you say
"adjust the logic"?
This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset? If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.
I think the fix to that issue is to set the oldest offset on
TrimMultiXact. That way, once WAL replay finished we're certain that we
have a valid oldest offset to create new multixacts with.
I'm also wondering whether the call to DetermineSafeOldestOffset on
StartupMultiXact is good. At that point, we haven't replayed any WAL
yet, so the oldest multi might be pointing at a file that has already
been removed -- again considering the pg_basebackup scenario where the
multixact files are copied much later than pg_control, so the checkpoint
to replay is old but the pg_multixact contents have already been
truncated in the master and are copied truncated.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Well, I'm not very clear on what the problematic case is. The scenario
in which I first saw this reported was a pg_basebackup taken of a very
large database, so the master could have truncated multixact and the
standby receives a truncated directory but actually tries to apply a
checkpoint that is much older than what the master currently has
transmitted as pg_multixact contents.
OK, that makes sense.
That might be an OK fix, but this implementation doesn't seem very
clean. If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid. Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.
I think we should modify the API for find_multixact_start. Let's have
it return a Boolean and return oldestOffset via an out parameter. If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter. Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
Sounds good. I think I prefer that multixact creation is rejected
altogether if the new flag is false. Is that what you mean when you say
"adjust the logic"?
No. I'm not quite sure what you mean here. We can't reject multixact
creation during normal running, and during recovery, we won't create
any really new multixacts, but we must replay the creation of
multixacts. What I meant was stuff like this:
if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
    (MultiXactState->nextOffset - MultiXactState->oldestOffset >
     MULTIXACT_MEMBER_SAFE_THRESHOLD))
I meant that we'd change the second prong of the test to check
MultiXactState->nextOffsetValid && MultiXactState->nextOffset -
MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD. And
likewise change anything else that relies on oldestOffset. Or else we
guarantee that we can't reach those points until the oldestOffset is
valid, and then check that it is with an Assert() or elog().
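As a compact, self-contained illustration of that guarded second prong (the struct, field names, and threshold value here are stand-ins for illustration only):

```c
#include <stdbool.h>

typedef unsigned int MultiXactOffset;

#define MULTIXACT_MEMBER_SAFE_THRESHOLD 1000000000u  /* illustrative value */

/* Cut-down stand-in for the shared-memory state discussed above. */
struct MXState
{
    MultiXactOffset nextOffset;
    MultiXactOffset oldestOffset;
    bool            nextOffsetValid;     /* proposed validity flag */
};

/* Second prong of the test with the proposed guard: when the flag is
 * false, the member-space distance is simply disregarded. */
static bool
members_need_aggressive_vacuum(const struct MXState *state)
{
    return state->nextOffsetValid &&
           state->nextOffset - state->oldestOffset >
               MULTIXACT_MEMBER_SAFE_THRESHOLD;
}
```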
This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset? If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.
I think the fix to that issue is to set the oldest offset on
TrimMultiXact. That way, once WAL replay finished we're certain that we
have a valid oldest offset to create new multixacts with.
I'm also wondering whether the call to DetermineSafeOldestOffset on
StartupMultiXact is good. At that point, we haven't replayed any WAL
yet, so the oldest multi might be pointing at a file that has already
been removed -- again considering the pg_basebackup scenario where the
multixact files are copied much later than pg_control, so the checkpoint
to replay is old but the pg_multixact contents have already been
truncated in the master and are copied truncated.
Moving the call from StartupMultiXact() to TrimMultiXact() seems like
a good idea. I'm not sure why we didn't do that before.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
Steve: Can you tell us more about how you shut down the old cluster?
Did you by any chance perform an immediate shutdown? Do you have the
actual log messages that were written when the system was shut down
for the upgrade?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
Wait a minute, wait a minute. There's a serious problem with this
theory, at least in Steve's scenario. This message:
2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
down at 2015-05-27
That message implies a *clean shutdown*. If he had performed an
immediate shutdown or just pulled the plug, it would have said
"database system was interrupted" or some such.
There may be bugs in redo, also, but they don't explain what happened to Steve.
Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 28, 2015 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Steve Kehlet wrote:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
just dropped new binaries in place) but it wouldn't start up. I found this
in the logs:
waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
transaction 1
I am debugging a problem today that looks very similar to
this. AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.
Wait a minute, wait a minute. There's a serious problem with this
theory, at least in Steve's scenario. This message:
2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
down at 2015-05-27
That message implies a *clean shutdown*. If he had performed an
immediate shutdown or just pulled the plug, it would have said
"database system was interrupted" or some such.
There may be bugs in redo, also, but they don't explain what happened to Steve.
Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?
Err, make that pg_multixact/members, which I assume is at issue here.
You didn't show us the DETAIL line from this message, which would
presumably clarify:
FATAL: could not access status of transaction 1
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 28, 2015 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?
Err, make that pg_multixact/members, which I assume is at issue here.
You didn't show us the DETAIL line from this message, which would
presumably clarify:
FATAL: could not access status of transaction 1
And I'm still wrong, probably. The new code in 9.4.2 cares about
being able to look at an *offsets* file to find the corresponding
member offset. So most likely it is an offsets file that is missing
here. The question is, how are we ending up with an offsets file that
is referenced by the control file but not actually present on disk?
It seems like it would be good to compare the pg_controldata output to
what is actually present in pg_multixact/offsets (hopefully that's the
right directory, now that I'm on my third try) and try to understand
what is going on here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
[ speculation ]
OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to
do it:
1. Install any pre-9.3 version of the server and generate enough
multixacts to create at least TWO new segments. When you shut down
the server, all segments except for the most current one will be
removed. At this point, the only thing in
$PGDATA/pg_multixact/offsets should be a single file, and the name of
that file should not be 0000 or 0001.
2. Use pg_upgrade to upgrade to 9.3.4. It is possible that versions <
9.3.4 will also work here, but you must not use 9.3.5 or higher,
because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to
preserve relminmxid and datminmxid values. At this point,
pg_controldata on the new cluster should show an oldestMultiXid value
greater than 1 (copied from the old cluster), but all the datminmxid
values are 1. Also, initdb will have left behind a bogus 0000 file in
pg_multixact/offsets.
3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping
in the new binaries. Follow the instructions in the 9.3.5 release
notes; since you created at least TWO new segments in step one, there
will be no 0001 file, and the query there will say that you should
remove the bogus 0000 file. So do that, leaving just the good file in
pg_multixact/offsets. At this point, pg_multixact/offsets is OK, and
pg_controldata still says that oldestMultiXid > 1, so that is also OK.
The only problem is that we've got some bogus datminmxid values
floating around. Our next step will be to convince vacuum to
propagate the bogus datminmxid values back into pg_controldata.
4. Consume at least one transaction ID (e.g. SELECT txid_current())
and then do this:
postgres=# set vacuum_freeze_min_age = 0;
SET
postgres=# set vacuum_freeze_table_age = 0;
SET
postgres=# vacuum;
VACUUM
Setting the GUCs forces full table scans, so that we advance
relfrozenxid. But notice that we were careful not to just run VACUUM
FREEZE, which would have also advanced relminmxid, which, for purposes
of reproducing this bug, is not what we want to happen. So relminmxid
is still (incorrectly) set to 1 for every database. However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.
(In my testing, this step doesn't work if performed on 9.3.4; you have
to do it on 9.3.5. I think that's because of Tom's commit 78db307bb,
but I believe in a more complex test scenario you might be able to get
this to happen on 9.3.4 also.)
I believe it's the case that an autovacuum of even a single table can
substitute for this step if it happens to advance relfrozenxid but not
relminmxid.
5. The next checkpoint, or the shutdown checkpoint in any event, will
propagate the bogus value of 1 from shared memory back into the
control file.
6. Now try to start 9.3.7. It will see the bogus oldestMultiXid = 1
value in the control file, attempt to read the corresponding offsets
file, and die.
In the process of investigating this, we found a few other things that
seem like they may also be bugs:
- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems. This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.
- pg_upgrade sets datminmxid =
old_cluster.controldata.chkpnt_nxtmulti, which is correct only if
there are ZERO multixacts in use at the time of the upgrade. It would
be best, I think, to set this to the same value it had in the old
cluster, but if we're going to use a blanket value, I think it needs
to be chkpnt_oldstMulti.
- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it. Maybe Thomas can jump in here and explain the
concern.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 05/28/2015 12:56 PM, Robert Haas wrote:
FTR: Robert, you have been a Samurai on this issue. Our many thanks.
Sincerely,
jD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
Robert Haas wrote:
On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
[ speculation ]
OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to
do it:
It's a long list of steps, but if you consider them carefully, it
becomes clear that they are natural steps that a normal production
system would go through -- essentially the only one that's really
time-critical is the decision to pg_upgrade with a version before 9.3.5.
In the process of investigating this, we found a few other things that
seem like they may also be bugs:
- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems. This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.
I'm going through this one now, as it's closely related and caused
issues for us.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
[ speculation ]
[...] However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.
Ah!
[...]
- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it. Maybe Thomas can jump in here and explain the
concern.
I noticed something in passing which is probably not harmful, and not
relevant to this bug report; it was just a bit confusing while
testing: SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd. In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.
If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
< rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case. Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be <= the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page). Like this (not tested):
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
 	/* Recheck the deletion condition. If it still holds, perform deletion */
 	if ((range->rangeStart > range->rangeEnd &&
 		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
+		(range->rangeStart <= range->rangeEnd &&
 		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
 		SlruDeleteSegment(ctl, filename);
--
Thomas Munro
http://www.enterprisedb.com
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
FTR: Robert, you have been a Samurai on this issue. Our many thanks.
Thanks! I really appreciate the kind words.
So, in thinking through this situation further, it seems to me that
the situation is pretty dire:
1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value.
A value of 1 is too far in the past if your MXID counter is < 2B, and
too far in the future if your MXID counter is > 2B.
2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.
3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)
4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post. Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.
5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start. Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.
6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss. I am not totally clear on
whether a VACUUM FREEZE will fix this problem. It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb. But I'm not sure how complete a fix that is.
So what do we do about this? I have a few ideas:
A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem. That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)
B. We need to change find_multixact_start() to fail softly. This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:
- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum. If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.
- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit. If it fails here, we don't know when
to refuse multixact creation to prevent wraparound. Again, in
recovery, that's fine. If it happens in normal running, it's not
clear what to do. Refusing multixact creation is an awfully blunt
instrument. Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point. Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails. It might be more robust than what we
have now.
- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU. If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.
C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members. As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes. That seems like an
unnecessary risk.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.
What you describe is what happens if you upgrade from 9.2 or earlier.
For this case we use this call:
exec_prog(UTILITY_LOG_FILE, NULL, true,
          "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
          new_cluster.bindir,
          old_cluster.controldata.chkpnt_nxtmulti + 1,
          old_cluster.controldata.chkpnt_nxtmulti,
          new_cluster.pgdata);
This uses the old cluster's nextMulti value as oldestMulti in the new
cluster, and that value+1 is used as nextMulti. This is correct: we
don't want to preserve any of the multixact state from the previous
cluster; anything before that value can be truncated with no loss of
critical data. In fact, there is no critical data before that value at
all.
If you upgrade from 9.3, this other call is used instead:
/*
 * we preserve all files and contents, so we must preserve both "next"
 * counters here and the oldest multi present on system.
 */
exec_prog(UTILITY_LOG_FILE, NULL, true,
          "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"",
          new_cluster.bindir,
          old_cluster.controldata.chkpnt_nxtmxoff,
          old_cluster.controldata.chkpnt_nxtmulti,
          old_cluster.controldata.chkpnt_oldstMulti,
          new_cluster.pgdata);
In this case we use the oldestMulti from the old cluster as oldestMulti
in the new cluster, which is also correct.
A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem. That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)
I don't think there's anything that we need to fix here.
B. We need to change find_multixact_start() to fail softly. This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:
- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum. If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.
Sounds good.
- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit. If it fails here, we don't know when
to refuse multixact creation to prevent wraparound. Again, in
recovery, that's fine. If it happens in normal running, it's not
clear what to do. Refusing multixact creation is an awfully blunt
instrument. Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point. Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails. It might be more robust than what we
have now.
Blunt instruments have the desirable property of being simple. We don't
want any more clockwork here, I think --- this stuff is pretty
complicated already. As far as I understand, if during normal running
we see that find_multixact_start has failed, sufficient vacuuming should
get it straight eventually with no loss of data.
- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU. If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.
Fine.
C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members. As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes. That seems like an
unnecessary risk.
Not sure about this point. We did it the way you propose previously,
and found it to be a problem because sometimes we tried to read an
offset file that was no longer there. Do we really read member files
anywhere? I thought we only tried to read offset files. If we remove
member files, what is it that we try to read and find not to be present?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Alvaro Herrera wrote:
Robert Haas wrote:
2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.
What you describe is what happens if you upgrade from 9.2 or earlier.
Oh, you're referring to pg_database values, not the ones in pg_control.
Ugh :-( This invalidates my argument that there's nothing to fix,
obviously ... it's clearly broken as is.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.
[ discussion of how the control file's oldestMultiXid gets set ]
I'm talking about the datminmxid in pg_database. You're talking about
the contents of pg_control. Those are two different things. The
relevant code is not what you quote, but rather this:
/* set pg_database.datminmxid */
PQclear(executeQueryOrDie(conn_template1,
                          "UPDATE pg_catalog.pg_database "
                          "SET datminmxid = '%u'",
                          old_cluster.controldata.chkpnt_nxtmulti));
Tom previously observed this to be wrong, here:
/messages/by-id/9879.1405877821@sss.pgh.pa.us
Although Tom was correct to note that it's wrong, nothing ever got fixed. :-(
A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem. That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)
I don't think there's anything that we need to fix here.
I see your followup now agreeing this is broken. Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.
1. Figure out the oldest multixact offset that actually exists in
pg_multixacts/offsets, and use that value. If any older MXIDs still
exist, they won't be able to be looked up anyway, so if they wrap
around, it doesn't matter. The only value that needs to be reliable
in order to do this is pg_controldata's NextMultiXactId, which to the
best of my knowledge is not implicated in any of these bugs.
pg_upgrade can check that the offsets file containing that value
exists, and if not bail out. Then, start stepping backwards a file at
a time. When it hits a missing file, the first multixact in the next
file is a safe value of datminmxid for every database in the new
cluster. If older MXIDs exist, they're unreadable anyway, so if they
wrap, nothing lost. If the value is older than necessary, the first
vacuum in each database will fix it. We have to be careful: if we
step back too many files, such that our proposed datfrozenxid might
wrap, then we've got a confusing situation and had better bail out -
or at least think really carefully about what to do.
2. When we're upgrading from a version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple.
- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit. If it fails here, we don't know when
to refuse multixact creation to prevent wraparound. Again, in
recovery, that's fine. If it happens in normal running, it's not
clear what to do. Refusing multixact creation is an awfully blunt
instrument. Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point. Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails. It might be more robust than what we
have now.
Blunt instruments have the desirable property of being simple. We don't
want any more clockwork here, I think --- this stuff is pretty
complicated already. As far as I understand, if during normal running
we see that find_multixact_start has failed, sufficient vacuuming should
get it straight eventually with no loss of data.
Unfortunately, I don't believe that to be true. If
find_multixact_start() fails, we have no idea how close we are to the
member wraparound point. Sure, we can start vacuuming, but the user
can be creating new, large multixacts at top speed while we're doing
that, which could cause us to wrap around before we can finish
vacuuming.
Furthermore, if we adopted the blunt instrument, users who are in this
situation would update to 9.4.3 (or whenever these fixes get released)
and find that they can't create new MXIDs for a possibly very
protracted period of time. That amounts to an outage for which users
won't thank us.
Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does. With this approach, we can be certain that
there is a gap. And that is a darned good thing to be certain about.
C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members. As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes. That seems like an
unnecessary risk.
Not sure about this point. We did it the way you propose previously,
and found it to be a problem because sometimes we tried to read an
offset file that was no longer there. Do we really read member files
anywhere? I thought we only tried to read offset files. If we remove
member files, what is it that we try to read and find not to be present?
Do you have a link to the previous discussion?
I mean, the problem we're having right now is that sometimes we have
an offset, but the corresponding member isn't there. So clearly
offsets reference members. Do members also reference offsets? I
didn't think so, but life is full of surprises.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem. That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)
Yes, it seems like this could lead to truncation of multixacts still
referenced by tuples, leading to errors when updating, locking,
vacuuming. Why don't we have reports of that?
B. We need to change find_multixact_start() to fail softly. This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:
Here is an experimental WIP patch that changes StartupMultiXact and
SetMultiXactIdLimit to find the oldest multixact that exists on disk
(by scanning the directory), and uses that if it is more recent than
the oldestMultiXactId from shmem, when calling
DetermineSafeOldestOffset. I'm not all that happy with it, see below,
but let me know what you think.
Using unpatched master, I reproduced the startup error with a bit of a
short cut:
1. initdb, generate enough multixacts to get more than one offsets file
2. ALTER DATABASE template0 ALLOW_CONNECTIONS = true;, vacuumdb
--freeze --all, CHECKPOINT
3. verify that pg_control now holds a large oldestMultiXactId, and
note NextMultiXactId
4. shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data
5. start up: fails
Apply this patch, and it starts up successfully.
What are the repro steps for the replay problem? Is a basebackup of a
large database undergoing truncation and some good timing needed?
- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum. If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.
Isn't it enough to use the start offset for the most recent of the
oldest multixact ID and the oldest multixact found by scanning
pg_multixact/offsets? In this patch, it does that, but I'm not happy
with the time the work is done, it just doesn't seem right for
SetMultiXactIdLimit to be scanning that directory. The result of that
operation should only change when files have been truncated anyway,
and the truncate code was already doing a filesystem scan. Maybe the
truncate code should store the earliest multixact ID found on disk in
shared memory, so that SetMultiXactIdLimit can use it for free. I
tried to get that working but couldn't figure out where it should be
initialised -- StartupMultiXact is too late (StartupXLOG calls
SetMultiXactIdLimit before that), but BootstrapMultiXact and
MultiXactShmemInit didn't seem like the right places either.
- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit. If it fails here, we don't know when
to refuse multixact creation to prevent wraparound. Again, in
recovery, that's fine. If it happens in normal running, it's not
clear what to do. Refusing multixact creation is an awfully blunt
instrument. Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point. Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails. It might be more robust than what we
have now.
Done in this patch -- the truncate code calls
DetermineSafeOldestOffset with the earliest SLRU found by scanning if
that's more recent than the shmem value, and then
DetermineSafeOldestOffset applies the step-back-one-whole-segment
logic to that as before.
- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU. If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.
TruncateMultiXact already contained logic to do nothing at all if
oldestMXact is older than the earliest it can find on disk. I moved
that code into find_earliest_multixact_on_disk() to be able to use it
elsewhere too, in this patch.
C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members. As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes. That seems like an
unnecessary risk.
I don't see why the order matters. find_multixact_start() doesn't
read the members, only the offsets SLRU (ie the index into members,
not the contents of members). As I understand it, the only time we
need to access the members themselves is when we encounter multixacts
in tuple headers (updating, locking or vacuuming). If you have
truncated multixacts referenced in your tuples then you have a
different form of corruption than the
pg_upgrade-tramples-on-oldestMultiXactId case we're trying to handle
gracefully here.
--
Thomas Munro
http://www.enterprisedb.com