Hot Standby: too many KnownAssignedXids

Started by Joachim Wielandabout 15 years ago17 messages
#1Joachim Wieland
joe@mcknight.de

Hi,

I am seeing the following here on 9.0.1 on Linux x86-64:

LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23

and this is the complete history:

postgres was running as HS in foreground, Ctrl-C'ed it for a restart.

LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating walreceiver process due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down

Started it up again:

$ postgres -D /db/
LOG: database system was shut down in recovery at 2010-11-19 14:36:30 EST
LOG: entering standby mode
cp: cannot stat `/archive/00000001000001F900000001': No such file or directory
cp: cannot stat `/archive/00000001000001F8000000FC': No such file or directory
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 30052) exited with exit code 1
LOG: terminating any other active server processes

(copied the log files over...)

./postgres -D /db/

LOG: database system was interrupted while in recovery at log time
2010-11-19 14:36:12 EST
HINT: If this has occurred more than once some data might be
corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: restored log file "00000001000001F900000001" from archive
LOG: restored log file "00000001000001F8000000FC" from archive
LOG: redo starts at 1F8/FC00E978
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 31581) exited with exit code 1
LOG: terminating any other active server processes

Changing the line in the source code to give some more output gives me:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
pArray->maxKnownAssignedXids: 6890

I still have the server, if you want me to debug anything or send a
patch against 9.0.1 that gives more output, just let me know.

Joachim

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Joachim Wieland (#1)
Re: Hot Standby: too many KnownAssignedXids

On Sat, Nov 20, 2010 at 6:46 AM, Joachim Wieland <joe@mcknight.de> wrote:

I still have the server, if you want me to debug anything or send a
patch against 9.0.1 that gives more output, just let me know.

Per previous Simon's comment, the following information would be useful.
http://archives.postgresql.org/pgsql-general/2010-10/msg00154.php

--------------
If you suspect a bug in Hot Standby, please set
trace_recovery_messages = DEBUG2
in postgresql.conf and repeat the action

Always useful to know
* max_connections
* current number of sessions
* whether we have two phase commits happening
--------------

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Joachim Wieland
joe@mcknight.de
In reply to: Fujii Masao (#2)
Re: Hot Standby: too many KnownAssignedXids

On Sun, Nov 21, 2010 at 11:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

--------------
If you suspect a bug in Hot Standby, please set
       trace_recovery_messages = DEBUG2
in postgresql.conf and repeat the action

Always useful to know
* max_connections
* current number of sessions
* whether we have two phase commits happening
--------------

The trace_recovery_messages parameter does not give more output...

max_connections is set to 100

there have been no sessions on the standby itself, but a few on the
primary database, I don't know how much but probably not more than 10.
The sessions there were doing quite some load however, among them
slony synchronization, since the hot standby master database was
actually a slony replica.

max_prepared_transactions has not been changed from its default value of 0.

Joachim

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#1)
Re: Hot Standby: too many KnownAssignedXids

On 19.11.2010 23:46, Joachim Wieland wrote:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
pArray->maxKnownAssignedXids: 6890

Hmm, that's a lot of entries in KnownAssignedXids.

Can you recompile with WAL_DEBUG, and run the recovery again with
wal_debug=on ? That will print all the replayed WAL records, which is a
lot of data, but it might give a hint what's going on.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#4)
Re: Hot Standby: too many KnownAssignedXids

On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 19.11.2010 23:46, Joachim Wieland wrote:

FATAL:  too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
pArray->maxKnownAssignedXids: 6890

Hmm, that's a lot of entries in KnownAssignedXids.

Can you recompile with WAL_DEBUG, and run the recovery again with
wal_debug=on ? That will print all the replayed WAL records, which is a lot
of data, but it might give a hint what's going on.

Sure, but this gives me only one more line:

[...]
LOG: redo starts at 1F8/FC00E978
LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid
385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid
3829898/23
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 4587) exited with exit code 1
LOG: terminating any other active server processes

Joachim

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#5)
Re: Hot Standby: too many KnownAssignedXids

On 24.11.2010 06:56, Joachim Wieland wrote:

On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 19.11.2010 23:46, Joachim Wieland wrote:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
pArray->maxKnownAssignedXids: 6890

Hmm, that's a lot of entries in KnownAssignedXids.

Can you recompile with WAL_DEBUG, and run the recovery again with
wal_debug=on ? That will print all the replayed WAL records, which is a lot
of data, but it might give a hint what's going on.

Sure, but this gives me only one more line:

[...]
LOG: redo starts at 1F8/FC00E978
LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid
385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid
3829898/23
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 4587) exited with exit code 1
LOG: terminating any other active server processes

Thanks, I can reproduce this now. This happens when you have a wide gap
between the oldest still active xid and the latest xid.

When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedXids() to make note that all xids between that range
are in-progress, but there isn't enough room in the array for that.

We normally get away with a smallish array because the array is trimmed
at commit and abort records, and the special xid-assignment record to
handle the case of a lot of subtransactions. We initialize the array
from the running-xacts record that's written at a checkpoint. That
mechanism fails in this case because the heap insert record is seen
before the running-xacts record, causing all those xids in the range
100-10000 to be considered in-progress. The running-xacts record that
comes later would prune them, but we don't have enough slots to hold
them until that.

Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash
the xids we see during WAL replay in private memory instead of putting
them in the KnownAssignedXids array until we see the running-xacts record.

To reproduce this, I did this in the master:

postgres=# CREATE FUNCTION insertfunc(n integer) RETURNS VOID AS $$
declare
i integer;
begin
FOR i IN 1..n LOOP
BEGIN
INSERT INTO foo VALUES (1);
EXCEPTION WHEN division_by_zero THEN RAISE NOTICE 'divbyzero';
END;
END LOOP;
end;
$$ LANGUAGE plpgsql;
postgres=# SELECT insertfunc(100000000);

After letting that run for a while, so that a couple of checkpoints have
occurred, kill the master and start standby to recover that from
archive. After it has replayed all the WAL, stop the standby and restart it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#6)
Re: Hot Standby: too many KnownAssignedXids

On 24.11.2010 12:48, Heikki Linnakangas wrote:

On 24.11.2010 06:56, Joachim Wieland wrote:

On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 19.11.2010 23:46, Joachim Wieland wrote:

FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
pArray->maxKnownAssignedXids: 6890

Hmm, that's a lot of entries in KnownAssignedXids.

Can you recompile with WAL_DEBUG, and run the recovery again with
wal_debug=on ? That will print all the replayed WAL records, which is
a lot
of data, but it might give a hint what's going on.

Sure, but this gives me only one more line:

[...]
LOG: redo starts at 1F8/FC00E978
LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid
385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid
3829898/23
FATAL: too many KnownAssignedXids
CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
LOG: startup process (PID 4587) exited with exit code 1
LOG: terminating any other active server processes

Thanks, I can reproduce this now. This happens when you have a wide gap
between the oldest still active xid and the latest xid.

When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedXids() to make note that all xids between that range
are in-progress, but there isn't enough room in the array for that.

We normally get away with a smallish array because the array is trimmed
at commit and abort records, and the special xid-assignment record to
handle the case of a lot of subtransactions. We initialize the array
from the running-xacts record that's written at a checkpoint. That
mechanism fails in this case because the heap insert record is seen
before the running-xacts record, causing all those xids in the range
100-10000 to be considered in-progress. The running-xacts record that
comes later would prune them, but we don't have enough slots to hold
them until that.

Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash
the xids we see during WAL replay in private memory instead of putting
them in the KnownAssignedXids array until we see the running-xacts record.

Looking closer at RecordKnownAssignedTransactionIds(), there's a related
much more serious bug there too. When latestObservedXid is initialized
to the oldest still-running xid, oldestActiveXid, at WAL recovery, we
zero the CLOG starting from the oldestActiveXid. That will zap away the
clog bits of any old transactions that had already committed before the
checkpoint started, but were younger than the oldest still running
transaction. The transactions will be lost :-(.

It's dangerous to initialize latestObservedXid to anything to an older
value. The idea of keeping the seen xids in a temporary list private to
the startup process until the running-xacts record would solve that
problem too. ProcArrayInitRecoveryInfo() would not be needed anymore,
the KnownAssignedXids tracking would start at the first running-xacts
record (or shutdown checkpoint) we see, not any sooner than that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#7)
Re: Hot Standby: too many KnownAssignedXids

On 24.11.2010 13:38, Heikki Linnakangas wrote:

It's dangerous to initialize latestObservedXid to anything to an older
value.

older value than the nextXid-1 from the checkpoint record, I meant to say.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Hot Standby: too many KnownAssignedXids

On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote:

When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedXids() to make note that all xids between that
range are in-progress, but there isn't enough room in the array for
that.

Agreed.

Hmm. I'm not sure off the top of my head how to fix that. Perhaps

stash

the xids we see during WAL replay in private memory instead of
putting
them in the KnownAssignedXids array until we see the running-xacts
record.

Moving LogStandbySnapshot() earlier will help but won't solve it fully.

Will think.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#6)
1 attachment(s)
Re: Hot Standby: too many KnownAssignedXids

On 24.11.2010 12:48, Heikki Linnakangas wrote:

When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedXids() to make note that all xids between that range
are in-progress, but there isn't enough room in the array for that.

We normally get away with a smallish array because the array is trimmed
at commit and abort records, and the special xid-assignment record to
handle the case of a lot of subtransactions. We initialize the array
from the running-xacts record that's written at a checkpoint. That
mechanism fails in this case because the heap insert record is seen
before the running-xacts record, causing all those xids in the range
100-10000 to be considered in-progress. The running-xacts record that
comes later would prune them, but we don't have enough slots to hold
them until that.

Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash
the xids we see during WAL replay in private memory instead of putting
them in the KnownAssignedXids array until we see the running-xacts record.

So, here's a patch using that approach.

Another approach would be to revisit the way the running-xacts snapshot
is taken. Currently, we first take a snapshot, and then WAL-log it.
There is a small window between the steps where backends can begin/end
transactions, and recovery has to deal with that. When this was
designed, there was long discussion on whether we should instead grab
WALInsertLock and ProcArrayLock at the same time, to ensure that the
running-xacts snapshot represents an up-to-date situation at the point
in WAL where it's inserted.

We didn't want to do that because both locks can be heavily contended.
But maybe we should after all. It would make the recovery code simpler.

If we want to get fancy, we wouldn't necessarily need to hold both locks
for the whole duration. We could first grab ProcArrayLock and construct
the snapshot. Then grab WALInsertLock and release ProcArrayLock, and
finally write the WAL record and release WALInsertLock. But that would
require small changes to XLogInsert.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

knownassignedxids-fix-1.patchtext/x-diff; name=knownassignedxids-fix-1.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 6095,6102 **** StartupXLOG(void)
  			StartupSUBTRANS(oldestActiveXID);
  			StartupMultiXact();
  
- 			ProcArrayInitRecoveryInfo(oldestActiveXID);
- 
  			/*
  			 * If we're beginning at a shutdown checkpoint, we know that
  			 * nothing was running on the master at this point. So fake-up an
--- 6095,6100 ----
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 435,453 **** ProcArrayClearTransaction(PGPROC *proc)
  }
  
  /*
-  * ProcArrayInitRecoveryInfo
-  *
-  * When trying to assemble our snapshot we only care about xids after this value.
-  * See comments for LogStandbySnapshot().
-  */
- void
- ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid)
- {
- 	latestObservedXid = oldestActiveXid;
- 	TransactionIdRetreat(latestObservedXid);
- }
- 
- /*
   * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
   *
   * Takes us through 3 states: Initialized, Pending and Ready.
--- 435,440 ----
***************
*** 519,533 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * OK, we need to initialise from the RunningTransactionsData record
! 	 */
! 
! 	/*
! 	 * Remove all xids except xids later than the snapshot. We don't know
! 	 * exactly which ones that is until precisely now, so that is why we allow
! 	 * xids to be added only to remove most of them again here.
  	 */
- 	ExpireOldKnownAssignedTransactionIds(running->nextXid);
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
--- 506,514 ----
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * Release any locks belonging to old transactions that are not
! 	 * running according to the running-xacts record.
  	 */
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
***************
*** 536,544 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
- 	 * Combine the running xact data with already known xids, if any exist.
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * combine them first, sort them and then re-add to KnownAssignedXids.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
--- 517,524 ----
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * sort them first.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
***************
*** 547,597 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
  
  	/*
! 	 * Allocate a temporary array so we can combine xids. The total of both
! 	 * arrays should never normally exceed TOTAL_MAX_CACHED_SUBXIDS.
! 	 */
! 	xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS);
! 
! 	/*
! 	 * Get the remaining KnownAssignedXids. In most cases there won't be any
! 	 * at all since this exists only to catch a theoretical race condition.
  	 */
! 	nxids = KnownAssignedXidsGet(xids, InvalidTransactionId);
! 	if (nxids > 0)
! 		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
! 
! 	/*
! 	 * Now we have a copy of any KnownAssignedXids we can zero the array
! 	 * before we re-insert combined snapshot.
! 	 */
! 	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
  
  	/*
! 	 * Add to the temp array any xids which have not already completed, taking
! 	 * care not to overflow in extreme cases.
  	 */
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
  		 * The running-xacts snapshot can contain xids that were running at
! 		 * the time of the snapshot, yet complete before the snapshot was
! 		 * written to WAL. They're running now, so ignore them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
- 
- 		/*
- 		 * Test for overflow only after we have filtered out already complete
- 		 * transactions.
- 		 */
- 		if (nxids > TOTAL_MAX_CACHED_SUBXIDS)
- 			elog(ERROR, "too many xids to add into KnownAssignedXids");
  	}
  
  	if (nxids > 0)
--- 527,557 ----
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
+ 	Assert(procArray->numKnownAssignedXids == 0);
  
  	/*
! 	 * Allocate a temporary array to avoid modifying the array passed as
! 	 * argument.
  	 */
! 	xids = palloc(sizeof(TransactionId) * running->xcnt);
  
  	/*
! 	 * Add to the temp array any xids which have not already completed.
  	 */
+ 	nxids = 0;
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
  		 * The running-xacts snapshot can contain xids that were running at
! 		 * the time of the snapshot, yet completed before the snapshot was
! 		 * written to WAL. They're not running anymore, so ignore them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
  	}
  
  	if (nxids > 0)
***************
*** 603,621 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
- 		 * Re-initialise latestObservedXid to the highest xid we've seen.
- 		 */
- 		latestObservedXid = xids[nxids - 1];
- 
- 		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 		{
! 			TransactionId xid = xids[i];
! 
! 			KnownAssignedXidsAdd(xid, xid, true);
! 		}
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
--- 563,572 ----
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 			KnownAssignedXidsAdd(xids[i], xids[i], true);
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
***************
*** 623,630 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	pfree(xids);
  
  	/*
! 	 * Now we've got the running xids we need to set the global values thare
! 	 * used to track snapshots as they evolve further
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
--- 574,613 ----
  	pfree(xids);
  
  	/*
! 	 * Ok, KnownAssignedXids has now been initialized with the xids from
! 	 * the snapshot. But we still have to add any XIDs we saw before the
! 	 * running-xacts record, that belong to transactions that started
! 	 * between taking the snapshot was taken and writing it to WAL.
! 	 * RecordKnownAssignedXids() has been updating latestObservedXids to
! 	 * keep track of those.
! 	 *
! 	 * XXX: This can lead to overflowing the KnownAssignedXids array, if
! 	 * a large number of new subtransactions started in that window. That
! 	 * should be rare in practice; we try to keep the window small, so that
! 	 * there isn't time to create many subtransactions. We could avoid the
! 	 * problem by keeping track of xact-assignment records that we see
! 	 * before the running-xact record, and but it doesn't seem worth it.
! 	 */
! 	if (!TransactionIdIsValid(latestObservedXid))
! 	{
! 		latestObservedXid = running->nextXid;
! 		TransactionIdRetreat(latestObservedXid);
! 	}
! 	else
! 	{
! 		TransactionId xid = running->nextXid;
! 
! 		while (TransactionIdPrecedesOrEquals(xid, latestObservedXid))
! 		{
! 			if (!TransactionIdDidCommit(xid) && !TransactionIdDidAbort(xid))
! 				KnownAssignedXidsAdd(xid, xid, true);
! 			TransactionIdAdvance(xid);
! 		}
! 	}
! 
! 	/*
! 	 * Now we've got the running xids we need to set the global values that
! 	 * are used to track snapshots as they evolve further.
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
***************
*** 2337,2346 **** DisplayXidCache(void)
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * type apart from XLOG_RUNNING_XACTS (since that initialises the first
!  * snapshot so that RecordKnownAssignedTransactionIds() can be called). Must
!  * be called for each record after we have executed StartupCLOG() et al,
!  * since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
--- 2320,2327 ----
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * associated with a transaction. Must be called for each record after we
!  * have executed StartupCLOG() et al, since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
***************
*** 2348,2360 **** void
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
- 	Assert(TransactionIdIsValid(latestObservedXid));
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
--- 2329,2354 ----
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
+ 	 * If the KnownAssignedXids machinery isn't up yet, just update
+ 	 * latestObservedXid.
+ 	 */
+ 	if (standbyState == STANDBY_INITIALIZED)
+ 	{
+ 		if (!TransactionIdIsValid(latestObservedXid) ||
+ 			TransactionIdFollows(xid, latestObservedXid))
+ 			latestObservedXid = xid;
+ 		return;
+ 	}
+ 
+ 	Assert(TransactionIdIsValid(latestObservedXid));
+ 
+ 	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 28,34 **** extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayClearTransaction(PGPROC *proc);
  
- extern void ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid);
  extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
  extern void ProcArrayApplyXidAssignment(TransactionId topxid,
  							int nsubxids, TransactionId *subxids);
--- 28,33 ----
#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#10)
1 attachment(s)
Re: Hot Standby: too many KnownAssignedXids

On 01.12.2010 20:51, Heikki Linnakangas wrote:

Another approach would be to revisit the way the running-xacts snapshot
is taken. Currently, we first take a snapshot, and then WAL-log it.
There is a small window between the steps where backends can begin/end
transactions, and recovery has to deal with that. When this was
designed, there was long discussion on whether we should instead grab
WALInsertLock and ProcArrayLock at the same time, to ensure that the
running-xacts snapshot represents an up-to-date situation at the point
in WAL where it's inserted.

We didn't want to do that because both locks can be heavily contended.
But maybe we should after all. It would make the recovery code simpler.

If we want to get fancy, we wouldn't necessarily need to hold both locks
for the whole duration. We could first grab ProcArrayLock and construct
the snapshot. Then grab WALInsertLock and release ProcArrayLock, and
finally write the WAL record and release WALInsertLock. But that would
require small changes to XLogInsert.

I took a look at that approach. We don't actually need to hold
ProcArrayLock while the WAL-record is written, we need to hold
XidGenLock. I believe that's less severe than holding the ProcArrayLock
as there's already precedence for writing a WAL record while holding
that: we do that when we advance to a new clog page and write a
zero-clog-page record.

So this is what we should do IMHO.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

knownassignedxids-fix-with-locking-1.patchtext/x-diff; name=knownassignedxids-fix-with-locking-1.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 6095,6102 **** StartupXLOG(void)
  			StartupSUBTRANS(oldestActiveXID);
  			StartupMultiXact();
  
- 			ProcArrayInitRecoveryInfo(oldestActiveXID);
- 
  			/*
  			 * If we're beginning at a shutdown checkpoint, we know that
  			 * nothing was running on the master at this point. So fake-up an
--- 6095,6100 ----
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 435,453 **** ProcArrayClearTransaction(PGPROC *proc)
  }
  
  /*
-  * ProcArrayInitRecoveryInfo
-  *
-  * When trying to assemble our snapshot we only care about xids after this value.
-  * See comments for LogStandbySnapshot().
-  */
- void
- ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid)
- {
- 	latestObservedXid = oldestActiveXid;
- 	TransactionIdRetreat(latestObservedXid);
- }
- 
- /*
   * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
   *
   * Takes us through 3 states: Initialized, Pending and Ready.
--- 435,440 ----
***************
*** 519,533 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * OK, we need to initialise from the RunningTransactionsData record
! 	 */
! 
! 	/*
! 	 * Remove all xids except xids later than the snapshot. We don't know
! 	 * exactly which ones that is until precisely now, so that is why we allow
! 	 * xids to be added only to remove most of them again here.
  	 */
- 	ExpireOldKnownAssignedTransactionIds(running->nextXid);
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
--- 506,514 ----
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * Release any locks belonging to old transactions that are not
! 	 * running according to the running-xacts record.
  	 */
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
***************
*** 536,544 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
- 	 * Combine the running xact data with already known xids, if any exist.
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * combine them first, sort them and then re-add to KnownAssignedXids.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
--- 517,524 ----
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * sort them first.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
***************
*** 547,597 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
  
  	/*
! 	 * Allocate a temporary array so we can combine xids. The total of both
! 	 * arrays should never normally exceed TOTAL_MAX_CACHED_SUBXIDS.
  	 */
! 	xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS);
  
  	/*
! 	 * Get the remaining KnownAssignedXids. In most cases there won't be any
! 	 * at all since this exists only to catch a theoretical race condition.
! 	 */
! 	nxids = KnownAssignedXidsGet(xids, InvalidTransactionId);
! 	if (nxids > 0)
! 		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
! 
! 	/*
! 	 * Now we have a copy of any KnownAssignedXids we can zero the array
! 	 * before we re-insert combined snapshot.
! 	 */
! 	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
! 
! 	/*
! 	 * Add to the temp array any xids which have not already completed, taking
! 	 * care not to overflow in extreme cases.
  	 */
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
! 		 * The running-xacts snapshot can contain xids that were running at
! 		 * the time of the snapshot, yet complete before the snapshot was
! 		 * written to WAL. They're running now, so ignore them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
- 
- 		/*
- 		 * Test for overflow only after we have filtered out already complete
- 		 * transactions.
- 		 */
- 		if (nxids > TOTAL_MAX_CACHED_SUBXIDS)
- 			elog(ERROR, "too many xids to add into KnownAssignedXids");
  	}
  
  	if (nxids > 0)
--- 527,558 ----
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
+ 	Assert(procArray->numKnownAssignedXids == 0);
  
  	/*
! 	 * Allocate a temporary array to avoid modifying the array passed as
! 	 * argument.
  	 */
! 	xids = palloc(sizeof(TransactionId) * running->xcnt);
  
  	/*
! 	 * Add to the temp array any xids which have not already completed.
  	 */
+ 	nxids = 0;
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
! 		 * The running-xacts snapshot can contain xids that were still visible
! 		 * in the procarray when the snapshot was taken, but were already
! 		 * WAL-logged as completed. They're not running anymore, so ignore
! 		 * them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
  	}
  
  	if (nxids > 0)
***************
*** 603,621 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
- 		 * Re-initialise latestObservedXid to the highest xid we've seen.
- 		 */
- 		latestObservedXid = xids[nxids - 1];
- 
- 		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 		{
! 			TransactionId xid = xids[i];
! 
! 			KnownAssignedXidsAdd(xid, xid, true);
! 		}
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
--- 564,573 ----
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 			KnownAssignedXidsAdd(xids[i], xids[i], true);
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
***************
*** 623,630 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	pfree(xids);
  
  	/*
! 	 * Now we've got the running xids we need to set the global values thare
! 	 * used to track snapshots as they evolve further
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
--- 575,582 ----
  	pfree(xids);
  
  	/*
! 	 * Now we've got the running xids we need to set the global values that
! 	 * are used to track snapshots as they evolve further.
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
***************
*** 633,668 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 * but the recovery snapshot isn't fully valid yet because we know there
  	 * are some subxids missing. We don't know the specific subxids that are
  	 * missing, so conservatively assume the last one is latestObservedXid.
- 	 * If no missing subxids, try to clear lastOverflowedXid.
- 	 *
- 	 * If the snapshot didn't overflow it's still possible that an overflow
- 	 * occurred in the gap between taking snapshot and logging record, so we
- 	 * also need to check if lastOverflowedXid is already ahead of us.
  	 */
  	if (running->subxid_overflow)
  	{
  		standbyState = STANDBY_SNAPSHOT_PENDING;
  
  		standbySnapshotPendingXmin = latestObservedXid;
! 		if (TransactionIdFollows(latestObservedXid,
! 								 procArray->lastOverflowedXid))
! 			procArray->lastOverflowedXid = latestObservedXid;
! 	}
! 	else if (TransactionIdFollows(procArray->lastOverflowedXid,
! 								  latestObservedXid))
! 	{
! 		standbyState = STANDBY_SNAPSHOT_PENDING;
! 
! 		standbySnapshotPendingXmin = procArray->lastOverflowedXid;
  	}
  	else
  	{
  		standbyState = STANDBY_SNAPSHOT_READY;
  
  		standbySnapshotPendingXmin = InvalidTransactionId;
! 		if (TransactionIdFollows(running->oldestRunningXid,
! 								 procArray->lastOverflowedXid))
! 			procArray->lastOverflowedXid = InvalidTransactionId;
  	}
  
  	/*
--- 585,607 ----
  	 * but the recovery snapshot isn't fully valid yet because we know there
  	 * are some subxids missing. We don't know the specific subxids that are
  	 * missing, so conservatively assume the last one is latestObservedXid.
  	 */
+ 	latestObservedXid = running->nextXid;
+ 	TransactionIdRetreat(latestObservedXid);
+ 
  	if (running->subxid_overflow)
  	{
  		standbyState = STANDBY_SNAPSHOT_PENDING;
  
  		standbySnapshotPendingXmin = latestObservedXid;
! 		procArray->lastOverflowedXid = latestObservedXid;
  	}
  	else
  	{
  		standbyState = STANDBY_SNAPSHOT_READY;
  
  		standbySnapshotPendingXmin = InvalidTransactionId;
! 		procArray->lastOverflowedXid = InvalidTransactionId;
  	}
  
  	/*
***************
*** 1407,1412 **** GetSnapshotData(Snapshot snapshot)
--- 1346,1355 ----
   * Similar to GetSnapshotData but returns more information. We include
   * all PGPROCs with an assigned TransactionId, even VACUUM processes.
   *
+  * We acquire XidGenLock, but the caller is responsible for releasing it.
+  * This ensures that no new XIDs enter the proc array until the caller has
+  * WAL-logged this snapshot, and releases the lock.
+  *
   * The returned data structure is statically allocated; caller should not
   * modify it, and must not assume it is valid past the next call.
   *
***************
*** 1526,1532 **** GetRunningTransactionData(void)
  	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
  	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
  
! 	LWLockRelease(XidGenLock);
  	LWLockRelease(ProcArrayLock);
  
  	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
--- 1469,1475 ----
  	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
  	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
  
! 	/* We don't release XidGenLock here, the caller is responsible for that */
  	LWLockRelease(ProcArrayLock);
  
  	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
***************
*** 2337,2346 **** DisplayXidCache(void)
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * type apart from XLOG_RUNNING_XACTS (since that initialises the first
!  * snapshot so that RecordKnownAssignedTransactionIds() can be called). Must
!  * be called for each record after we have executed StartupCLOG() et al,
!  * since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
--- 2280,2287 ----
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * associated with a transaction. Must be called for each record after we
!  * have executed StartupCLOG() et al, since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
***************
*** 2348,2360 **** void
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
- 	Assert(TransactionIdIsValid(latestObservedXid));
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
--- 2289,2308 ----
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
+ 	 * If the KnownAssignedXids machinery isn't up yet, do nothing.
+ 	 */
+ 	if (standbyState == STANDBY_INITIALIZED)
+ 		return;
+ 
+ 	Assert(TransactionIdIsValid(latestObservedXid));
+ 
+ 	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 671,677 **** StandbyReleaseAllLocks(void)
  /*
   * StandbyReleaseOldLocks
   *		Release standby locks held by XIDs < removeXid, as long
!  *		as their not prepared transactions.
   */
  void
  StandbyReleaseOldLocks(TransactionId removeXid)
--- 671,677 ----
  /*
   * StandbyReleaseOldLocks
   *		Release standby locks held by XIDs < removeXid, as long
!  *		as they're not prepared transactions.
   */
  void
  StandbyReleaseOldLocks(TransactionId removeXid)
***************
*** 848,861 **** LogStandbySnapshot(TransactionId *oldestActiveXid, TransactionId *nextXid)
  	 * record we write, because standby will open up when it sees this.
  	 */
  	running = GetRunningTransactionData();
- 
- 	/*
- 	 * The gap between GetRunningTransactionData() and
- 	 * LogCurrentRunningXacts() is what most of the fuss is about here, so
- 	 * artifically extending this interval is a great way to test the little
- 	 * used parts of the code.
- 	 */
  	LogCurrentRunningXacts(running);
  
  	*oldestActiveXid = running->oldestRunningXid;
  	*nextXid = running->nextXid;
--- 848,856 ----
  	 * record we write, because standby will open up when it sees this.
  	 */
  	running = GetRunningTransactionData();
  	LogCurrentRunningXacts(running);
+ 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
+ 	LWLockRelease(XidGenLock);
  
  	*oldestActiveXid = running->oldestRunningXid;
  	*nextXid = running->nextXid;
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 28,34 **** extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayClearTransaction(PGPROC *proc);
  
- extern void ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid);
  extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
  extern void ProcArrayApplyXidAssignment(TransactionId topxid,
  							int nsubxids, TransactionId *subxids);
--- 28,33 ----
#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#11)
Re: Hot Standby: too many KnownAssignedXids

On Thu, 2010-12-02 at 10:39 +0200, Heikki Linnakangas wrote:

On 01.12.2010 20:51, Heikki Linnakangas wrote:

Another approach would be to revisit the way the running-xacts snapshot
is taken. Currently, we first take a snapshot, and then WAL-log it.
There is a small window between the steps where backends can begin/end
transactions, and recovery has to deal with that. When this was
designed, there was long discussion on whether we should instead grab
WALInsertLock and ProcArrayLock at the same time, to ensure that the
running-xacts snapshot represents an up-to-date situation at the point
in WAL where it's inserted.

We didn't want to do that because both locks can be heavily contended.
But maybe we should after all. It would make the recovery code simpler.

If we want to get fancy, we wouldn't necessarily need to hold both locks
for the whole duration. We could first grab ProcArrayLock and construct
the snapshot. Then grab WALInsertLock and release ProcArrayLock, and
finally write the WAL record and release WALInsertLock. But that would
require small changes to XLogInsert.

I took a look at that approach. We don't actually need to hold
ProcArrayLock while the WAL-record is written, we need to hold
XidGenLock. I believe that's less severe than holding the ProcArrayLock
as there's already precedence for writing a WAL record while holding
that: we do that when we advance to a new clog page and write a
zero-clog-page record.

So this is what we should do IMHO.

Oh, thanks for looking at this. I've been looking at this also and as we
might expect had a slightly different design.

First, your assessment of the locking above is better than mine. I agree
with your analysis so we should do it that way. The locking issue was
the reason I haven't patched this yet so I'm glad you've improved this.

In terms of the rest of the patch, it seems we have different designs, I
think I have a much simpler, less invasive solution:

The cause of the issue is that replay starts at one LSN and there is a
delay until the RunningXacts WAL record occurs. If there was no delay,
there would be no issue at all. In CreateCheckpoint() we start by
grabbing the WAInsertLock and later recording that pointer as part of
the checkpoint record. My proposal is to replace the "grab the lock"
code with the insert of the RunningXacts WAL record (when wal_level
set), so that recovery always starts with that record type.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Hot Standby: too many KnownAssignedXids

On 02.12.2010 11:02, Simon Riggs wrote:

The cause of the issue is that replay starts at one LSN and there is a
delay until the RunningXacts WAL record occurs. If there was no delay,
there would be no issue at all. In CreateCheckpoint() we start by
grabbing the WAInsertLock and later recording that pointer as part of
the checkpoint record. My proposal is to replace the "grab the lock"
code with the insert of the RunningXacts WAL record (when wal_level
set), so that recovery always starts with that record type.

Oh, interesting idea. But AFAICS closing the gap between acquiring the
running-xacts snapshot and writing it to the log is sufficient, I don't
see what moving the running-xacts record buys us. Does it allow some
further simplifications somewhere?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Hot Standby: too many KnownAssignedXids

On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote:

On 02.12.2010 11:02, Simon Riggs wrote:

The cause of the issue is that replay starts at one LSN and there is a
delay until the RunningXacts WAL record occurs. If there was no delay,
there would be no issue at all. In CreateCheckpoint() we start by
grabbing the WAInsertLock and later recording that pointer as part of
the checkpoint record. My proposal is to replace the "grab the lock"
code with the insert of the RunningXacts WAL record (when wal_level
set), so that recovery always starts with that record type.

Oh, interesting idea. But AFAICS closing the gap between acquiring the
running-xacts snapshot and writing it to the log is sufficient, I don't
see what moving the running-xacts record buys us. Does it allow some
further simplifications somewhere?

Your patch is quite long and you do a lot more than just alter the
locking. I don't think we need those changes at all and especially would
not wish to backpatch that.

Earlier on this thread, we discussed:

On Wed, 2010-11-24 at 15:19 +0000, Simon Riggs wrote:

On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote:

When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedXids() to make note that all xids between that
range are in-progress, but there isn't enough room in the array for
that.

Agreed.

The current code fails because of the gap between the redo pointer and
the XLOG_RUNNING_XACTS WAL record. If there is no gap, there is no
problem.

So my preferred solution would:
* Log XLOG_RUNNING_XACTS while holding XidGenLock, as you suggest
* Move logging to occur at the Redo pointer

That is a much smaller patch with a smaller footprint.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#14)
Re: Hot Standby: too many KnownAssignedXids

On 02.12.2010 13:25, Simon Riggs wrote:

On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote:

On 02.12.2010 11:02, Simon Riggs wrote:

The cause of the issue is that replay starts at one LSN and there is a
delay until the RunningXacts WAL record occurs. If there was no delay,
there would be no issue at all. In CreateCheckpoint() we start by
grabbing the WAInsertLock and later recording that pointer as part of
the checkpoint record. My proposal is to replace the "grab the lock"
code with the insert of the RunningXacts WAL record (when wal_level
set), so that recovery always starts with that record type.

Oh, interesting idea. But AFAICS closing the gap between acquiring the
running-xacts snapshot and writing it to the log is sufficient, I don't
see what moving the running-xacts record buys us. Does it allow some
further simplifications somewhere?

Your patch is quite long and you do a lot more than just alter the
locking. I don't think we need those changes at all and especially would
not wish to backpatch that.

Most of the changes to procarray.c were about removing code that's no
longer necessary when we close the gap between acquiring and writing the
running-xacts WAL record. You can leave it as it is as a historical
curiosity, but I'd prefer to simplify it, given that we now know that it
doesn't actually work correctly if the gap is not closed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#15)
Re: Hot Standby: too many KnownAssignedXids

On 02.12.2010 12:31, Heikki Linnakangas wrote:

On 02.12.2010 13:25, Simon Riggs wrote:

On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote:

On 02.12.2010 11:02, Simon Riggs wrote:

The cause of the issue is that replay starts at one LSN and there is a
delay until the RunningXacts WAL record occurs. If there was no delay,
there would be no issue at all. In CreateCheckpoint() we start by
grabbing the WAInsertLock and later recording that pointer as part of
the checkpoint record. My proposal is to replace the "grab the lock"
code with the insert of the RunningXacts WAL record (when wal_level
set), so that recovery always starts with that record type.

Oh, interesting idea. But AFAICS closing the gap between acquiring the
running-xacts snapshot and writing it to the log is sufficient, I don't
see what moving the running-xacts record buys us. Does it allow some
further simplifications somewhere?

Your patch is quite long and you do a lot more than just alter the
locking. I don't think we need those changes at all and especially would
not wish to backpatch that.

Most of the changes to procarray.c were about removing code that's no
longer necessary when we close the gap between acquiring and writing the
running-xacts WAL record. You can leave it as it is as a historical
curiosity, but I'd prefer to simplify it, given that we now know that it
doesn't actually work correctly if the gap is not closed.

Ok, I've committed this patch now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#16)
Re: Hot Standby: too many KnownAssignedXids

On Tue, Dec 7, 2010 at 3:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ok, I've committed this patch now.

I can confirm that I could continue replaying the logfiles on the
standby host with this patch.

Thanks a lot,
Joachim