SSI bug?
hi,
it seems that PredicateLockTupleRowVersionLink sometimes creates
a loop of targets (it finds an existing 'newtarget' whose nextVersionOfRow
chain points to the 'oldtarget'), which later causes
CheckTargetForConflictsIn to loop forever.
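(for illustration only, here is a minimal, self-contained C sketch, not the
actual PostgreSQL code: the Target struct and names are simplified stand-ins.
it shows why a cycle in the row-version chain makes a walk like the one in
CheckTargetForConflictsIn spin forever, and how a fast/slow pointer check can
detect it.)

#include <stdbool.h>
#include <stdio.h>

/* simplified stand-in for PREDICATELOCKTARGET: only the row-version link */
typedef struct Target
{
	struct Target *priorVersionOfRow;
} Target;

/*
 * Walk the prior-version chain the way the conflict check does.  With an
 * acyclic chain the walk reaches NULL; if two targets end up pointing at
 * each other, it never terminates.  The fast/slow pointers report the
 * cycle instead of hanging.
 */
static bool
chain_has_cycle(const Target *start)
{
	const Target *slow = start;
	const Target *fast = start;

	while (fast != NULL && fast->priorVersionOfRow != NULL)
	{
		slow = slow->priorVersionOfRow;
		fast = fast->priorVersionOfRow->priorVersionOfRow;
		if (slow == fast)
			return true;		/* the plain walk would loop forever */
	}
	return false;
}

int
main(void)
{
	Target		a = {NULL};
	Target		b = {NULL};

	/* reproduce the reported situation: two targets linked to each other */
	a.priorVersionOfRow = &b;
	b.priorVersionOfRow = &a;

	printf("cycle detected: %s\n", chain_has_cycle(&a) ? "yes" : "no");
	return 0;
}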
YAMAMOTO Takashi
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
it seems that PredicateLockTupleRowVersionLink sometimes creates
a loop of targets (it finds an existing 'newtarget' whose
nextVersionOfRow chain points to the 'oldtarget'), which later
causes CheckTargetForConflictsIn to loop forever.
Is this a hypothetical risk based on looking at the code, or have
you seen this actually happen? Either way, could you provide more
details? (A reproducible test case would be ideal.)
This being the newest part of the code, I'll grant that it is the
most likely to have an unidentified bug; but given that the pointers
are from one predicate lock target structure identified by a tuple
ID to one identified by the tuple ID of the next version of the row,
it isn't obvious to me how a cycle could develop.
-Kevin
hi,
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
it seems that PredicateLockTupleRowVersionLink sometimes creates
a loop of targets (it finds an existing 'newtarget' whose
nextVersionOfRow chain points to the 'oldtarget'), which later
causes CheckTargetForConflictsIn to loop forever.

Is this a hypothetical risk based on looking at the code, or have
you seen this actually happen? Either way, could you provide more
details? (A reproducible test case would be ideal.)
i have seen this actually happen. i've confirmed the creation of the loop
with the attached patch. it's easily reproducible with my application.
i can provide the full source code of my application if you want.
(but it isn't easy to run unless you are familiar with the recent
version of NetBSD)
i haven't found a smaller reproducible test case yet.
YAMAMOTO Takashi
This being the newest part of the code, I'll grant that it is the
most likely to have an unidentified bug; but given that the pointers
are from one predicate lock target structure identified by a tuple
ID to one identified by the tuple ID of the next version of the row,
it isn't obvious to me how a cycle could develop.

-Kevin
Attachment: a.diff (text/plain; charset=us-ascii)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 722d0f8..3e1a3e2 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2350,7 +2350,25 @@ PredicateLockTupleRowVersionLink(const Relation relation,
newtarget->nextVersionOfRow = NULL;
}
else
+ {
Assert(newtarget->priorVersionOfRow == NULL);
+#if 0
+ Assert(newtarget->nextVersionOfRow == NULL);
+#endif
+ if (newtarget->nextVersionOfRow != NULL) {
+ PREDICATELOCKTARGET *t;
+
+ elog(WARNING, "new %p has next %p\n",
+ newtarget, newtarget->nextVersionOfRow);
+ for (t = newtarget->nextVersionOfRow; t != NULL;
+ t = t->nextVersionOfRow) {
+ if (oldtarget != t) {
+ elog(WARNING, "creating a loop new=%p old=%p\n",
+ newtarget, oldtarget);
+ }
+ }
+ }
+ }
newtarget->priorVersionOfRow = oldtarget;
oldtarget->nextVersionOfRow = newtarget;
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
i have seen this actually happen. i've confirmed the creation of
the loop with the attached patch. it's easily reproducible with
my application. i can provide the full source code of my
application if you want. (but it isn't easy to run unless you are
familiar with the recent version of NetBSD)
i haven't found a smaller reproducible test case yet.
I've never used NetBSD, so maybe a few details will help point me in
the right direction faster than the source code.
Has your application ever triggered any of the assertions in the
code? (In particular, it would be interesting if it ever hit the
one right above where you patched.)
How long was the loop?
Did you notice whether the loop involved multiple tuples within a
single page?
Did this coincide with an autovacuum of the table?
These last two are of interest because it seems likely that such a
cycle might be related to this new code not properly allowing for
some aspect of tuple cleanup.
Thanks for finding this and reporting it, and thanks in advance for
any further detail you can provide.
-Kevin
I wrote:
it seems likely that such a cycle might be related to this new
code not properly allowing for some aspect of tuple cleanup.
I found a couple places where cleanup could let these fall through
the cracks long enough to get stale and still be around when a tuple
ID is re-used, causing problems. Please try the attached patch and
see if it fixes the problem for you.
If it does, then there's no need to try to track the other things I
was asking about.
Thanks!
-Kevin
Attachment: ssi-cleanup-fix.patch (text/plain)
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 2329,2334 **** PredicateLockTupleRowVersionLink(const Relation relation,
--- 2329,2336 ----
if (next != NULL)
{
next->priorVersionOfRow = NULL;
+ if (SHMQueueEmpty(&next->predicateLocks))
+ PredXact->NeedTargetLinkCleanup = true;
oldtarget->nextVersionOfRow = NULL;
}
***************
*** 3128,3133 **** ClearOldPredicateLocks(void)
--- 3130,3136 ----
int i;
HASH_SEQ_STATUS seqstat;
PREDICATELOCKTARGET *locktarget;
+ PREDICATELOCKTARGET *next;
LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
finishedSxact = (SERIALIZABLEXACT *)
***************
*** 3237,3256 **** ClearOldPredicateLocks(void)
LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
! hash_seq_init(&seqstat, PredicateLockTargetHash);
! while ((locktarget = (PREDICATELOCKTARGET *) hash_seq_search(&seqstat)))
{
! if (SHMQueueEmpty(&locktarget->predicateLocks)
! && locktarget->priorVersionOfRow == NULL
! && locktarget->nextVersionOfRow == NULL)
{
! hash_search(PredicateLockTargetHash, &locktarget->tag,
! HASH_REMOVE, NULL);
}
}
- PredXact->NeedTargetLinkCleanup = false;
-
LWLockRelease(PredicateLockNextRowLinkLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
LWLockRelease(FirstPredicateLockMgrLock + i);
--- 3240,3267 ----
LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
! while (PredXact->NeedTargetLinkCleanup)
{
! PredXact->NeedTargetLinkCleanup = false;
! hash_seq_init(&seqstat, PredicateLockTargetHash);
! while ((locktarget = (PREDICATELOCKTARGET *) hash_seq_search(&seqstat)))
{
! if (SHMQueueEmpty(&locktarget->predicateLocks)
! && locktarget->priorVersionOfRow == NULL)
! {
! next = locktarget->nextVersionOfRow;
! if (next != NULL)
! {
! next->priorVersionOfRow = NULL;
! if (SHMQueueEmpty(&next->predicateLocks))
! PredXact->NeedTargetLinkCleanup = true;
! }
! hash_search(PredicateLockTargetHash, &locktarget->tag,
! HASH_REMOVE, NULL);
! }
}
}
LWLockRelease(PredicateLockNextRowLinkLock);
for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
LWLockRelease(FirstPredicateLockMgrLock + i);
hi,
I wrote:
it seems likely that such a cycle might be related to this new
code not properly allowing for some aspect of tuple cleanup.

I found a couple places where cleanup could let these fall through
the cracks long enough to get stale and still be around when a tuple
ID is re-used, causing problems. Please try the attached patch and
see if it fixes the problem for you.

If it does, then there's no need to try to track the other things I
was asking about.
thanks. unfortunately the problem still happens with the patch.
YAMAMOTO Takashi
Thanks!
-Kevin
hi,
all of the following answers are with the patch you provided in
other mail applied.
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
i have seen this actually happen. i've confirmed the creation of
the loop with the attached patch. it's easily reproducible with
my application. i can provide the full source code of my
application if you want. (but it isn't easy to run unless you are
familiar with the recent version of NetBSD)
i haven't found a smaller reproducible test case yet.

I've never used NetBSD, so maybe a few details will help point me in
the right direction faster than the source code.

Has your application ever triggered any of the assertions in the
code? (In particular, it would be interesting if it ever hit the
one right above where you patched.)
the assertion right above is sometimes triggered. sometimes not.
How long was the loop?
see below.
Did you notice whether the loop involved multiple tuples within a
single page?
if i understand correctly, yes.
the following is a snippet of my debug code (dump targets when
CheckTargetForConflictsIn loops >1000 times) and its output.
the same locktag_field3 value means the same page, right?
+ for (t = target, i = 0; t != NULL; i++) {
+ elog(WARNING, "[%u] target %p tag %" PRIx32 ":%" PRIx32 ":%" PRIx32
+ ":%" PRIx16 ":%" PRIx16 " prior %p next %p", i, t,
+ t->tag.locktag_field1,
+ t->tag.locktag_field2,
+ t->tag.locktag_field3,
+ t->tag.locktag_field4,
+ t->tag.locktag_field5,
+ t->priorVersionOfRow,
+ t->nextVersionOfRow);
+ t = t->priorVersionOfRow;
+ if (t == target) {
+ elog(WARNING, "found a loop");
+ break;
+ }
+ }
WARNING: [0] target 0xbb51f238 tag 4000:4017:53b:6c:0 prior 0xbb51f350 next 0xbb51f350
WARNING: [1] target 0xbb51f350 tag 4000:4017:53b:69:0 prior 0xbb51f238 next 0xbb51f238
WARNING: found a loop
another sample:
WARNING: [0] target 0xbb51f530 tag 4000:4017:565:ae:0 prior 0xbb51f1e8 next 0xbb51f300
WARNING: [1] target 0xbb51f1e8 tag 4000:4017:565:ad:0 prior 0xbb51f580 next 0xbb51f530
WARNING: [2] target 0xbb51f580 tag 4000:4017:565:ac:0 prior 0xbb51f300 next 0xbb51f1e8
WARNING: [3] target 0xbb51f300 tag 4000:4017:565:ab:0 prior 0xbb51f530 next 0xbb51f580
WARNING: found a loop
the table seems mostly hot-updated, if it matters.
hoge=# select * from pg_stat_user_tables where relid=16407;
-[ RECORD 1 ]-----+--------------------
relid | 16407
schemaname | pgfs
relname | file
seq_scan | 0
seq_tup_read | 0
idx_scan | 53681
idx_tup_fetch | 52253
n_tup_ins | 569
n_tup_upd | 12054
n_tup_del | 476
n_tup_hot_upd | 12041
n_live_tup | 93
n_dead_tup | 559
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 4922528128875102208
autoanalyze_count | 7598807461784802080
(values in the last two columns seem bogus.
i don't know if it's related or not.)
Did this coincide with an autovacuum of the table?
no.
(assuming that autovacuum=off in postgresql.conf is enough to exclude
the possibility.)
These last two are of interest because it seems likely that such a
cycle might be related to this new code not properly allowing for
some aspect of tuple cleanup.

Thanks for finding this and reporting it, and thanks in advance for
any further detail you can provide.
thanks for looking.
YAMAMOTO Takashi
-Kevin
Looking at the prior/next version chaining, aside from the looping
issue, isn't it broken by lock promotion too? There's a check in
RemoveTargetIfNoLongerUsed() so that we don't release a lock target if
its priorVersionOfRow is set, but what if the tuple lock is promoted to
a page level lock first, and PredicateLockTupleRowVersionLink() is
called only after that? Or can that not happen because of something else
that I'm missing?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Looking at the prior/next version chaining, aside from the
looping issue, isn't it broken by lock promotion too? There's a
check in RemoveTargetIfNoLongerUsed() so that we don't release a
lock target if its priorVersionOfRow is set, but what if the tuple
lock is promoted to a page level lock first, and
PredicateLockTupleRowVersionLink() is called only after that? Or
can that not happen because of something else that I'm missing?
I had to ponder that a while. Here's my thinking.
Predicate locks only matter when there is a write. Predicate locks
on heap tuples only matter when there is an UPDATE or DELETE of a
locked tuple. The problem these links are addressing is that an
intervening transaction might UPDATE the tuple between the
read of the tuple and a later UPDATE or DELETE. We want the second
UPDATE to see that it conflicts with a read from before the first
UPDATE. The first UPDATE creates the link from the "before" tuple
ID the "after" tuple ID at the target level. What predicate locks
exist on the second target are irrelevant when it comes to seeing
the conflict between the second UPDATE (or DELETE) and the initial
read. So I don't see where granularity promotion for locks on the
second target is a problem as long as the target itself doesn't get
deleted because of the link to the prior version of the tuple.
Promotion of the lock granularity on the prior tuple is where we
have problems. If the two tuple versions are in separate pages then
the second UPDATE could miss the conflict. My first thought was to
fix that by requiring promotion of a predicate lock on a tuple to
jump straight to the relation level if nextVersionOfRow is set for
the lock target and it points to a tuple in a different page. But
that doesn't cover a situation where we have a heap tuple predicate
lock which gets promoted to page granularity before the tuple is
updated. To handle that we would need to say that an UPDATE to a
tuple on a page which is predicate locked by the transaction would
need to be promoted to relation granularity if the new version of
the tuple wasn't on the same page as the old version.
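Just to make the rule concrete, here is a rough standalone sketch
(hypothetical names and types, not the actual predicate.c functions) of
the granularity decision described above:

typedef unsigned int BlockNumber;

typedef enum PromotionTarget
{
	PROMOTE_TO_PAGE,
	PROMOTE_TO_RELATION
} PromotionTarget;

/*
 * Hypothetical decision helper: when a tuple covered by a predicate lock
 * (or by a lock on its page) is updated, decide how far the lock must be
 * promoted.  If the new version of the row lands on a different page than
 * the old one, neither a tuple nor a page lock on the old location can
 * cover a later write to the new version, so jump straight to relation
 * granularity; otherwise page granularity still suffices.
 */
static PromotionTarget
choose_promotion(BlockNumber oldPage, BlockNumber newPage)
{
	if (newPage != oldPage)
		return PROMOTE_TO_RELATION;
	return PROMOTE_TO_PAGE;
}

int
main(void)
{
	/* new version written to another page: must promote to relation level */
	return choose_promotion(10, 42) == PROMOTE_TO_RELATION ? 0 : 1;
}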
That's all doable without too much trouble, but more than I'm
likely to get done today. It would be good if someone can confirm
my thinking on this first, too.
That said, the above is about eliminating false negatives from some
corner cases which escaped notice until now. I don't think the
changes described above will do anything to prevent the problems
reported by YAMAMOTO Takashi. Unless I'm missing something, it
sounds like tuple IDs are being changed or reused while predicate
locks are held on the tuples. That's probably not going to be
overwhelmingly hard to fix if we can identify how that can happen.
I tried to cover HOT issues, but it seems likely I missed something.
:-( I will be looking at it.
-Kevin
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
Did you notice whether the loop involved multiple tuples within a
single page?

if i understand correctly, yes.
the following is a snippet of my debug code (dump targets when
CheckTargetForConflictsIn loops >1000 times) and its output.
the same locktag_field3 value means the same page, right?
Right.
the table seems mostly hot-updated, if it matters.
idx_scan | 53681
idx_tup_fetch | 52253
n_tup_ins | 569
n_tup_upd | 12054
n_tup_del | 476
n_tup_hot_upd | 12041
n_live_tup | 93
n_dead_tup | 559
That probably matters a lot.
analyze_count | 4922528128875102208
autoanalyze_count | 7598807461784802080

(values in the last two columns seem bogus.
i don't know if it's related or not.)
That seems unlikely to be related to this problem. It sure does
look odd, though. Maybe post that in a separate thread?
Thanks for all the additional info. I'll keep digging.
-Kevin
On 14.02.2011 20:10, Kevin Grittner wrote:
Promotion of the lock granularity on the prior tuple is where we
have problems. If the two tuple versions are in separate pages then
the second UPDATE could miss the conflict. My first thought was to
fix that by requiring promotion of a predicate lock on a tuple to
jump straight to the relation level if nextVersionOfRow is set for
the lock target and it points to a tuple in a different page. But
that doesn't cover a situation where we have a heap tuple predicate
lock which gets promoted to page granularity before the tuple is
updated. To handle that we would need to say that an UPDATE to a
tuple on a page which is predicate locked by the transaction would
need to be promoted to relation granularity if the new version of
the tuple wasn't on the same page as the old version.
Yeah, promoting the original lock on the UPDATE was my first thought too.
Another idea is to duplicate the original predicate lock on the first
update, so that the original reader holds a lock on both row versions. I
think that would ultimately be simpler as we wouldn't need the
next-prior chains anymore.
For example, suppose that transaction X is holding a predicate lock on
tuple A. Transaction Y updates tuple A, creating a new tuple B.
Transaction Y sees that X holds a lock on tuple A (or the page
containing A), so it acquires a new predicate lock on tuple B on behalf
of X.
If the updater aborts, the lock on the new tuple needs to be cleaned up,
so that it doesn't get confused with a later tuple that's stored in the
same physical location. We could store the xmin of the tuple in the
predicate lock to check for that. Whenever you check for conflict, if
the xmin of the lock doesn't match the xmin on the tuple, you know that
the lock belonged to an old dead tuple stored in the same location, and
can be simply removed as the tuple doesn't exist anymore.
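A minimal sketch of that check, with hypothetical names (this is not the
actual predicate lock structure, just the shape of the idea): the lock
remembers the xmin of the tuple it was taken on, and a mismatch at
conflict-check time marks it as stale.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/* hypothetical lock record that remembers the xmin of the locked tuple */
typedef struct TupleLock
{
	TransactionId lockedXmin;
} TupleLock;

/*
 * If the xmin stored in the lock no longer matches the xmin of the tuple
 * now occupying that physical location, the original tuple is dead and
 * its slot has been reused; the lock is stale and can be dropped instead
 * of being reported as a conflict.
 */
static bool
lock_is_stale(const TupleLock *lock, TransactionId currentXmin)
{
	return lock->lockedXmin != currentXmin;
}

int
main(void)
{
	TupleLock	lock = {1000};	/* lock taken on a tuple with xmin 1000 */

	printf("same tuple: stale=%d\n", lock_is_stale(&lock, 1000));
	printf("slot reused: stale=%d\n", lock_is_stale(&lock, 1234));
	return 0;
}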
That said, the above is about eliminating false negatives from some
corner cases which escaped notice until now. I don't think the
changes described above will do anything to prevent the problems
reported by YAMAMOTO Takashi.
Agreed, it's a separate issue. Although if we change the way we handle
the read-update-update problem, the other issue might go away too.
Unless I'm missing something, it
sounds like tuple IDs are being changed or reused while predicate
locks are held on the tuples. That's probably not going to be
overwhelmingly hard to fix if we can identify how that can happen.
I tried to cover HOT issues, but it seems likely I missed something.
Storing the xmin of the original tuple would probably help with that
too. But it would be nice to understand and be able to reproduce the
issue first.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
hi,
might be unrelated to the loop problem, but...
i got the following SEGV when running vacuum on a table.
(the line numbers in predicate.c are different as i have local modifications.)
oldlocktag.myTarget was NULL.
it seems that TransferPredicateLocksToNewTarget sometimes uses stack garbage
for newpredlocktag.myTarget. vacuum on the table succeeded with the attached
patch. the latter part of the patch was necessary to avoid targetList
corruption, which later seems to make DeleteChildTargetLocks loop infinitely.
YAMAMOTO Takashi
#0 0x0823cf6c in PredicateLockAcquire (targettag=0xbfbfa734)
at predicate.c:1835
#1 0x0823f18a in PredicateLockPage (relation=0x99b4dcf0, blkno=1259)
at predicate.c:2206
#2 0x080ac978 in _bt_search (rel=0x99b4dcf0, keysz=2, scankey=0x99a05040,
nextkey=0 '\0', bufP=0xbfbfa894, access=1) at nbtsearch.c:97
#3 0x080a996d in _bt_pagedel (rel=0x99b4dcf0, buf=<value optimized out>,
stack=0x0) at nbtpage.c:1059
#4 0x080aacc2 in btvacuumscan (info=0xbfbfbcc4, stats=0x99a01328,
callback=0x8184d50 <lazy_tid_reaped>, callback_state=0x99a012e0,
cycleid=13675) at nbtree.c:981
#5 0x080ab15c in btbulkdelete (fcinfo=0xbfbfb9e0) at nbtree.c:573
#6 0x082fde74 in FunctionCall4 (flinfo=0x99b86958, arg1=3217013956, arg2=0,
arg3=135810384, arg4=2577404640) at fmgr.c:1437
#7 0x080a4fd0 in index_bulk_delete (info=0xbfbfbcc4, stats=0x0,
callback=0x8184d50 <lazy_tid_reaped>, callback_state=0x99a012e0)
at indexam.c:738
#8 0x08184cd4 in lazy_vacuum_index (indrel=0x99b4dcf0, stats=0x99a023e0,
vacrelstats=0x99a012e0) at vacuumlazy.c:938
#9 0x081854b6 in lazy_vacuum_rel (onerel=0x99b47650, vacstmt=0x99b059d0,
bstrategy=0x99a07018, scanned_all=0xbfbfcfd8 "") at vacuumlazy.c:762
#10 0x08184265 in vacuum_rel (relid=16424, vacstmt=0x99b059d0,
do_toast=1 '\001', for_wraparound=0 '\0', scanned_all=0xbfbfcfd8 "")
at vacuum.c:978
#11 0x081845ea in vacuum (vacstmt=0x99b059d0, relid=0, do_toast=1 '\001',
bstrategy=0x0, for_wraparound=0 '\0', isTopLevel=1 '\001') at vacuum.c:230
#12 0xbbab50c3 in pgss_ProcessUtility (parsetree=0x99b059d0,
queryString=0x99b05018 "vacuum (verbose,analyze) pgfs.dirent;",
params=0x0, isTopLevel=1 '\001', dest=0x99b05b80,
completionTag=0xbfbfd21a "") at pg_stat_statements.c:603
#13 0x082499ea in PortalRunUtility (portal=0x99b33018, utilityStmt=0x99b059d0,
isTopLevel=1 '\001', dest=0x99b05b80, completionTag=0xbfbfd21a "")
at pquery.c:1191
#14 0x0824a79e in PortalRunMulti (portal=0x99b33018, isTopLevel=4 '\004',
dest=0x99b05b80, altdest=0x99b05b80, completionTag=0xbfbfd21a "")
at pquery.c:1298
#15 0x0824b21a in PortalRun (portal=0x99b33018, count=2147483647,
isTopLevel=1 '\001', dest=0x99b05b80, altdest=0x99b05b80,
completionTag=0xbfbfd21a "") at pquery.c:822
#16 0x08247dc7 in exec_simple_query (
query_string=0x99b05018 "vacuum (verbose,analyze) pgfs.dirent;")
at postgres.c:1059
#17 0x08248a79 in PostgresMain (argc=2, argv=0xbb912650,
username=0xbb9125c0 "takashi") at postgres.c:3943
#18 0x0820e231 in ServerLoop () at postmaster.c:3590
#19 0x0820ef88 in PostmasterMain (argc=3, argv=0xbfbfe59c) at postmaster.c:1110
#20 0x081b6439 in main (argc=3, argv=0xbfbfe59c) at main.c:199
(gdb) list
1830 offsetof(PREDICATELOCK, xactLink));
1831
1832 oldlocktag = predlock->tag;
1833 Assert(oldlocktag.myXact == sxact);
1834 oldtarget = oldlocktag.myTarget;
1835 oldtargettag = oldtarget->tag;
1836
1837 if (TargetTagIsCoveredBy(oldtargettag, *newtargettag))
1838 {
1839 uint32 oldtargettaghash;
(gdb)
Attachment: a.diff (text/plain; charset=us-ascii)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 722d0f8..4dde6ae 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2537,8 +2558,8 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
if (!found)
{
SHMQueueInit(&(newtarget->predicateLocks));
- newpredlocktag.myTarget = newtarget;
}
+ newpredlocktag.myTarget = newtarget;
oldpredlock = (PREDICATELOCK *)
SHMQueueNext(&(oldtarget->predicateLocks),
@@ -2588,10 +2609,12 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
outOfShmem = true;
goto exit;
}
- SHMQueueInsertBefore(&(newtarget->predicateLocks),
- &(newpredlock->targetLink));
- SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
- &(newpredlock->xactLink));
+ if (!found) {
+ SHMQueueInsertBefore(&(newtarget->predicateLocks),
+ &(newpredlock->targetLink));
+ SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
+ &(newpredlock->xactLink));
+ }
oldpredlock = nextpredlock;
}
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
might be unrelated to the loop problem, but...
i got the following SEGV when runnning vacuum on a table.
vacuum on the table succeeded with the attached patch.
Thanks! I appreciate the heavy testing and excellent diagnostics.
On the face of it, this doesn't look related to the other problem,
but I'll post again soon after closer review.
-Kevin
YAMAMOTO Takashi wrote:
might be unrelated to the loop problem, but...
Aha! I think it *is* related. There were several places where data
was uninitialized here; mostly because Dan was working on this piece
while I was working on separate issues which added the new fields.
I missed the interaction on integrating the two efforts. :-(
The uninitialized fields could lead to all the symptoms you saw.
I've reviewed, looking for other similar issues and didn't find any.
Could you try the attached patch and see if this fixes the issues
you've seen?
-Kevin
Attachment: ssi-uninit.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 2535,2543 **** TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
if (!found)
{
SHMQueueInit(&(newtarget->predicateLocks));
! newpredlocktag.myTarget = newtarget;
}
oldpredlock = (PREDICATELOCK *)
SHMQueueNext(&(oldtarget->predicateLocks),
&(oldtarget->predicateLocks),
--- 2535,2546 ----
if (!found)
{
SHMQueueInit(&(newtarget->predicateLocks));
! newtarget->priorVersionOfRow = NULL;
! newtarget->nextVersionOfRow = NULL;
}
+ newpredlocktag.myTarget = newtarget;
+
oldpredlock = (PREDICATELOCK *)
SHMQueueNext(&(oldtarget->predicateLocks),
&(oldtarget->predicateLocks),
***************
*** 2586,2595 **** TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
outOfShmem = true;
goto exit;
}
! SHMQueueInsertBefore(&(newtarget->predicateLocks),
! &(newpredlock->targetLink));
! SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
! &(newpredlock->xactLink));
oldpredlock = nextpredlock;
}
--- 2589,2602 ----
outOfShmem = true;
goto exit;
}
! if (!found)
! {
! SHMQueueInsertBefore(&(newtarget->predicateLocks),
! &(newpredlock->targetLink));
! SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
! &(newpredlock->xactLink));
! newpredlock->commitSeqNo = InvalidSerCommitSeqNo;
! }
oldpredlock = nextpredlock;
}
hi,
YAMAMOTO Takashi wrote:
might be unrelated to the loop problem, but...
Aha! I think it *is* related. There were several places where data
was uninitialized here; mostly because Dan was working on this piece
while I was working on separate issues which added the new fields.
I missed the interaction on integrating the two efforts. :-(
The uninitialized fields could lead to all the symptoms you saw.
I've reviewed, looking for other similar issues and didn't find any.

Could you try the attached patch and see if this fixes the issues
you've seen?
with your previous patch or not?
YAMAMOTO Takashi
-Kevin
YAMAMOTO Takashi wrote:
with your previous patch or not?
With, thanks.
-Kevin
hi,
YAMAMOTO Takashi wrote:
with your previous patch or not?
With, thanks.
i tried. unfortunately i can still reproduce the original loop problem.
WARNING: [0] target 0xbb51ef18 tag 4000:4017:7e3:78:0 prior 0xbb51f148 next 0xbb51edb0
WARNING: [1] target 0xbb51f148 tag 4000:4017:7e3:77:0 prior 0xbb51ef90 next 0xbb51ef18
WARNING: [2] target 0xbb51ef90 tag 4000:4017:7e3:74:0 prior 0xbb51edb0 next 0xbb51f148
WARNING: [3] target 0xbb51edb0 tag 4000:4017:7e3:71:0 prior 0xbb51ef18 next 0xbb51ef90
WARNING: found a loop
YAMAMOTO Takashi
-Kevin
On Wed, Feb 16, 2011 at 10:13:35PM +0000, YAMAMOTO Takashi wrote:
i got the following SEGV when running vacuum on a table.
(the line numbers in predicate.c are different as i have local modifications.)
oldlocktag.myTarget was NULL.
it seems that TransferPredicateLocksToNewTarget sometimes uses stack garbage
for newpredlocktag.myTarget. vacuum on the table succeeded with the attached
patch. the latter part of the patch was necessary to avoid targetList
corruption, which later seems to make DeleteChildTargetLocks loop infinitely.
Oops. Those are both definitely bugs (and my fault). Your patch looks
correct. Thanks for catching that!
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote:
Oops. Those are both definitely bugs (and my fault). Your patch
looks correct. Thanks for catching that!
Could a committer please apply the slightly modified version here?:
http://archives.postgresql.org/message-id/4D5C46BB020000250003AB40@gw.wicourts.gov
It is a pretty straightforward bug fix to initialize some currently
uninitialized data which is causing occasional but severe problems,
especially during vacuum.
I'm still working on the other issues raised by YAMAMOTO Takashi and
Heikki. I expect to have a solution for those issues this weekend,
but this bug fix is needed regardless of how those issues are
settled.
-Kevin
On Thu, Feb 17, 2011 at 23:11, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Dan Ports <drkp@csail.mit.edu> wrote:
Oops. Those are both definitely bugs (and my fault). Your patch
looks correct. Thanks for catching that!

Could a committer please apply the slightly modified version here?:
http://archives.postgresql.org/message-id/4D5C46BB020000250003AB40@gw.wicourts.gov
It is a pretty straightforward bug fix to initialize some currently
uninitialized data which is causing occasional but severe problems,
especially during vacuum.
Done, thanks.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Heikki Linnakangas wrote:
On 14.02.2011 20:10, Kevin Grittner wrote:
Promotion of the lock granularity on the prior tuple is where we
have problems. If the two tuple versions are in separate pages
then the second UPDATE could miss the conflict. My first thought
was to fix that by requiring promotion of a predicate lock on a
tuple to jump straight to the relation level if nextVersionOfRow
is set for the lock target and it points to a tuple in a different
page. But that doesn't cover a situation where we have a heap
tuple predicate lock which gets promoted to page granularity
before the tuple is updated. To handle that we would need to say
that an UPDATE to a tuple on a page which is predicate locked by
the transaction would need to be promoted to relation granularity
if the new version of the tuple wasn't on the same page as the old
version.

Yeah, promoting the original lock on the UPDATE was my first
thought too.

Another idea is to duplicate the original predicate lock on the
first update, so that the original reader holds a lock on both row
versions. I think that would ultimately be simpler as we wouldn't
need the next-prior chains anymore.

For example, suppose that transaction X is holding a predicate lock
on tuple A. Transaction Y updates tuple A, creating a new tuple B.
Transaction Y sees that X holds a lock on tuple A (or the page
containing A), so it acquires a new predicate lock on tuple B on
behalf of X.

If the updater aborts, the lock on the new tuple needs to be
cleaned up, so that it doesn't get confused with a later tuple that's
stored in the same physical location. We could store the xmin of
the tuple in the predicate lock to check for that. Whenever you
check for conflict, if the xmin of the lock doesn't match the xmin
on the tuple, you know that the lock belonged to an old dead tuple
stored in the same location, and can be simply removed as the tuple
doesn't exist anymore.

That said, the above is about eliminating false negatives from
some corner cases which escaped notice until now. I don't think
the changes described above will do anything to prevent the
problems reported by YAMAMOTO Takashi.

Agreed, it's a separate issue. Although if we change the way we
handle the read-update-update problem, the other issue might go
away too.

Unless I'm missing something, it sounds like tuple IDs are being
changed or reused while predicate locks are held on the tuples.
That's probably not going to be overwhelmingly hard to fix if we
can identify how that can happen. I tried to cover HOT issues, but
it seems likely I missed something.

Storing the xmin of the original tuple would probably help with
that too. But it would be nice to understand and be able to
reproduce the issue first.
I haven't been able to produce a test case yet, but I'm getting clear
enough about the issue that I think I can see my way to a good fix.
Even if I have a fix first, I'll continue to try to create a test to
show the pre-fix failure (and post-fix success), to include in the
regression suite. This is the sort of thing which is hard enough to
hit that a regression could slip in because of some change and make
it out in a release unnoticed if we aren't specifically testing for
it.
It's pretty clear that the issue is this -- we are cleaning up the
predicate locks for a transaction when the transaction which read the
data becomes irrelevant; but a heap tuple predicate locked by a
transaction may be updated or deleted and become eligible for cleanup
before the transaction becomes irrelevant. This can lead to cycles in
the target links if a tuple ID is reused before the transaction is
cleaned up.
Since we may want to detect the rw-conflicts with the reading
transaction based on later versions of a row after the tuple it read
has been updated and the old tuple removed and its ID reused,
Heikki's suggestion of predicate lock duplication in these cases goes
beyond being a simple way to avoid the particular symptoms reported;
it is actually necessary for correct behavior. One adjustment I'm
looking at for it is that I think the tuple xmin can go in the lock
target structure rather than the lock structure, because if the tuple
ID has already been reused, it's pretty clear that there are no
transactions which can now update the old tuple which had that ID, so
any predicate locks attached to a target with the old xmin can be
released and the target reused with the new xmin.
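Roughly, with hypothetical names (not the real PREDICATELOCKTARGETTAG),
the idea is that the xmin becomes part of the identity of the target, so
a reused tuple ID with a new xmin no longer matches the old target:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;
typedef unsigned int BlockNumber;
typedef unsigned short OffsetNumber;

/* hypothetical tuple-level target tag that includes the tuple's xmin */
typedef struct TupleTargetTag
{
	BlockNumber page;
	OffsetNumber offset;
	TransactionId xmin;
} TupleTargetTag;

/* two tags identify the same target only if page, offset and xmin match */
static bool
tag_matches(const TupleTargetTag *a, const TupleTargetTag *b)
{
	return a->page == b->page &&
		a->offset == b->offset &&
		a->xmin == b->xmin;
}

int
main(void)
{
	TupleTargetTag locked = {7, 3, 1000};	/* target for the old tuple */
	TupleTargetTag reused = {7, 3, 2345};	/* same ctid, reused by a new tuple */

	/*
	 * The old target no longer matches, so any predicate locks hanging off
	 * it can be released and the target reused with the new xmin.
	 */
	printf("matches old target: %d\n", tag_matches(&locked, &reused));
	return 0;
}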
This does, however, raise the question of what happens if the tuple
lock has been promoted to a page lock and a tuple on that page is
non-HOT updated. (Clearly, it's not an issue if the locks on a heap
relation for a transaction have been promoted all the way to the
relation granularity, but it's equally clear we don't want to jump to
that granularity when we can avoid it.)
It seems to me that in addition to what Heikki suggested, we need to
create a predicate lock on the new tuple or its (new) page for every
transaction holding a predicate lock on the old page whenever a tuple
on a locked page is updated. We won't know whether the updated tuple
is one that was read, or by which of the transactions holding page
locks, but we need to act as though it was read by all of them to
prevent false negatives. Covering this just at the tuple level
doesn't seem adequate.
It looks like these changes will be very localized and unlikely to
break other things. We're talking about corner cases which haven't
surfaced in the hundreds of existing SSI regression tests, and for
which it is hard to actually create a reproducible test case. The
HOT updates tended to expose the issues because of the more
aggressive cleanup in HOT pruning. That's going to be important to
exploit in creating a reproducible test case.
I'm proceeding on this basis. Any comments welcome.
-Kevin
"Kevin Grittner" wrote:
I'm proceeding on this basis.
Result attached. I found myself passing around the tuple xmin value
just about everywhere that the predicate lock target tag was being
passed, so it finally dawned on me that this logically belonged as
part of the target tag. That simplified the changes, and the net
result of following Heikki's suggestion here is the reduction of
total lines of code by 178 while adding coverage for missed corner
cases and fixing bugs.
Thanks again, Heikki!
I will test this some more tomorrow. So far I haven't done more than
ensure it passes the standard regression tests and the isolation
tests added for SSI. The latter took awhile because the hash_any
function was including uninitialized bytes past the length of the tag
in its calculations. We should probably either fix that or document
it. I had to add another couple bytes to the tag to get it to a four
byte boundary to fix it. Easy once you know that's how it works...
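For what it's worth, a toy example of the trap (not the real tag or hash
code, just the shape of the problem), assuming a byte-wise hash that reads
the struct in whole 4-byte words:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy tag: three 32-bit fields and one 16-bit field.  Without the filler,
 * the compiler would still round the struct up to a 4-byte boundary,
 * leaving two unnamed padding bytes that field assignment never
 * initializes.  Making the padding an explicit field and zeroing the
 * whole struct keeps a byte-wise hash over sizeof(Tag) bytes
 * deterministic.
 */
typedef struct Tag
{
	uint32_t	field1;
	uint32_t	field2;
	uint32_t	field3;
	uint16_t	field4;
	uint16_t	filler;			/* explicit, always set to zero */
} Tag;

static void
set_tag(Tag *tag, uint32_t f1, uint32_t f2, uint32_t f3, uint16_t f4)
{
	memset(tag, 0, sizeof(*tag));	/* no stack garbage anywhere in the tag */
	tag->field1 = f1;
	tag->field2 = f2;
	tag->field3 = f3;
	tag->field4 = f4;
}

int
main(void)
{
	Tag			t;

	set_tag(&t, 1, 2, 3, 4);
	printf("sizeof(Tag) = %zu\n", sizeof(Tag));
	return 0;
}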
The attached patch can also be viewed here:
If this stands up to further testing, the only issue I know of for
the 9.1 release is to update the documentation of shared memory usage
to include the new structures.
-Kevin
Attachment: ssi-multi-update-1.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 124,133 ****
* SerializableXactHashLock
* - Protects both PredXact and SerializableXidHash.
*
- * PredicateLockNextRowLinkLock
- * - Protects the priorVersionOfRow and nextVersionOfRow fields of
- * PREDICATELOCKTARGET when linkage is being created or destroyed.
- *
*
* Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--- 124,129 ----
***************
*** 444,451 **** static void ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
bool summarize);
static bool XidIsConcurrent(TransactionId xid);
static void CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag);
- static bool CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
- PREDICATELOCKTARGETTAG *nexttargettag);
static void FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer);
static void OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader,
SERIALIZABLEXACT *writer);
--- 440,445 ----
***************
*** 1044,1050 **** InitPredicateLocks(void)
PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
PredXact->CanPartialClearThrough = 0;
PredXact->HavePartialClearedThrough = 0;
- PredXact->NeedTargetLinkCleanup = false;
requestSize = mul_size((Size) max_table_size,
PredXactListElementDataSize);
PredXact->element = ShmemAlloc(requestSize);
--- 1038,1043 ----
***************
*** 1747,1753 **** static void
RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
{
PREDICATELOCKTARGET *rmtarget;
- PREDICATELOCKTARGET *next;
Assert(LWLockHeldByMe(SerializablePredicateLockListLock));
--- 1740,1745 ----
***************
*** 1755,1787 **** RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
if (!SHMQueueEmpty(&target->predicateLocks))
return;
- /* Can't remove it if there are locks for a prior row version. */
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_EXCLUSIVE);
- if (target->priorVersionOfRow != NULL)
- {
- LWLockRelease(PredicateLockNextRowLinkLock);
- return;
- }
-
- /*
- * We are going to release this target, This requires that we let the
- * next version of the row (if any) know that it's previous version is
- * done.
- *
- * It might be that the link was all that was keeping the other target
- * from cleanup, but we can't clean that up here -- LW locking is all
- * wrong for that. We'll pass the HTAB in the general cleanup function to
- * get rid of such "dead" targets.
- */
- next = target->nextVersionOfRow;
- if (next != NULL)
- {
- next->priorVersionOfRow = NULL;
- if (SHMQueueEmpty(&next->predicateLocks))
- PredXact->NeedTargetLinkCleanup = true;
- }
- LWLockRelease(PredicateLockNextRowLinkLock);
-
/* Actually remove the target. */
rmtarget = hash_search_with_hash_value(PredicateLockTargetHash,
&target->tag,
--- 1747,1752 ----
***************
*** 2065,2075 **** CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
errmsg("out of shared memory"),
errhint("You might need to increase max_pred_locks_per_transaction.")));
if (!found)
- {
SHMQueueInit(&(target->predicateLocks));
- target->priorVersionOfRow = NULL;
- target->nextVersionOfRow = NULL;
- }
/* We've got the sxact and target, make sure they're joined. */
locktag.myTarget = target;
--- 2030,2036 ----
***************
*** 2215,2220 **** PredicateLockTuple(const Relation relation, const HeapTuple tuple)
--- 2176,2182 ----
{
PREDICATELOCKTARGETTAG tag;
ItemPointer tid;
+ TransactionId targetxmin;
if (SkipSerialization(relation))
return;
***************
*** 2224,2238 **** PredicateLockTuple(const Relation relation, const HeapTuple tuple)
*/
if (relation->rd_index == NULL)
{
! TransactionId myxid = GetTopTransactionIdIfAny();
if (TransactionIdIsValid(myxid))
{
! TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
!
! if (TransactionIdFollowsOrEquals(xid, TransactionXmin))
{
! xid = SubTransGetTopmostTransaction(xid);
if (TransactionIdEquals(xid, myxid))
{
/* We wrote it; we already have a write lock. */
--- 2186,2201 ----
*/
if (relation->rd_index == NULL)
{
! TransactionId myxid;
+ targetxmin = HeapTupleHeaderGetXmin(tuple->t_data);
+
+ myxid = GetTopTransactionIdIfAny();
if (TransactionIdIsValid(myxid))
{
! if (TransactionIdFollowsOrEquals(targetxmin, TransactionXmin))
{
! TransactionId xid = SubTransGetTopmostTransaction(targetxmin);
if (TransactionIdEquals(xid, myxid))
{
/* We wrote it; we already have a write lock. */
***************
*** 2241,2246 **** PredicateLockTuple(const Relation relation, const HeapTuple tuple)
--- 2204,2211 ----
}
}
}
+ else
+ targetxmin = InvalidTransactionId;
/*
* Do quick-but-not-definitive test for a relation lock first. This will
***************
*** 2259,2374 **** PredicateLockTuple(const Relation relation, const HeapTuple tuple)
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(tid),
! ItemPointerGetOffsetNumber(tid));
PredicateLockAcquire(&tag);
}
/*
! * If the old tuple has any predicate locks, create a lock target for the
! * new tuple and point them at each other. Conflict detection needs to
! * look for locks against prior versions of the row.
*/
void
PredicateLockTupleRowVersionLink(const Relation relation,
const HeapTuple oldTuple,
const HeapTuple newTuple)
{
! PREDICATELOCKTARGETTAG oldtargettag;
PREDICATELOCKTARGETTAG newtargettag;
- PREDICATELOCKTARGET *oldtarget;
- PREDICATELOCKTARGET *newtarget;
- PREDICATELOCKTARGET *next;
- uint32 oldtargettaghash;
- LWLockId oldpartitionLock;
- uint32 newtargettaghash;
- LWLockId newpartitionLock;
- bool found;
! SET_PREDICATELOCKTARGETTAG_TUPLE(oldtargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(oldTuple->t_self)),
! ItemPointerGetOffsetNumber(&(oldTuple->t_self)));
! oldtargettaghash = PredicateLockTargetTagHashCode(&oldtargettag);
! oldpartitionLock = PredicateLockHashPartitionLock(oldtargettaghash);
SET_PREDICATELOCKTARGETTAG_TUPLE(newtargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(newTuple->t_self)),
! ItemPointerGetOffsetNumber(&(newTuple->t_self)));
! newtargettaghash = PredicateLockTargetTagHashCode(&newtargettag);
! newpartitionLock = PredicateLockHashPartitionLock(newtargettaghash);
! /* Lock lower numbered partition first. */
! if (oldpartitionLock < newpartitionLock)
! {
! LWLockAcquire(oldpartitionLock, LW_SHARED);
! LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
! }
! else if (newpartitionLock < oldpartitionLock)
! {
! LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
! LWLockAcquire(oldpartitionLock, LW_SHARED);
! }
! else
! LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
! oldtarget = (PREDICATELOCKTARGET *)
! hash_search_with_hash_value(PredicateLockTargetHash,
! &oldtargettag, oldtargettaghash,
! HASH_FIND, NULL);
! /* Only need to link if there is an old target already. */
! if (oldtarget)
! {
! LWLockAcquire(PredicateLockNextRowLinkLock, LW_EXCLUSIVE);
!
! /* Guard against stale pointers from rollback. */
! next = oldtarget->nextVersionOfRow;
! if (next != NULL)
! {
! next->priorVersionOfRow = NULL;
! oldtarget->nextVersionOfRow = NULL;
! }
!
! /* Find or create the new target, and link old and new. */
! newtarget = (PREDICATELOCKTARGET *)
! hash_search_with_hash_value(PredicateLockTargetHash,
! &newtargettag, newtargettaghash,
! HASH_ENTER, &found);
! if (!newtarget)
! ereport(ERROR,
! (errcode(ERRCODE_OUT_OF_MEMORY),
! errmsg("out of shared memory"),
! errhint("You might need to increase max_pred_locks_per_transaction.")));
! if (!found)
! {
! SHMQueueInit(&(newtarget->predicateLocks));
! newtarget->nextVersionOfRow = NULL;
! }
! else
! Assert(newtarget->priorVersionOfRow == NULL);
!
! newtarget->priorVersionOfRow = oldtarget;
! oldtarget->nextVersionOfRow = newtarget;
!
! LWLockRelease(PredicateLockNextRowLinkLock);
! }
!
! /* Release lower number partition last. */
! if (oldpartitionLock < newpartitionLock)
! {
! LWLockRelease(newpartitionLock);
! LWLockRelease(oldpartitionLock);
! }
! else if (newpartitionLock < oldpartitionLock)
! {
! LWLockRelease(oldpartitionLock);
! LWLockRelease(newpartitionLock);
! }
! else
! LWLockRelease(newpartitionLock);
}
--- 2224,2271 ----
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(tid),
! ItemPointerGetOffsetNumber(tid),
! targetxmin);
PredicateLockAcquire(&tag);
}
/*
! * If the old tuple has any predicate locks, copy them to the new target.
*/
void
PredicateLockTupleRowVersionLink(const Relation relation,
const HeapTuple oldTuple,
const HeapTuple newTuple)
{
! PREDICATELOCKTARGETTAG oldtupletargettag;
! PREDICATELOCKTARGETTAG oldpagetargettag;
PREDICATELOCKTARGETTAG newtargettag;
! SET_PREDICATELOCKTARGETTAG_TUPLE(oldtupletargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(oldTuple->t_self)),
! ItemPointerGetOffsetNumber(&(oldTuple->t_self)),
! HeapTupleHeaderGetXmin(oldTuple->t_data));
!
! SET_PREDICATELOCKTARGETTAG_PAGE(oldpagetargettag,
! relation->rd_node.dbNode,
! relation->rd_id,
! ItemPointerGetBlockNumber(&(oldTuple->t_self)));
SET_PREDICATELOCKTARGETTAG_TUPLE(newtargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(newTuple->t_self)),
! ItemPointerGetOffsetNumber(&(newTuple->t_self)),
! HeapTupleHeaderGetXmin(newTuple->t_data));
! LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
! TransferPredicateLocksToNewTarget(oldtupletargettag, newtargettag, false);
! TransferPredicateLocksToNewTarget(oldpagetargettag, newtargettag, false);
! LWLockRelease(SerializablePredicateLockListLock);
}
***************
*** 2533,2543 **** TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
/* If we created a new entry, initialize it */
if (!found)
- {
SHMQueueInit(&(newtarget->predicateLocks));
- newtarget->priorVersionOfRow = NULL;
- newtarget->nextVersionOfRow = NULL;
- }
newpredlocktag.myTarget = newtarget;
--- 2430,2436 ----
***************
*** 3132,3140 **** ClearOldPredicateLocks(void)
{
SERIALIZABLEXACT *finishedSxact;
PREDICATELOCK *predlock;
- int i;
- HASH_SEQ_STATUS seqstat;
- PREDICATELOCKTARGET *locktarget;
LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
finishedSxact = (SERIALIZABLEXACT *)
--- 3025,3030 ----
***************
*** 3232,3266 **** ClearOldPredicateLocks(void)
LWLockRelease(SerializablePredicateLockListLock);
LWLockRelease(SerializableFinishedListLock);
-
- if (!PredXact->NeedTargetLinkCleanup)
- return;
-
- /*
- * Clean up any targets which were disconnected from a prior version with
- * no predicate locks attached.
- */
- for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
-
- hash_seq_init(&seqstat, PredicateLockTargetHash);
- while ((locktarget = (PREDICATELOCKTARGET *) hash_seq_search(&seqstat)))
- {
- if (SHMQueueEmpty(&locktarget->predicateLocks)
- && locktarget->priorVersionOfRow == NULL
- && locktarget->nextVersionOfRow == NULL)
- {
- hash_search(PredicateLockTargetHash, &locktarget->tag,
- HASH_REMOVE, NULL);
- }
- }
-
- PredXact->NeedTargetLinkCleanup = false;
-
- LWLockRelease(PredicateLockNextRowLinkLock);
- for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
}
/*
--- 3122,3127 ----
***************
*** 3682,3713 **** CheckForSerializableConflictOut(const bool visible, const Relation relation,
static void
CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
{
- PREDICATELOCKTARGETTAG nexttargettag = { 0 };
- PREDICATELOCKTARGETTAG thistargettag;
-
- for (;;)
- {
- if (!CheckSingleTargetForConflictsIn(targettag, &nexttargettag))
- break;
- thistargettag = nexttargettag;
- targettag = &thistargettag;
- }
- }
-
- /*
- * Check a particular target for rw-dependency conflict in. If the tuple
- * has prior versions, returns true and *nexttargettag is set to the tag
- * of the prior tuple version.
- */
- static bool
- CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
- PREDICATELOCKTARGETTAG *nexttargettag)
- {
uint32 targettaghash;
LWLockId partitionLock;
PREDICATELOCKTARGET *target;
PREDICATELOCK *predlock;
- bool hasnexttarget = false;
Assert(MySerializableXact != InvalidSerializableXact);
--- 3543,3552 ----
***************
*** 3717,3723 **** CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
targettaghash = PredicateLockTargetTagHashCode(targettag);
partitionLock = PredicateLockHashPartitionLock(targettaghash);
LWLockAcquire(partitionLock, LW_SHARED);
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
target = (PREDICATELOCKTARGET *)
hash_search_with_hash_value(PredicateLockTargetHash,
targettag, targettaghash,
--- 3556,3561 ----
***************
*** 3725,3747 **** CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
if (!target)
{
/* Nothing has this target locked; we're done here. */
- LWLockRelease(PredicateLockNextRowLinkLock);
LWLockRelease(partitionLock);
! return false;
}
/*
- * If the target is linked to a prior version of the row, save the tag so
- * that it can be used for iterative calls to this function.
- */
- if (target->priorVersionOfRow != NULL)
- {
- *nexttargettag = target->priorVersionOfRow->tag;
- hasnexttarget = true;
- }
- LWLockRelease(PredicateLockNextRowLinkLock);
-
- /*
* Each lock for an overlapping transaction represents a conflict: a
* rw-dependency in to this transaction.
*/
--- 3563,3573 ----
if (!target)
{
/* Nothing has this target locked; we're done here. */
LWLockRelease(partitionLock);
! return;
}
/*
* Each lock for an overlapping transaction represents a conflict: a
* rw-dependency in to this transaction.
*/
***************
*** 3848,3854 **** CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
* the target, bail out before re-acquiring the locks.
*/
if (rmtarget)
! return hasnexttarget;
/*
* The list has been altered. Start over at the front.
--- 3674,3680 ----
* the target, bail out before re-acquiring the locks.
*/
if (rmtarget)
! return;
/*
* The list has been altered. Start over at the front.
***************
*** 3895,3902 **** CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
}
LWLockRelease(SerializableXactHashLock);
LWLockRelease(partitionLock);
-
- return hasnexttarget;
}
/*
--- 3721,3726 ----
***************
*** 3943,3949 **** CheckForSerializableConflictIn(const Relation relation, const HeapTuple tuple,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(tuple->t_data->t_ctid)),
! ItemPointerGetOffsetNumber(&(tuple->t_data->t_ctid)));
CheckTargetForConflictsIn(&targettag);
}
--- 3767,3774 ----
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(tuple->t_data->t_ctid)),
! ItemPointerGetOffsetNumber(&(tuple->t_data->t_ctid)),
! HeapTupleHeaderGetXmin(tuple->t_data));
CheckTargetForConflictsIn(&targettag);
}
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 78,84 **** typedef enum LWLockId
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
- PredicateLockNextRowLinkLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
--- 78,83 ----
*** a/src/include/storage/predicate_internals.h
--- b/src/include/storage/predicate_internals.h
***************
*** 150,157 **** typedef struct PredXactListData
SerCommitSeqNo HavePartialClearedThrough; /* have cleared through this
* seq no */
SERIALIZABLEXACT *OldCommittedSxact; /* shared copy of dummy sxact */
- bool NeedTargetLinkCleanup; /* to save cleanup effort for rare
- * case */
PredXactListElement element;
} PredXactListData;
--- 150,155 ----
***************
*** 231,239 **** typedef struct SERIALIZABLEXID
/*
* The PREDICATELOCKTARGETTAG struct identifies a database object which can
! * be the target of predicate locks. It is designed to fit into 16 bytes
! * with no padding. Note that this would need adjustment if we widen Oid or
! * BlockNumber to more than 32 bits.
*
* TODO SSI: If we always use the same fields for the same type of value, we
* should rename these. Holding off until it's clear there are no exceptions.
--- 229,241 ----
/*
* The PREDICATELOCKTARGETTAG struct identifies a database object which can
! * be the target of predicate locks.
! *
! * locktag_field6 was added to allow slack space to be filled so that stack-
! * allocated tags wouldn't have random bytes in the middle. Moving the only
! * uint16 to the end didn't work, because the hash function being used
! * doesn't properly respect tag length -- it will go to a four byte boundary
! * past the end of the tag.
*
* TODO SSI: If we always use the same fields for the same type of value, we
* should rename these. Holding off until it's clear there are no exceptions.
***************
*** 247,254 **** typedef struct PREDICATELOCKTARGETTAG
uint32 locktag_field1; /* a 32-bit ID field */
uint32 locktag_field2; /* a 32-bit ID field */
uint32 locktag_field3; /* a 32-bit ID field */
uint16 locktag_field4; /* a 16-bit ID field */
! uint16 locktag_field5; /* a 16-bit ID field */
} PREDICATELOCKTARGETTAG;
/*
--- 249,257 ----
uint32 locktag_field1; /* a 32-bit ID field */
uint32 locktag_field2; /* a 32-bit ID field */
uint32 locktag_field3; /* a 32-bit ID field */
+ uint32 locktag_field5; /* a 32-bit ID field */
uint16 locktag_field4; /* a 16-bit ID field */
! uint16 locktag_field6; /* filler */
} PREDICATELOCKTARGETTAG;
/*
***************
*** 260,271 **** typedef struct PREDICATELOCKTARGETTAG
* already have one. An entry is removed when the last lock is removed from
* its list.
*
! * Because a check for predicate locks on a tuple target should also find
! * locks on previous versions of the same row, if there are any created by
! * overlapping transactions, we keep a pointer to the target for the prior
! * version of the row. We also keep a pointer to the next version of the
! * row, so that when we no longer have any predicate locks and the back
! * pointer is clear, we can clean up the prior pointer for the next version.
*/
typedef struct PREDICATELOCKTARGET PREDICATELOCKTARGET;
--- 263,273 ----
* already have one. An entry is removed when the last lock is removed from
* its list.
*
! * Because a particular target might become obsolete, due to update to a new
! * version, before the reading transaction is obsolete, we need some way to
! * prevent errors from reuse of a tuple ID. Rather than attempting to clean
! * up the targets as the related tuples are pruned or vacuumed, we check the
! * xmin on access. This should be far less costly.
*/
typedef struct PREDICATELOCKTARGET PREDICATELOCKTARGET;
***************
*** 277,291 **** struct PREDICATELOCKTARGET
/* data */
SHM_QUEUE predicateLocks; /* list of PREDICATELOCK objects assoc. with
* predicate lock target */
-
- /*
- * The following two pointers are only used for tuple locks, and are only
- * consulted for conflict detection and cleanup; not for granularity
- * promotion.
- */
- PREDICATELOCKTARGET *priorVersionOfRow; /* what other locks to check */
- PREDICATELOCKTARGET *nextVersionOfRow; /* who has pointer here for
- * more targets */
};
--- 279,284 ----
***************
*** 387,407 **** typedef struct PredicateLockData
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = InvalidBlockNumber, \
(locktag).locktag_field4 = InvalidOffsetNumber, \
! (locktag).locktag_field5 = 0)
#define SET_PREDICATELOCKTARGETTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = InvalidOffsetNumber, \
! (locktag).locktag_field5 = 0)
! #define SET_PREDICATELOCKTARGETTAG_TUPLE(locktag,dboid,reloid,blocknum,offnum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = (offnum), \
! (locktag).locktag_field5 = 0)
#define GET_PREDICATELOCKTARGETTAG_DB(locktag) \
((locktag).locktag_field1)
--- 380,403 ----
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = InvalidBlockNumber, \
(locktag).locktag_field4 = InvalidOffsetNumber, \
! (locktag).locktag_field5 = InvalidTransactionId, \
! (locktag).locktag_field6 = 0)
#define SET_PREDICATELOCKTARGETTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = InvalidOffsetNumber, \
! (locktag).locktag_field5 = InvalidTransactionId, \
! (locktag).locktag_field6 = 0)
! #define SET_PREDICATELOCKTARGETTAG_TUPLE(locktag,dboid,reloid,blocknum,offnum,xmin) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = (offnum), \
! (locktag).locktag_field5 = (xmin), \
! (locktag).locktag_field6 = 0)
#define GET_PREDICATELOCKTARGETTAG_DB(locktag) \
((locktag).locktag_field1)
***************
*** 411,416 **** typedef struct PredicateLockData
--- 407,414 ----
((locktag).locktag_field3)
#define GET_PREDICATELOCKTARGETTAG_OFFSET(locktag) \
((locktag).locktag_field4)
+ #define GET_PREDICATELOCKTARGETTAG_XMIN(locktag) \
+ ((locktag).locktag_field5)
#define GET_PREDICATELOCKTARGETTAG_TYPE(locktag) \
(((locktag).locktag_field4 != InvalidOffsetNumber) ? PREDLOCKTAG_TUPLE : \
(((locktag).locktag_field3 != InvalidBlockNumber) ? PREDLOCKTAG_PAGE : \
hi,
"Kevin Grittner" wrote:
I'm proceeding on this basis.
Result attached. I found myself passing around the tuple xmin value
just about everywhere that the predicate lock target tag was being
passed, so it finally dawned on me that this logically belonged as
part of the target tag. That simplified the changes, and the net
result of following Heikki's suggestion here is the reduction of
total lines of code by 178 while adding coverage for missed corner
cases and fixing bugs.
Thanks again, Heikki!
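For illustration, a minimal sketch of what carrying the xmin inside the tuple
lock tag looks like at a call site (assuming a relation and tuple in scope;
this mirrors the revised SET_PREDICATELOCKTARGETTAG_TUPLE macro in the
attached patch rather than quoting the patch itself):

/*
 * Sketch only: build a tuple predicate lock tag that carries the tuple's
 * xmin.  A later row that reuses the same TID will have a different xmin,
 * so it can never be confused with this target.
 */
PREDICATELOCKTARGETTAG tag;
ItemPointer	tid = &(tuple->t_self);

SET_PREDICATELOCKTARGETTAG_TUPLE(tag,
								 relation->rd_node.dbNode,
								 relation->rd_id,
								 ItemPointerGetBlockNumber(tid),
								 ItemPointerGetOffsetNumber(tid),
								 HeapTupleHeaderGetXmin(tuple->t_data));
PredicateLockAcquire(&tag);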
I will test this some more tomorrow. So far I haven't done more than
ensure it passes the standard regression tests and the isolation
tests added for SSI. The latter took a while because the hash_any
function was including uninitialized bytes past the length of the tag
in its calculations. We should probably either fix that or document
it. I had to add another couple bytes to the tag to get it to a four
byte boundary to fix it. Easy once you know that's how it works...
The attached patch can also be viewed here:
If this stands up to further testing, the only issue I know of for
the 9.1 release is to update the documentation of shared memory usage
to include the new structures.
-Kevin
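As a rough illustration of the hashing pitfall described above (illustrative
variable names only, not the committed fix): a stack-allocated tag whose
trailing bytes are compiler padding can hash differently on every call unless
those bytes are cleared; the attached patch instead widens the tag with an
explicit filler field that the SET_* macros always zero.

/*
 * Illustration only: if the tag ends in uninitialized padding, hashing the
 * whole struct gives unstable results.  Zeroing the struct first is one
 * defense; the patch's approach is an explicit, always-zeroed filler field.
 */
PREDICATELOCKTARGETTAG tag;
uint32		hash;

memset(&tag, 0, sizeof(tag));	/* no stray padding bytes in the hash */
SET_PREDICATELOCKTARGETTAG_PAGE(tag, dbOid, relOid, blkno);
hash = DatumGetUInt32(hash_any((const unsigned char *) &tag, sizeof(tag)));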
i tested ede45e90dd1992bfd3e1e61ce87bad494b81f54d + ssi-multi-update-1.patch
with my application and got the following assertion failure.
YAMAMOTO Takashi
Core was generated by `postgres'.
Program terminated with signal 6, Aborted.
#0 0xbbba4cc7 in _lwp_kill () from /usr/lib/libc.so.12
(gdb) bt
#0 0xbbba4cc7 in _lwp_kill () from /usr/lib/libc.so.12
#1 0xbbba4c85 in raise (s=6) at /siro/nbsd/src/lib/libc/gen/raise.c:48
#2 0xbbba445a in abort () at /siro/nbsd/src/lib/libc/stdlib/abort.c:74
#3 0x0833d9f2 in ExceptionalCondition (
conditionName=0x8493f6d "!(locallock != ((void *)0))",
errorType=0x8370451 "FailedAssertion", fileName=0x8493ec3 "predicate.c",
lineNumber=3657) at assert.c:57
#4 0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
at predicate.c:3657
#5 0x0827bd74 in CheckForSerializableConflictIn (relation=0x99b5bad4,
tuple=0xbfbfcf78, buffer=234) at predicate.c:3772
#6 0x080a5be8 in heap_update (relation=0x99b5bad4, otid=0xbfbfd038,
newtup=0x99a0aa08, ctid=0xbfbfd03e, update_xmax=0xbfbfd044, cid=1,
crosscheck=0x0, wait=1 '\001') at heapam.c:2593
#7 0x081c81ca in ExecModifyTable (node=0x99a0566c) at nodeModifyTable.c:624
#8 0x081b2153 in ExecProcNode (node=0x99a0566c) at execProcnode.c:371
#9 0x081b0f82 in standard_ExecutorRun (queryDesc=0x99b378f0,
direction=ForwardScanDirection, count=0) at execMain.c:1247
#10 0xbbaaf352 in pgss_ExecutorRun (queryDesc=0x99b378f0,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:541
#11 0xbbab5ee5 in explain_ExecutorRun (queryDesc=0x99b378f0,
direction=ForwardScanDirection, count=0) at auto_explain.c:201
#12 0x08288ae3 in ProcessQuery (plan=0x99bb404c,
sourceText=0x99b3781c "UPDATE file SET ctime = $1 WHERE fileid = $2",
params=<value optimized out>, dest=0x84d1f38, completionTag=0xbfbfd2f0 "")
at pquery.c:197
#13 0x08288d0a in PortalRunMulti (portal=0x99b3f01c,
isTopLevel=<value optimized out>, dest=dwarf2_read_address: Corrupted DWARF expression.
) at pquery.c:1268
#14 0x0828984a in PortalRun (portal=0x99b3f01c, count=2147483647,
isTopLevel=1 '\001', dest=0x99b07428, altdest=0x99b07428,
completionTag=0xbfbfd2f0 "") at pquery.c:822
#15 0x08286c22 in PostgresMain (argc=2, argv=0xbb9196a4,
username=0xbb9195f8 "takashi") at postgres.c:2004
#16 0x082413f6 in ServerLoop () at postmaster.c:3590
#17 0x082421a8 in PostmasterMain (argc=3, argv=0xbfbfe594) at postmaster.c:1110
#18 0x081e0d09 in main (argc=3, argv=0xbfbfe594) at main.c:199
(gdb) fr 4
#4 0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
at predicate.c:3657
3657 Assert(locallock != NULL);
(gdb) list
3652
3653 locallock = (LOCALPREDICATELOCK *)
3654 hash_search_with_hash_value(LocalPredicateLockHash,
3655 targettag, targettaghash,
3656 HASH_FIND, NULL);
3657 Assert(locallock != NULL);
3658 Assert(locallock->held);
3659 locallock->held = false;
3660
3661 if (locallock->childLocks == 0)
(gdb)
On Mon, Feb 21, 2011 at 11:42:36PM +0000, YAMAMOTO Takashi wrote:
i tested ede45e90dd1992bfd3e1e61ce87bad494b81f54d + ssi-multi-update-1.patch
with my application and got the following assertion failure.
#4 0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
at predicate.c:3657
3657 Assert(locallock != NULL);
It looks like CheckTargetForConflictsIn is making the assumption that
the backend-local lock table is accurate, which was probably even true
at the time it was written. Unfortunately, it hasn't been for a while,
and the new changes for tuple versions make it more likely that this
will actually come up.
The solution is only slightly more complicated than just removing the
assertion. Unless Kevin beats me to it, I'll put together a patch later
tonight or tomorrow. (I'm at the airport right now.)
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote:
It looks like CheckTargetForConflictsIn is making the assumption
that the backend-local lock table is accurate, which was probably
even true at the time it was written.
I remember we decided that it could only be false in certain ways
which allowed us to use it as a "lossy" first-cut test in a couple
places. I doubt that we can count on any of that any longer, and
should drop those heuristics.
the new changes for tuple versions make it more likely that this
will actually come up.
Yeah, as far as I can recall the only divergence was in *page*
level entries for *indexes* until this latest patch. We now have
*tuple* level entries for *heap* relations, too.
The solution is only slightly more complicated than just removing
the assertion.
That's certainly true for that one spot, but we need an overall
review of where we might be trying to use LocalPredicateLockHash for
"first cut" tests as an optimization.
Unless Kevin beats me to it, I'll put together a patch later
tonight or tomorrow. (I'm at the airport right now.)
It would be great if you could get this one. Thanks.
-Kevin
On Tue, Feb 22, 2011 at 10:51:05AM -0600, Kevin Grittner wrote:
Dan Ports <drkp@csail.mit.edu> wrote:
It looks like CheckTargetForConflictsIn is making the assumption
that the backend-local lock table is accurate, which was probably
even true at the time it was written.
I remember we decided that it could only be false in certain ways
which allowed us to use it as a "lossy" first-cut test in a couple
places. I doubt that we can count on any of that any longer, and
should drop those heuristics.
Hmm. The theory was before that the local lock table would only have
false negatives, i.e. if it says we hold a lock then we really do. That
makes it a useful heuristic because we can bail out quickly if we're
trying to re-acquire a lock we already hold. It seems worthwhile to try
to preserve that.
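To make the value of that property concrete, here is a simplified sketch of
the fast path it enables, using the helper functions that appear in the patch
(the real logic lives in PredicateLockAcquire; this is not the committed
code):

/*
 * Simplified sketch: because the local table can only be wrong by missing
 * an entry, a hit means the lock (or a covering one) is definitely held,
 * so we can return without touching shared memory.  A miss just falls
 * through to the slow, LWLock-protected path.
 */
if (PredicateLockExists(targettag))
	return;					/* already hold this exact lock */
if (CoarserLockCovers(targettag))
	return;					/* already hold a covering lock */
/* ... otherwise acquire the lock in shared memory ... */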
This property holds as long as the only time one backend edits
another's lock list is to *add* a new lock, not remove an existing one.
Fortunately, this is true most of the time (at a glance, it seems to be
the case for the new tuple update code).
There are two exceptions; I think they're both OK but we need to be
careful here.
- if we're forced to promote an index page lock's granularity to avoid
running out of shared memory, we remove the fine-grained one. This
is fine as long as we relax our expectations to "if the local
lock table says we hold a lock, then we hold that lock or one that
covers it", which is acceptable for current users of the table.
- if we combine two index pages, we remove the lock entry for the page
being deleted. I think that's OK because the page is being removed
so we will not make any efforts to lock it.
Yeah, as far as I can recall the only divergence was in *page*
level entries for *indexes* until this latest patch. We now have
*tuple* level entries for *heap* relations, too.
*nod*. I'm slightly concerned about the impact of that on granularity
promotion -- the new locks created by heap updates won't get counted
toward the lock promotion thresholds. That's not a fatal problem of
anything, but it could make granularity promotion less effective at
conserving lock entries.
The solution is only slightly more complicated than just removing
the assertion.
That's certainly true for that one spot, but we need an overall
review of where we might be trying to use LocalPredicateLockHash for
"first cut" tests as an optimization.
Yes, I'd planned to go through the references to LocalPredicateLockHash
to make sure none of them were making any unwarranted assumptions about
the results. Fortunately, there are not too many of them...
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote:
On Tue, Feb 22, 2011 at 10:51:05AM -0600, Kevin Grittner wrote:
The theory was before that the local lock table would only have
false negatives, i.e. if it says we hold a lock then we really do.
That makes it a useful heuristic because we can bail out quickly
if we're trying to re-acquire a lock we already hold. It seems
worthwhile to try to preserve that.
This property holds as long as the only time one backend edits
another's lock list is to *add* a new lock, not remove an existing
one. Fortunately, this is true most of the time (at a glance, it
seems to be the case for the new tuple update code).
There are two exceptions; I think they're both OK but we need to
be careful here.
- if we're forced to promote an index page lock's granularity to
avoid running out of shared memory, we remove the fine-grained
one. This is fine as long as we relax our expectations to "if
the local lock table says we hold a lock, then we hold that
lock or one that covers it", which is acceptable for current
users of the table.
That makes sense. This one sounds OK.
- if we combine two index pages, we remove the lock entry for the
page being deleted. I think that's OK because the page is being
removed so we will not make any efforts to lock it.
I'm not sure it's safe to assume that the index page won't get
reused before the local lock information is cleared. In the absence
of a clear proof that it is safe, or some enforcement mechanism to
ensure that it is, I don't think we should make this assumption.
Off-hand I can't think of a clever way to make this safe which would
cost less than taking out the LW lock and checking the definitive
shared memory HTAB, but that might be for lack of creative thinking
at the moment..
Yeah, as far as I can recall the only divergence was in *page*
level entries for *indexes* until this latest patch. We now have
*tuple* level entries for *heap* relations, too.
*nod*. I'm slightly concerned about the impact of that on
granularity promotion -- the new locks created by heap updates
won't get counted toward the lock promotion thresholds. That's not
a fatal problem of anything, but it could make granularity
promotion less effective at conserving lock entries.
This pattern doesn't seem to come up very often in most workloads.
Since it's feeding into a heuristic which already triggers pretty
quickly I think we should be OK with this. It makes me less tempted
to tinker with the threshold for promoting tuple locks to page
locks, though.
The only alternative I see would be to use some form of asynchronous
notification of the new locks so that the local table can be
maintained. That seems overkill without some clear evidence that it
is needed. I *really* wouldn't want to go back to needing LW locks
to maintain this info; as you know (and stated only for the benefit
of the list), it was a pretty serious contention point in early
profiling and adding the local table was a big part of getting an
early benchmark down from a 14+% performance hit for SSI to a 1.8%
performance hit.
-Kevin
On Tue, Feb 22, 2011 at 05:54:49PM -0600, Kevin Grittner wrote:
I'm not sure it's safe to assume that the index page won't get
reused before the local lock information is cleared. In the absence
of a clear proof that it is safe, or some enforcement mechanism to
ensure that it is, I don't think we should make this assumption.
Off-hand I can't think of a clever way to make this safe which would
cost less than taking out the LW lock and checking the definitive
shared memory HTAB, but that might be for lack of creative thinking
at the moment..
Hmm. Yeah, I wasn't sure about that one, and having now thought about
it some more I think it isn't safe -- consider adding a lock on an
index page concurrently with another backend merging that page into
another one.
The obvious solution to me is to just keep the lock on both the old and
new page. The downside is that because this requires allocating a new
lock and is in a context where we're not allowed to fail, we'll need to
fall back on acquiring the relation lock just as we do for page splits.
I was going to bemoan the extra complexity this would add -- but
actually, couldn't we just replace PredicateLockPageCombine with a call
to PredicateLockPageSplit since they'd now do the same thing?
The only alternative I see would be to use some form of asynchronous
notification of the new locks so that the local table can be
maintained. That seems overkill without some clear evidence that it
is needed.
I agree. It is certainly weird and undesirable that the backend-local
lock table is not always accurate, but I don't see a good way to keep
it up to date without the cure being worse than the disease.
I *really* wouldn't want to go back to needing LW locks
to maintain this info; as you know (and stated only for the benefit
of the list), it was a pretty serious contention point in early
profiling and adding the local table was a big part of getting an
early benchmark down from a 14+% performance hit for SSI to a 1.8%
performance hit.
Yes, it's definitely important for a backend to be able to check
whether it's already holding a lock (even if that's just a hint)
without having to take locks.
Let me add one more piece of info for the benefit of the list: a
backend's local lock table contains not just locks held by the backend,
but also an entry and refcount for every parent of a lock it holds.
This is used to determine when to promote to one of the coarser-grained
parent locks. It's both unnecessary and undesirable for that info to be
in shared memory.
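A rough sketch of that bookkeeping, simplified from the real code and using a
hypothetical function name, might look like this:

/*
 * Rough sketch (hypothetical name): each lock acquisition bumps a child
 * count on every coarser-grained ancestor tag in the backend-local table;
 * the granularity-promotion logic consults these counts.
 */
static void
NoteChildLocks(PREDICATELOCKTARGETTAG tag)
{
	PREDICATELOCKTARGETTAG parenttag;

	while (GetParentPredicateLockTag(&tag, &parenttag))
	{
		uint32		hash = PredicateLockTargetTagHashCode(&parenttag);
		LOCALPREDICATELOCK *parent;
		bool		found;

		parent = (LOCALPREDICATELOCK *)
			hash_search_with_hash_value(LocalPredicateLockHash,
										&parenttag, hash,
										HASH_ENTER, &found);
		if (!found)
		{
			parent->held = false;
			parent->childLocks = 0;
		}
		parent->childLocks++;	/* counts toward the promotion threshold */
		tag = parenttag;
	}
}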
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports wrote:
The obvious solution to me is to just keep the lock on both the old
and new page.
That's the creative thinking I was failing to do. Keeping the old
lock will generate some false positives, but it will be rare and
those don't compromise correctness -- they just carry the cost of
starting the transaction over. In exchange you buy a savings in
predicate lock acquisition. In particular, by dodging LW locking for
a large percentage of calls, you help scalability. It can't cause
false negatives, so correctness is OK -- it's all about net costs.
I was going to bemoan the extra complexity this would add -- but
actually, couldn't we just replace PredicateLockPageCombine with a
call to PredicateLockPageSplit since they'd now do the same thing?
I'd be inclined to leave the external interface alone, in case we
conceive of an even better implementation. We can just remove the
removeOld bool from the parameter list of the static
TransferPredicateLocksToNewTarget function, and keep the behavior
from the "false" case.
-Kevin
On 23.02.2011 07:20, Kevin Grittner wrote:
Dan Ports wrote:
The obvious solution to me is to just keep the lock on both the old
and new page.
That's the creative thinking I was failing to do. Keeping the old
lock will generate some false positives, but it will be rare and
those don't compromise correctness -- they just carry the cost of
starting the transaction over.
Sounds reasonable, but let me throw in another idea while we're at it:
if there's a lock on the index page we're about to delete, we could just
choose to not delete it. The next vacuum will pick it up. Presumably it
will happen rarely, so index bloat won't be an issue.
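A minimal sketch of that idea, with a purely hypothetical placement in the
btree page-deletion path (PageIsPredicateLocked is the existing SSI helper;
everything else here is assumed context, not a proposed patch):

/*
 * Sketch only: before deleting an index page, check whether any predicate
 * lock still references it.  If so, skip the deletion for now; a later
 * vacuum will reconsider the page once the lock is gone.
 */
if (PageIsPredicateLocked(rel, blkno))
{
	/* Leave the page in place; rare enough that bloat is not a concern. */
}
else
{
	/* ... proceed with the normal page deletion ... */
}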
I was going to bemoan the extra complexity this would add -- but
actually, couldn't we just replace PredicateLockPageCombine with a
call to PredicateLockPageSplit since they'd now do the same thing?
I'd be inclined to leave the external interface alone, in case we
conceive of an even better implementation.
Agreed.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 23.02.2011 07:20, Kevin Grittner wrote:
Dan Ports wrote:
The obvious solution to me is to just keep the lock on both the
old and new page.
That's the creative thinking I was failing to do. Keeping the
old lock will generate some false positives, but it will be rare
and those don't compromise correctness -- they just carry the
cost of starting the transaction over.
Sounds reasonable, but let me throw in another idea while we're at
it: if there's a lock on the index page we're about to delete, we
could just choose to not delete it. The next vacuum will pick it
up. Presumably it will happen rarely, so index bloat won't be an
issue.
Yeah, that's probably better.
-Kevin
An updated patch to address this issue is attached. It fixes a couple
issues related to use of the backend-local lock table hint:
- CheckSingleTargetForConflictsIn now correctly handles the case
where a lock that's being held is not reflected in the local lock
table. This fixes the assertion failure reported in this thread.
- PredicateLockPageCombine now retains locks for the page that is
being removed, rather than removing them. This prevents a
potentially dangerous false-positive inconsistency where the local
lock table believes that a lock is held, but it is actually not.
- add some more comments documenting the times when the local lock
table can be inconsistent with reality, as reflected in the shared
memory table.
This patch also incorporates Kevin's changes to copy locks when
creating a new version of a tuple rather than trying to maintain a
linkage between different versions. So this is a patch that should
apply against HEAD and addresses all outstanding SSI bugs known to
Kevin or myself.
Besides the usual regression and isolation tests, I have tested this
by running DBT-2 on a 16-core machine to verify that there are no
assertion failures that only show up under concurrent access.
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Attachments:
ssi-fixes.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ba01874..7a0e1a9c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -824,7 +824,6 @@ restart:
if (_bt_page_recyclable(page))
{
/* Okay to recycle this page */
- Assert(!PageIsPredicateLocked(rel, blkno));
RecordFreeIndexPage(rel, blkno);
vstate->totFreePages++;
stats->pages_deleted++;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d660ce5..580af2a 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -124,10 +124,6 @@
* SerializableXactHashLock
* - Protects both PredXact and SerializableXidHash.
*
- * PredicateLockNextRowLinkLock
- * - Protects the priorVersionOfRow and nextVersionOfRow fields of
- * PREDICATELOCKTARGET when linkage is being created or destroyed.
- *
*
* Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -444,8 +440,6 @@ static void ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
bool summarize);
static bool XidIsConcurrent(TransactionId xid);
static void CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag);
-static bool CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
- PREDICATELOCKTARGETTAG *nexttargettag);
static void FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer);
static void OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader,
SERIALIZABLEXACT *writer);
@@ -1044,7 +1038,6 @@ InitPredicateLocks(void)
PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
PredXact->CanPartialClearThrough = 0;
PredXact->HavePartialClearedThrough = 0;
- PredXact->NeedTargetLinkCleanup = false;
requestSize = mul_size((Size) max_table_size,
PredXactListElementDataSize);
PredXact->element = ShmemAlloc(requestSize);
@@ -1651,9 +1644,11 @@ PageIsPredicateLocked(const Relation relation, const BlockNumber blkno)
* Important note: this function may return false even if the lock is
* being held, because it uses the local lock table which is not
* updated if another transaction modifies our lock list (e.g. to
- * split an index page). However, it will never return true if the
- * lock is not held. We only use this function in circumstances where
- * such false negatives are acceptable.
+ * split an index page). However, it will almost never return true if
+ * the lock is not held; it can only do so in rare circumstances when
+ * a coarser-granularity lock that covers this one is being held. We
+ * are careful to only use this function in circumstances where such
+ * errors are acceptable.
*/
static bool
PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag)
@@ -1717,6 +1712,9 @@ GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
/*
* Check whether the lock we are considering is already covered by a
* coarser lock for our transaction.
+ *
+ * Like PredicateLockExists, this function might return a false
+ * negative, but it will never return a false positive.
*/
static bool
CoarserLockCovers(const PREDICATELOCKTARGETTAG *newtargettag)
@@ -1747,7 +1745,6 @@ static void
RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
{
PREDICATELOCKTARGET *rmtarget;
- PREDICATELOCKTARGET *next;
Assert(LWLockHeldByMe(SerializablePredicateLockListLock));
@@ -1755,33 +1752,6 @@ RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
if (!SHMQueueEmpty(&target->predicateLocks))
return;
- /* Can't remove it if there are locks for a prior row version. */
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_EXCLUSIVE);
- if (target->priorVersionOfRow != NULL)
- {
- LWLockRelease(PredicateLockNextRowLinkLock);
- return;
- }
-
- /*
- * We are going to release this target, This requires that we let the
- * next version of the row (if any) know that it's previous version is
- * done.
- *
- * It might be that the link was all that was keeping the other target
- * from cleanup, but we can't clean that up here -- LW locking is all
- * wrong for that. We'll pass the HTAB in the general cleanup function to
- * get rid of such "dead" targets.
- */
- next = target->nextVersionOfRow;
- if (next != NULL)
- {
- next->priorVersionOfRow = NULL;
- if (SHMQueueEmpty(&next->predicateLocks))
- PredXact->NeedTargetLinkCleanup = true;
- }
- LWLockRelease(PredicateLockNextRowLinkLock);
-
/* Actually remove the target. */
rmtarget = hash_search_with_hash_value(PredicateLockTargetHash,
&target->tag,
@@ -2065,11 +2035,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
errmsg("out of shared memory"),
errhint("You might need to increase max_pred_locks_per_transaction.")));
if (!found)
- {
SHMQueueInit(&(target->predicateLocks));
- target->priorVersionOfRow = NULL;
- target->nextVersionOfRow = NULL;
- }
/* We've got the sxact and target, make sure they're joined. */
locktag.myTarget = target;
@@ -2125,8 +2091,6 @@ PredicateLockAcquire(const PREDICATELOCKTARGETTAG *targettag)
hash_search_with_hash_value(LocalPredicateLockHash,
targettag, targettaghash,
HASH_ENTER, &found);
- /* We should not hold the lock (but its entry might still exist) */
- Assert(!found || !locallock->held);
locallock->held = true;
if (!found)
locallock->childLocks = 0;
@@ -2215,6 +2179,7 @@ PredicateLockTuple(const Relation relation, const HeapTuple tuple)
{
PREDICATELOCKTARGETTAG tag;
ItemPointer tid;
+ TransactionId targetxmin;
if (SkipSerialization(relation))
return;
@@ -2224,15 +2189,16 @@ PredicateLockTuple(const Relation relation, const HeapTuple tuple)
*/
if (relation->rd_index == NULL)
{
- TransactionId myxid = GetTopTransactionIdIfAny();
+ TransactionId myxid;
+
+ targetxmin = HeapTupleHeaderGetXmin(tuple->t_data);
+ myxid = GetTopTransactionIdIfAny();
if (TransactionIdIsValid(myxid))
{
- TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
-
- if (TransactionIdFollowsOrEquals(xid, TransactionXmin))
+ if (TransactionIdFollowsOrEquals(targetxmin, TransactionXmin))
{
- xid = SubTransGetTopmostTransaction(xid);
+ TransactionId xid = SubTransGetTopmostTransaction(targetxmin);
if (TransactionIdEquals(xid, myxid))
{
/* We wrote it; we already have a write lock. */
@@ -2241,6 +2207,8 @@ PredicateLockTuple(const Relation relation, const HeapTuple tuple)
}
}
}
+ else
+ targetxmin = InvalidTransactionId;
/*
* Do quick-but-not-definitive test for a relation lock first. This will
@@ -2259,116 +2227,48 @@ PredicateLockTuple(const Relation relation, const HeapTuple tuple)
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(tid),
- ItemPointerGetOffsetNumber(tid));
+ ItemPointerGetOffsetNumber(tid),
+ targetxmin);
PredicateLockAcquire(&tag);
}
/*
- * If the old tuple has any predicate locks, create a lock target for the
- * new tuple and point them at each other. Conflict detection needs to
- * look for locks against prior versions of the row.
+ * If the old tuple has any predicate locks, copy them to the new target.
*/
void
PredicateLockTupleRowVersionLink(const Relation relation,
const HeapTuple oldTuple,
const HeapTuple newTuple)
{
- PREDICATELOCKTARGETTAG oldtargettag;
+ PREDICATELOCKTARGETTAG oldtupletargettag;
+ PREDICATELOCKTARGETTAG oldpagetargettag;
PREDICATELOCKTARGETTAG newtargettag;
- PREDICATELOCKTARGET *oldtarget;
- PREDICATELOCKTARGET *newtarget;
- PREDICATELOCKTARGET *next;
- uint32 oldtargettaghash;
- LWLockId oldpartitionLock;
- uint32 newtargettaghash;
- LWLockId newpartitionLock;
- bool found;
- SET_PREDICATELOCKTARGETTAG_TUPLE(oldtargettag,
+ SET_PREDICATELOCKTARGETTAG_TUPLE(oldtupletargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(oldTuple->t_self)),
- ItemPointerGetOffsetNumber(&(oldTuple->t_self)));
- oldtargettaghash = PredicateLockTargetTagHashCode(&oldtargettag);
- oldpartitionLock = PredicateLockHashPartitionLock(oldtargettaghash);
+ ItemPointerGetOffsetNumber(&(oldTuple->t_self)),
+ HeapTupleHeaderGetXmin(oldTuple->t_data));
+
+ SET_PREDICATELOCKTARGETTAG_PAGE(oldpagetargettag,
+ relation->rd_node.dbNode,
+ relation->rd_id,
+ ItemPointerGetBlockNumber(&(oldTuple->t_self)));
SET_PREDICATELOCKTARGETTAG_TUPLE(newtargettag,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(newTuple->t_self)),
- ItemPointerGetOffsetNumber(&(newTuple->t_self)));
- newtargettaghash = PredicateLockTargetTagHashCode(&newtargettag);
- newpartitionLock = PredicateLockHashPartitionLock(newtargettaghash);
+ ItemPointerGetOffsetNumber(&(newTuple->t_self)),
+ HeapTupleHeaderGetXmin(newTuple->t_data));
- /* Lock lower numbered partition first. */
- if (oldpartitionLock < newpartitionLock)
- {
- LWLockAcquire(oldpartitionLock, LW_SHARED);
- LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
- }
- else if (newpartitionLock < oldpartitionLock)
- {
- LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldpartitionLock, LW_SHARED);
- }
- else
- LWLockAcquire(newpartitionLock, LW_EXCLUSIVE);
-
- oldtarget = (PREDICATELOCKTARGET *)
- hash_search_with_hash_value(PredicateLockTargetHash,
- &oldtargettag, oldtargettaghash,
- HASH_FIND, NULL);
-
- /* Only need to link if there is an old target already. */
- if (oldtarget)
- {
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_EXCLUSIVE);
-
- /* Guard against stale pointers from rollback. */
- next = oldtarget->nextVersionOfRow;
- if (next != NULL)
- {
- next->priorVersionOfRow = NULL;
- oldtarget->nextVersionOfRow = NULL;
- }
-
- /* Find or create the new target, and link old and new. */
- newtarget = (PREDICATELOCKTARGET *)
- hash_search_with_hash_value(PredicateLockTargetHash,
- &newtargettag, newtargettaghash,
- HASH_ENTER, &found);
- if (!newtarget)
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of shared memory"),
- errhint("You might need to increase max_pred_locks_per_transaction.")));
- if (!found)
- {
- SHMQueueInit(&(newtarget->predicateLocks));
- newtarget->nextVersionOfRow = NULL;
- }
- else
- Assert(newtarget->priorVersionOfRow == NULL);
+ LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
- newtarget->priorVersionOfRow = oldtarget;
- oldtarget->nextVersionOfRow = newtarget;
+ TransferPredicateLocksToNewTarget(oldtupletargettag, newtargettag, false);
+ TransferPredicateLocksToNewTarget(oldpagetargettag, newtargettag, false);
- LWLockRelease(PredicateLockNextRowLinkLock);
- }
-
- /* Release lower number partition last. */
- if (oldpartitionLock < newpartitionLock)
- {
- LWLockRelease(newpartitionLock);
- LWLockRelease(oldpartitionLock);
- }
- else if (newpartitionLock < oldpartitionLock)
- {
- LWLockRelease(oldpartitionLock);
- LWLockRelease(newpartitionLock);
- }
- else
- LWLockRelease(newpartitionLock);
+ LWLockRelease(SerializablePredicateLockListLock);
}
@@ -2437,6 +2337,17 @@ DeleteLockTarget(PREDICATELOCKTARGET *target, uint32 targettaghash)
* removeOld is set (by using the reserved entry in
* PredicateLockTargetHash for scratch space).
*
+ * Warning: the "removeOld" option should be used only with care,
+ * because this function does not (indeed, can not) update other
+ * backends' LocalPredicateLockHash. If we are only adding new
+ * entries, this is not a problem: the local lock table is used only
+ * as a hint, so missing entries for locks that are held are
+ * OK. Having entries for locks that are no longer held, as can happen
+ * when using "removeOld", is not in general OK. We can only use it
+ * safely when replacing a lock with a coarser-granularity lock that
+ * covers it, or if we are absolutely certain that no one will need to
+ * refer to that lock in the future.
+ *
* Caller must hold SerializablePredicateLockListLock.
*/
static bool
@@ -2533,11 +2444,7 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
/* If we created a new entry, initialize it */
if (!found)
- {
SHMQueueInit(&(newtarget->predicateLocks));
- newtarget->priorVersionOfRow = NULL;
- newtarget->nextVersionOfRow = NULL;
- }
newpredlocktag.myTarget = newtarget;
@@ -2704,7 +2611,14 @@ PredicateLockPageSplit(const Relation relation, const BlockNumber oldblkno,
&newtargettag);
Assert(success);
- /* Move the locks to the parent. This shouldn't fail. */
+ /*
+ * Move the locks to the parent. This shouldn't fail.
+ *
+ * Note that here we are removing locks held by other
+ * backends, leading to a possible inconsistency in their
+ * local lock hash table. This is OK because we're replacing
+ * it with a lock that covers the old one.
+ */
success = TransferPredicateLocksToNewTarget(oldtargettag,
newtargettag,
true);
@@ -2727,36 +2641,17 @@ void
PredicateLockPageCombine(const Relation relation, const BlockNumber oldblkno,
const BlockNumber newblkno)
{
- PREDICATELOCKTARGETTAG oldtargettag;
- PREDICATELOCKTARGETTAG newtargettag;
- bool success;
-
-
- if (SkipSplitTracking(relation))
- return;
-
- Assert(oldblkno != newblkno);
- Assert(BlockNumberIsValid(oldblkno));
- Assert(BlockNumberIsValid(newblkno));
-
- SET_PREDICATELOCKTARGETTAG_PAGE(oldtargettag,
- relation->rd_node.dbNode,
- relation->rd_id,
- oldblkno);
- SET_PREDICATELOCKTARGETTAG_PAGE(newtargettag,
- relation->rd_node.dbNode,
- relation->rd_id,
- newblkno);
-
- LWLockAcquire(SerializablePredicateLockListLock, LW_EXCLUSIVE);
-
- /* Move the locks. This shouldn't fail. */
- success = TransferPredicateLocksToNewTarget(oldtargettag,
- newtargettag,
- true);
- Assert(success);
-
- LWLockRelease(SerializablePredicateLockListLock);
+ /*
+ * Page combines differ from page splits in that we ought to be
+ * able to remove the locks on the old page after transferring
+ * them to the new page, instead of duplicating them. However,
+ * because we can't edit other backends' local lock tables,
+ * removing the old lock would leave them with an entry in their
+ * LocalPredicateLockHash for a lock they're not holding, which
+ * isn't acceptable. So we wind up having to do the same work as a
+ * page split.
+ */
+ PredicateLockPageSplit(relation, oldblkno, newblkno);
}
/*
@@ -3132,9 +3027,6 @@ ClearOldPredicateLocks(void)
{
SERIALIZABLEXACT *finishedSxact;
PREDICATELOCK *predlock;
- int i;
- HASH_SEQ_STATUS seqstat;
- PREDICATELOCKTARGET *locktarget;
LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
finishedSxact = (SERIALIZABLEXACT *)
@@ -3232,35 +3124,6 @@ ClearOldPredicateLocks(void)
LWLockRelease(SerializablePredicateLockListLock);
LWLockRelease(SerializableFinishedListLock);
-
- if (!PredXact->NeedTargetLinkCleanup)
- return;
-
- /*
- * Clean up any targets which were disconnected from a prior version with
- * no predicate locks attached.
- */
- for (i = 0; i < NUM_PREDICATELOCK_PARTITIONS; i++)
- LWLockAcquire(FirstPredicateLockMgrLock + i, LW_EXCLUSIVE);
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
-
- hash_seq_init(&seqstat, PredicateLockTargetHash);
- while ((locktarget = (PREDICATELOCKTARGET *) hash_seq_search(&seqstat)))
- {
- if (SHMQueueEmpty(&locktarget->predicateLocks)
- && locktarget->priorVersionOfRow == NULL
- && locktarget->nextVersionOfRow == NULL)
- {
- hash_search(PredicateLockTargetHash, &locktarget->tag,
- HASH_REMOVE, NULL);
- }
- }
-
- PredXact->NeedTargetLinkCleanup = false;
-
- LWLockRelease(PredicateLockNextRowLinkLock);
- for (i = NUM_PREDICATELOCK_PARTITIONS - 1; i >= 0; i--)
- LWLockRelease(FirstPredicateLockMgrLock + i);
}
/*
@@ -3682,32 +3545,10 @@ CheckForSerializableConflictOut(const bool visible, const Relation relation,
static void
CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
{
- PREDICATELOCKTARGETTAG nexttargettag = { 0 };
- PREDICATELOCKTARGETTAG thistargettag;
-
- for (;;)
- {
- if (!CheckSingleTargetForConflictsIn(targettag, &nexttargettag))
- break;
- thistargettag = nexttargettag;
- targettag = &thistargettag;
- }
-}
-
-/*
- * Check a particular target for rw-dependency conflict in. If the tuple
- * has prior versions, returns true and *nexttargettag is set to the tag
- * of the prior tuple version.
- */
-static bool
-CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
- PREDICATELOCKTARGETTAG *nexttargettag)
-{
uint32 targettaghash;
LWLockId partitionLock;
PREDICATELOCKTARGET *target;
PREDICATELOCK *predlock;
- bool hasnexttarget = false;
Assert(MySerializableXact != InvalidSerializableXact);
@@ -3717,7 +3558,6 @@ CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
targettaghash = PredicateLockTargetTagHashCode(targettag);
partitionLock = PredicateLockHashPartitionLock(targettaghash);
LWLockAcquire(partitionLock, LW_SHARED);
- LWLockAcquire(PredicateLockNextRowLinkLock, LW_SHARED);
target = (PREDICATELOCKTARGET *)
hash_search_with_hash_value(PredicateLockTargetHash,
targettag, targettaghash,
@@ -3725,21 +3565,9 @@ CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
if (!target)
{
/* Nothing has this target locked; we're done here. */
- LWLockRelease(PredicateLockNextRowLinkLock);
LWLockRelease(partitionLock);
- return false;
- }
-
- /*
- * If the target is linked to a prior version of the row, save the tag so
- * that it can be used for iterative calls to this function.
- */
- if (target->priorVersionOfRow != NULL)
- {
- *nexttargettag = target->priorVersionOfRow->tag;
- hasnexttarget = true;
+ return;
}
- LWLockRelease(PredicateLockNextRowLinkLock);
/*
* Each lock for an overlapping transaction represents a conflict: a
@@ -3828,17 +3656,25 @@ CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
hash_search_with_hash_value(LocalPredicateLockHash,
targettag, targettaghash,
HASH_FIND, NULL);
- Assert(locallock != NULL);
- Assert(locallock->held);
- locallock->held = false;
- if (locallock->childLocks == 0)
+ /*
+ * Remove entry in local lock table if it exists and has
+ * no children. It's OK if it doesn't exist; that means
+ * the lock was transferred to a new target by a
+ * different backend.
+ */
+ if (locallock != NULL)
{
- rmlocallock = (LOCALPREDICATELOCK *)
- hash_search_with_hash_value(LocalPredicateLockHash,
- targettag, targettaghash,
- HASH_REMOVE, NULL);
- Assert(rmlocallock == locallock);
+ locallock->held = false;
+
+ if (locallock->childLocks == 0)
+ {
+ rmlocallock = (LOCALPREDICATELOCK *)
+ hash_search_with_hash_value(LocalPredicateLockHash,
+ targettag, targettaghash,
+ HASH_REMOVE, NULL);
+ Assert(rmlocallock == locallock);
+ }
}
DecrementParentLocks(targettag);
@@ -3848,7 +3684,7 @@ CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
* the target, bail out before re-acquiring the locks.
*/
if (rmtarget)
- return hasnexttarget;
+ return;
/*
* The list has been altered. Start over at the front.
@@ -3895,8 +3731,6 @@ CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
}
LWLockRelease(SerializableXactHashLock);
LWLockRelease(partitionLock);
-
- return hasnexttarget;
}
/*
@@ -3943,7 +3777,8 @@ CheckForSerializableConflictIn(const Relation relation, const HeapTuple tuple,
relation->rd_node.dbNode,
relation->rd_id,
ItemPointerGetBlockNumber(&(tuple->t_data->t_ctid)),
- ItemPointerGetOffsetNumber(&(tuple->t_data->t_ctid)));
+ ItemPointerGetOffsetNumber(&(tuple->t_data->t_ctid)),
+ HeapTupleHeaderGetXmin(tuple->t_data));
CheckTargetForConflictsIn(&targettag);
}
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 99fe37e..ad0bcd7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -78,7 +78,6 @@ typedef enum LWLockId
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
- PredicateLockNextRowLinkLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/storage/predicate_internals.h b/src/include/storage/predicate_internals.h
index 32e9a1b..1c784f7 100644
--- a/src/include/storage/predicate_internals.h
+++ b/src/include/storage/predicate_internals.h
@@ -150,8 +150,6 @@ typedef struct PredXactListData
SerCommitSeqNo HavePartialClearedThrough; /* have cleared through this
* seq no */
SERIALIZABLEXACT *OldCommittedSxact; /* shared copy of dummy sxact */
- bool NeedTargetLinkCleanup; /* to save cleanup effort for rare
- * case */
PredXactListElement element;
} PredXactListData;
@@ -231,9 +229,13 @@ typedef struct SERIALIZABLEXID
/*
* The PREDICATELOCKTARGETTAG struct identifies a database object which can
- * be the target of predicate locks. It is designed to fit into 16 bytes
- * with no padding. Note that this would need adjustment if we widen Oid or
- * BlockNumber to more than 32 bits.
+ * be the target of predicate locks.
+ *
+ * locktag_field6 was added to allow slack space to be filled so that stack-
+ * allocated tags wouldn't have random bytes in the middle. Moving the only
+ * uint16 to the end didn't work, because the hash function being used
+ * doesn't properly respect tag length -- it will go to a four byte boundary
+ * past the end of the tag.
*
* TODO SSI: If we always use the same fields for the same type of value, we
* should rename these. Holding off until it's clear there are no exceptions.
@@ -247,8 +249,9 @@ typedef struct PREDICATELOCKTARGETTAG
uint32 locktag_field1; /* a 32-bit ID field */
uint32 locktag_field2; /* a 32-bit ID field */
uint32 locktag_field3; /* a 32-bit ID field */
+ uint32 locktag_field5; /* a 32-bit ID field */
uint16 locktag_field4; /* a 16-bit ID field */
- uint16 locktag_field5; /* a 16-bit ID field */
+ uint16 locktag_field6; /* filler */
} PREDICATELOCKTARGETTAG;
/*
@@ -260,12 +263,11 @@ typedef struct PREDICATELOCKTARGETTAG
* already have one. An entry is removed when the last lock is removed from
* its list.
*
- * Because a check for predicate locks on a tuple target should also find
- * locks on previous versions of the same row, if there are any created by
- * overlapping transactions, we keep a pointer to the target for the prior
- * version of the row. We also keep a pointer to the next version of the
- * row, so that when we no longer have any predicate locks and the back
- * pointer is clear, we can clean up the prior pointer for the next version.
+ * Because a particular target might become obsolete, due to update to a new
+ * version, before the reading transaction is obsolete, we need some way to
+ * prevent errors from reuse of a tuple ID. Rather than attempting to clean
+ * up the targets as the related tuples are pruned or vacuumed, we check the
+ * xmin on access. This should be far less costly.
*/
typedef struct PREDICATELOCKTARGET PREDICATELOCKTARGET;
@@ -277,15 +279,6 @@ struct PREDICATELOCKTARGET
/* data */
SHM_QUEUE predicateLocks; /* list of PREDICATELOCK objects assoc. with
* predicate lock target */
-
- /*
- * The following two pointers are only used for tuple locks, and are only
- * consulted for conflict detection and cleanup; not for granularity
- * promotion.
- */
- PREDICATELOCKTARGET *priorVersionOfRow; /* what other locks to check */
- PREDICATELOCKTARGET *nextVersionOfRow; /* who has pointer here for
- * more targets */
};
@@ -387,21 +380,24 @@ typedef struct PredicateLockData
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = InvalidBlockNumber, \
(locktag).locktag_field4 = InvalidOffsetNumber, \
- (locktag).locktag_field5 = 0)
+ (locktag).locktag_field5 = InvalidTransactionId, \
+ (locktag).locktag_field6 = 0)
#define SET_PREDICATELOCKTARGETTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = InvalidOffsetNumber, \
- (locktag).locktag_field5 = 0)
+ (locktag).locktag_field5 = InvalidTransactionId, \
+ (locktag).locktag_field6 = 0)
-#define SET_PREDICATELOCKTARGETTAG_TUPLE(locktag,dboid,reloid,blocknum,offnum) \
+#define SET_PREDICATELOCKTARGETTAG_TUPLE(locktag,dboid,reloid,blocknum,offnum,xmin) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
(locktag).locktag_field3 = (blocknum), \
(locktag).locktag_field4 = (offnum), \
- (locktag).locktag_field5 = 0)
+ (locktag).locktag_field5 = (xmin), \
+ (locktag).locktag_field6 = 0)
#define GET_PREDICATELOCKTARGETTAG_DB(locktag) \
((locktag).locktag_field1)
@@ -411,6 +407,8 @@ typedef struct PredicateLockData
((locktag).locktag_field3)
#define GET_PREDICATELOCKTARGETTAG_OFFSET(locktag) \
((locktag).locktag_field4)
+#define GET_PREDICATELOCKTARGETTAG_XMIN(locktag) \
+ ((locktag).locktag_field5)
#define GET_PREDICATELOCKTARGETTAG_TYPE(locktag) \
(((locktag).locktag_field4 != InvalidOffsetNumber) ? PREDLOCKTAG_TUPLE : \
(((locktag).locktag_field3 != InvalidBlockNumber) ? PREDLOCKTAG_PAGE : \
On 01.03.2011 02:03, Dan Ports wrote:
An updated patch to address this issue is attached. It fixes a couple
issues related to use of the backend-local lock table hint:
- CheckSingleTargetForConflictsIn now correctly handles the case
where a lock that's being held is not reflected in the local lock
table. This fixes the assertion failure reported in this thread.
- PredicateLockPageCombine now retains locks for the page that is
being removed, rather than removing them. This prevents a
potentially dangerous false-positive inconsistency where the local
lock table believes that a lock is held, but it is actually not.
- add some more comments documenting the times when the local lock
table can be inconsistent with reality, as reflected in the shared
memory table.
This patch also incorporates Kevin's changes to copy locks when
creating a new version of a tuple rather than trying to maintain a
linkage between different versions. So this is a patch that should
apply against HEAD and addresses all outstanding SSI bugs known to
Kevin or myself.
Thanks, committed with minor changes.
The ordering of the fields in PREDICATELOCKTAG was bizarre, so I just
expanded the offsetnumber fields to an uint32, instead of having the
padding field. I think that's a lot more readable.
I also added an optimization in PredicateLockTupleRowVersionLink() to
not try to transfer the page locks, if the new tuple is on the same page
as the old one. That's very cheap to check, and it's very common for an
update to stay within a page.
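Roughly, that check amounts to nothing more than comparing the two block
numbers before bothering with the page-level transfer. A sketch of the
shortcut, reusing the variable names from the patch above (not the committed
diff itself):

/*
 * Sketch: the tuple-level locks are still transferred, but the page-level
 * transfer can be skipped when the new tuple stays on the same heap page
 * as the old version.
 */
if (ItemPointerGetBlockNumber(&(oldTuple->t_self)) !=
	ItemPointerGetBlockNumber(&(newTuple->t_self)))
	TransferPredicateLocksToNewTarget(oldpagetargettag, newtargettag, false);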
Were there test cases for any of the issues fixed by this patch that we
should add to the suite?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
committed with minor changes.
Thanks!
The ordering of the fields in PREDICATELOCKTAG was bizarre, so I
just expanded the offsetnumber fields to an uint32, instead of
having the padding field. I think that's a lot more readable.
I can understand that, but I wonder if we shouldn't have a comment
mentioning that the offsetnumber field is larger than needed, in
case someone later needs to add another 16-bit field for some reason,
or we go to a hash function without the same quirks.
I also added an optimization in PredicateLockTupleRowVersionLink()
to not try to transfer the page locks, if the new tuple is on the
same page as the old one. That's very cheap to check, and it's
very common for an update to stay within a page.
Thanks. I had it in mind to do that, but lost track of it.
Definitely worth doing.
Were there test cases for any of the issues fixed by this patch
that we should add to the suite?
No, but I'm still intending to look at that some more. It makes me
nervous to have an area which would be pretty easy for someone to
break without any tests to catch such breakage.
-Kevin
On Tue, Mar 01, 2011 at 07:07:42PM +0200, Heikki Linnakangas wrote:
Were there test cases for any of the issues fixed by this patch that we
should add to the suite?
Some of these issues are tricky to test, e.g. some of the code about
transferring predicate locks to a new target doesn't get exercised
unless an index page gets split while there are concurrent
transactions holding locks on that page.
I have not been able to find a good way to test these other than
system-level testing using a concurrent workload (usually the DBT-2
benchmark). I'd certainly be open to suggestions!
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
hi,
thanks for quickly fixing problems.
i tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92)
and only errors i got was "out of shared memory". i'm not sure if
it was caused by SSI activities or not.
YAMAMOTO Takashi
the following is a snippet from my application log:
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
thanks for quickly fixing problems.
Thanks for the rigorous testing. :-)
i tested the later version
(a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and only errors i got
was "out of shared memory". i'm not sure if it was caused by SSI
activities or not.
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
Nor am I. Some additional information would help.
(1) Could you post the non-default configuration settings?
(2) How many connections are in use in your testing?
(3) Can you give a rough categorization of how many of what types
of transactions are in the mix?
(4) Are there any long-running transactions?
(5) How many of these errors do you get in what amount of time?
(6) Does the application continue to run relatively sanely, or does
it fall over at this point?
(7) The message hint would help pin it down, or a stack trace at
the point of the error would help more. Is it possible to get
either? Looking over the code, it appears that the only places that
SSI could generate that error, it would cancel that transaction with
the hint "You might need to increase
max_pred_locks_per_transaction." and otherwise allow normal
processing.
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could grab
more than its share and starve another which is staying below its
"maximum". I'll take a look at the possibility of adding a warning
or some such when an HTAB expands past its maximum size.
-Kevin
It would probably also be worth monitoring the size of pg_locks to see
how many predicate locks are being held.
On Fri, Mar 18, 2011 at 12:50:16PM -0500, Kevin Grittner wrote:
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could grab
more than its share and starve another which is staying below its
"maximum". I'll take a look at the possibility of adding a warning
or some such when an HTAB expands past its maximum size.
Yes -- considering how few shared memory HTABs have sizes that are
really dynamic, I'd be inclined to take a close look at SSI and
max_predicate_locks_per_transaction regardless of where the failed
allocation took place. But I am surprised to see that error message
without SSI's hint about increasing max_predicate_locks_per_xact.
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote:
I am surprised to see that error message without SSI's hint about
increasing max_predicate_locks_per_xact.
After reviewing this, I think something along the following lines
might be needed, for a start. I'm not sure the Asserts are actually
needed; they basically are checking that the current behavior of
hash_search doesn't change.
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its max_size
when it was created.
-Kevin
Attachments:
ssi-oosm-1.patch (text/plain)
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1604,12 +1604,7 @@ RegisterPredicateLockingXid(const TransactionId xid)
sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
&sxidtag,
HASH_ENTER, &found);
- if (!sxid)
- /* This should not be possible, based on allocation. */
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of shared memory")));
-
+ Assert(sxid != NULL);
Assert(!found);
/* Initialize the structure. */
@@ -2046,7 +2041,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
target = (PREDICATELOCKTARGET *)
hash_search_with_hash_value(PredicateLockTargetHash,
targettag, targettaghash,
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!target)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -2061,7 +2056,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
lock = (PREDICATELOCK *)
hash_search_with_hash_value(PredicateLockHash, &locktag,
PredicateLockHashCodeFromTargetHashCode(&locktag, targettaghash),
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!lock)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -3252,7 +3247,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
predlock = hash_search_with_hash_value(PredicateLockHash, &tag,
PredicateLockHashCodeFromTargetHashCode(&tag,
targettaghash),
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!predlock)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -4279,10 +4274,7 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
&sxidtag,
HASH_ENTER, &found);
- if (!sxid)
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of shared memory")));
+ Assert(sxid != NULL);
Assert(!found);
sxid->myXact = (SERIALIZABLEXACT *) sxact;
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its
max_size when it was created.
I think this does it.
If nothing else, it might be instructive to use it while testing the
SSI patch. Would it make any sense to slip this into 9.1, or should
I add it to the first 9.2 CF?
-Kevin
Attachments:
htab-excess-warning.patch (text/plain)
*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
***************
*** 268,273 **** ShmemInitHash(const char *name, /* table string name for shmem index */
--- 268,274 ----
*
* The shared memory allocator must be specified too.
*/
+ infoP->max_size = max_size;
infoP->dsize = infoP->max_dsize = hash_select_dirsize(max_size);
infoP->alloc = ShmemAlloc;
hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
*** a/src/backend/utils/hash/dynahash.c
--- b/src/backend/utils/hash/dynahash.c
***************
*** 129,134 **** struct HASHHDR
--- 129,135 ----
long ffactor; /* target fill factor */
long max_dsize; /* 'dsize' limit if directory is fixed size */
long ssize; /* segment size --- must be power of 2 */
+ long max_size; /* maximum number of entries expected */
int sshift; /* segment shift = log2(ssize) */
int nelem_alloc; /* number of entries to allocate at once */
***************
*** 368,373 **** hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
--- 369,375 ----
hdefault(hashp);
hctl = hashp->hctl;
+ hctl->max_size = info->max_size;
if (flags & HASH_PARTITION)
{
***************
*** 1333,1338 **** element_alloc(HTAB *hashp, int nelem)
--- 1335,1341 ----
HASHELEMENT *tmpElement;
HASHELEMENT *prevElement;
int i;
+ bool warningNeeded;
/* Each element has a HASHELEMENT header plus user data. */
elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctlv->entrysize);
***************
*** 1360,1369 **** element_alloc(HTAB *hashp, int nelem)
--- 1363,1378 ----
/* freelist could be nonempty if two backends did this concurrently */
firstElement->link = hctlv->freeList;
hctlv->freeList = prevElement;
+ warningNeeded = (hctlv->max_size > 0 && hctlv->nentries == hctlv->max_size);
if (IS_PARTITIONED(hctlv))
SpinLockRelease(&hctlv->mutex);
+ if (warningNeeded)
+ ereport(WARNING,
+ (errmsg("hash table \"%s\" has more entries than expected", hashp->tabname),
+ errdetail("The maximum was set to %li on creation.", hctlv->max_size)));
+
return true;
}
*** a/src/include/utils/hsearch.h
--- b/src/include/utils/hsearch.h
***************
*** 69,74 **** typedef struct HASHCTL
--- 69,75 ----
long dsize; /* (initial) directory size */
long max_dsize; /* limit to dsize if dir size is limited */
long ffactor; /* fill factor */
+ long max_size; /* maximum number of entries expected */
Size keysize; /* hash key length in bytes */
Size entrysize; /* total user element size in bytes */
HashValueFunc hash; /* hash function */
hi,
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
thanks for quickly fixing problems.
Thanks for the rigorous testing. :-)
i tested the later version
(a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and the only errors i got
were "out of shared memory". i'm not sure if they were caused by SSI
activities or not.
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
Nor am I. Some additional information would help.
(1) Could you post the non-default configuration settings?
none. it can happen with just initdb+createdb'ed database.
(2) How many connections are in use in your testing?
4.
(3) Can you give a rough categorization of how many of what types
of transactions are in the mix?
all transactions are SERIALIZABLE.
no transactions are with READ ONLY.
(but some of them are actually select-only.)
(4) Are there any long-running transactions?
no.
(5) How many of these errors do you get in what amount of time?
once it starts happening, i see them somewhat frequently.
(6) Does the application continue to run relatively sanely, or does
it fall over at this point?
my application just exits on the error.
if i re-run the application without rebooting postgres, it seems that
i will get the error sooner than the first run. (but it might be just
a matter of luck)
(7) The message hint would help pin it down, or a stack trace at
the point of the error would help more. Is it possible to get
either? Looking over the code, it appears that the only places that
SSI could generate that error, it would cancel that transaction with
the hint "You might need to increase
max_pred_locks_per_transaction." and otherwise allow normal
processing.
no message hints. these errors are not generated by SSI code,
at least directly.
(please look at PG_DIAG_SOURCE_xxx in the above log.)
YAMAMOTO Takashi
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could grab
more than its share and starve another which is staying below its
"maximum". I'll take a look at the possibility of adding a warning
or some such when an HTAB expands past its maximum size.-Kevin
On Fri, Mar 18, 2011 at 5:57 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its
max_size when it was created.
I think this does it.
If nothing else, it might be instructive to use it while testing the
SSI patch. Would it make any sense to slip this into 9.1, or should
I add it to the first 9.2 CF?
I don't think it's too late to commit something like this, but I'm not
clear on whether we want it. Is this checking for what should be a
can't-happen case, or are these soft limits that we expect to be
exceeded from time to time?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 18, 2011 at 4:51 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Dan Ports <drkp@csail.mit.edu> wrote:
I am surprised to see that error message without SSI's hint about
increasing max_predicate_locks_per_xact.
After reviewing this, I think something along the following lines
might be needed, for a start. I'm not sure the Asserts are actually
needed; they basically are checking that the current behavior of
hash_search doesn't change.
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its max_size
when it was created.
I don't see much advantage in changing these to asserts - in a debug
build, that will promote ERROR to PANIC; whereas in a production
build, they'll cause a random failure somewhere downstream.
The HASH_ENTER to HASH_ENTER_NULL changes look like they might be
needed, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Mar 18, 2011 at 5:57 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its
max_size when it was created.
I don't think it's too late to commit something like this, but I'm not
clear on whether we want it.
We do *not* want that.
Up to now, I believe the lockmgr's lock table is the only shared hash
table that is expected to grow past the declared size; that can happen
anytime a session exceeds max_locks_per_transaction, which we consider
to be only a soft limit. I don't know whether SSI has introduced any
other hash tables that behave similarly. (If it has, we might want to
rethink the amount of "slop" space we leave in shmem...)
There might perhaps be some value in adding a warning like this if it
were enabled per-table (and not enabled by default). But I'd want to
see a specific reason for it, not just "let's see if we can scare users
with a WARNING appearing out of nowhere".
regards, tom lane
On Fri, Mar 25, 2011 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Mar 18, 2011 at 5:57 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its
max_size when it was created.
I don't think it's too late to commit something like this, but I'm not
clear on whether we want it.
We do *not* want that.
Up to now, I believe the lockmgr's lock table is the only shared hash
table that is expected to grow past the declared size; that can happen
anytime a session exceeds max_locks_per_transaction, which we consider
to be only a soft limit. I don't know whether SSI has introduced any
other hash tables that behave similarly. (If it has, we might want to
rethink the amount of "slop" space we leave in shmem...)
There might perhaps be some value in adding a warning like this if it
were enabled per-table (and not enabled by default). But I'd want to
see a specific reason for it, not just "let's see if we can scare users
with a WARNING appearing out of nowhere".
What about a system view that shows declared & actual sizes of all
these hash tables?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 25, 2011 at 04:06:30PM -0400, Tom Lane wrote:
Up to now, I believe the lockmgr's lock table is the only shared hash
table that is expected to grow past the declared size; that can happen
anytime a session exceeds max_locks_per_transaction, which we consider
to be only a soft limit. I don't know whether SSI has introduced any
other hash tables that behave similarly. (If it has, we might want to
rethink the amount of "slop" space we leave in shmem...)
I looked into this recently and the two lockmgr tables were indeed the
only ones that could vary in size. IIRC, the only other shared hash
tables were the shared buffer index and the shmem index.
SSI adds two more analogous tables (per-lock-target and per-xact-lock),
both of which are sized according to max_pred_locks_per_transaction,
which is also a soft limit. It also adds a per-transaction shared hash
table, but that has a clear maximum size.
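(For reference, the sizing Dan describes comes from a macro in predicate.c
along these lines; this is a sketch from memory rather than a quotation, so
the exact spelling in the committed source may differ slightly. mul_size()
and add_size() are the usual overflow-checked shared-memory sizing helpers.)

/*
 * Sketch: both SSI lock tables are sized from the
 * max_pred_locks_per_transaction GUC (C variable
 * max_predicate_locks_per_xact) scaled by the number of possible
 * backends plus prepared transactions; the per-xact-lock table then
 * doubles that, as the InitPredicateLocks() excerpts later in this
 * thread show.
 */
#define NPREDICATELOCKTARGETENTS() \
	mul_size(max_predicate_locks_per_xact, \
			 add_size(MaxBackends, max_prepared_xacts))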
I find the soft limit on htab size a strange model, and I suspect it
might be a source of problems now that we've got more than one table
that can actually exceed its limit. (Particularly so given that once
shmem gets allocated to one htab, there's no getting it back.)
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Robert Haas wrote:
I don't see much advantage in changing these to asserts - in a
debug build, that will promote ERROR to PANIC; whereas in a
production build, they'll cause a random failure somewhere
downstream.
The reason Assert is appropriate is that it is *impossible* to hit
that condition right now for two independent reasons. Both would
need to be broken by someone doing clumsy programming changes for
these assertions to fire.
(1) Both assertions follow the attempt to get a record from the
structure allocated with this call:
SerializableXidHash = ShmemInitHash("SERIALIZABLEXID hash",
max_table_size,
max_table_size,
&info,
hash_flags);
Note that it is allocated at its maximum size, and that one of these
records can't exist without a corresponding SERIALIZABLEXACT
structure, and these are allocated out of a fixed-sized area which is
allocated based on the same max_table_size value.
(2) The preceding call to hash_search is made with a HASHACTION of
HASH_ENTER, not HASH_ENTER_NULL. So it can't return without a valid
pointer. It will ereport at the ERROR level rather than do that.
So I'm not sure the Assert is even warranted, versus just doing
nothing; but the ereport is certainly pointless.
The HASH_ENTER to HASH_ENTER_NULL changes look like they might be
needed, though.
That would provide a hint pointing toward the GUC which controls
allocation space for the structure we failed to get. That seems
better than the generic message we're now getting. If we don't
change them to HASH_ENTER_NULL we can get rid of our ereport since it
will never be hit.
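(To make the distinction concrete, here is a minimal sketch of the two
calling conventions being discussed. It is not a patch: the real
predicate.c call sites go through hash_search_with_hash_value() with the
hash codes shown in the patches elsewhere in this thread, and the
variables are assumed to be declared as in that file.)

	/*
	 * HASH_ENTER: dynahash itself raises the generic "out of shared
	 * memory" ERROR, so the return value can never be NULL here and an
	 * Assert is all that is worth keeping.
	 */
	sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
										   &sxidtag,
										   HASH_ENTER, &found);
	Assert(sxid != NULL);

	/*
	 * HASH_ENTER_NULL: for a shared-memory table, exhaustion comes back
	 * as NULL, letting the caller report the GUC that actually controls
	 * the allocation.
	 */
	lock = (PREDICATELOCK *) hash_search(PredicateLockHash,
										 &locktag,
										 HASH_ENTER_NULL, &found);
	if (!lock)
		ereport(ERROR,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of shared memory"),
				 errhint("You might need to increase max_pred_locks_per_transaction.")));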
Of course I'm sure Tom will clean such stuff up if it comes to him in
the current form; it just seems nicer to clean up known issues that
surface from testing before it gets to him, where possible.
-Kevin
Tom Lane wrote:
There might perhaps be some value in adding a warning like this if
it were enabled per-table (and not enabled by default).
It only fires where a maximum has been declared and is exceeded.
Most HTABs don't declare a maximum -- they leave it at zero. These
are ignored. Where this fires on a table in shared memory, we're
into a situation where the over-allocation to one table may cause
failure in an unrelated area. If we're not going to change that,
some indication of which one actually exceeded its limits seems like
a helpful bit of diagnostic information.
But I'd want to see a specific reason for it, not just "let's see
if we can scare users with a WARNING appearing out of nowhere".
Perhaps LOG would be more appropriate than WARNING?
-Kevin
YAMAMOTO Takashi wrote:
Kevin Grittner wrote:
(1) Could you post the non-default configuration settings?
none. it can happen with just initdb+createdb'ed database.
(2) How many connections are in use in your testing?
4.
(3) Can you give a rough categorization of how many of what types
of transactions are in the mix?
all transactions are SERIALIZABLE.
no transactions are with READ ONLY.
(but some of them are actually select-only.)
(4) Are there any long-running transactions?
no.
(5) How many of these errors do you get in what amount of time?
once it starts happening, i see them somewhat frequently.
(6) Does the application continue to run relatively sanely, or
does it fall over at this point?
my application just exits on the error.
if i re-run the application without rebooting postgres, it seems
that i will get the error sooner than the first run. (but it might
be just a matter of luck)
If your application hits this again, could you check pg_stat_activity
and pg_locks and see if any SIReadLock entries are lingering after
the owning transaction and all overlapping transactions are
completed? If anything is lingering between runs of your
application, it *should* show up in one or the other of these.
(7) The message hint would help pin it down, or a stack trace at
the point of the error would help more. Is it possible to get
either? Looking over the code, it appears that the only places
that SSI could generate that error, it would cancel that
transaction with the hint "You might need to increase
max_pred_locks_per_transaction." and otherwise allow normal
processing.no message hints. these errors are not generated by SSI code,
at least directly.
Right, that's because we were using HASH_ENTER instead of
HASH_ENTER_NULL. I've posted a patch which should correct that.
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could
grab more than its share and starve another which is staying below
its "maximum". I'll take a look at the possibility of adding a
warning or some such when an HTAB expands past its maximum size.
I see from your later post that you are running with this patch. Has
that reported anything yet?
Thanks,
-Kevin
hi,
(6) Does the application continue to run relatively sanely, or
does it fall over at this point?
my application just exits on the error.
if i re-run the application without rebooting postgres, it seems
that i will get the error sooner than the first run. (but it might
be just a matter of luck)
If your application hits this again, could you check pg_stat_activity
and pg_locks and see if any SIReadLock entries are lingering after
the owning transaction and all overlapping transactions are
completed? If anything is lingering between runs of your
application, it *should* show up in one or the other of these.
this is 71ac48fd9cebd3d2a873635a04df64096c981f73 with your two patches.
this psql session was the only activity to the server at this point.
hoge=# select * from pg_stat_activity;
-[ RECORD 1 ]----+--------------------------------
datid | 16384
datname | hoge
procpid | 7336
usesysid | 10
usename | takashi
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2011-03-26 12:28:21.882226+09
xact_start | 2011-03-28 11:55:19.300027+09
query_start | 2011-03-28 11:55:19.300027+09
waiting | f
current_query | select * from pg_stat_activity;
hoge=# select count(*) from pg_locks where mode='SIReadLock';
-[ RECORD 1 ]
count | 7057
hoge=# select locktype,count(*) from pg_locks group by locktype;
-[ RECORD 1 ]--------
locktype | virtualxid
count | 1
-[ RECORD 2 ]--------
locktype | relation
count | 1
-[ RECORD 3 ]--------
locktype | tuple
count | 7061
hoge=#
(7) The message hint would help pin it down, or a stack trace at
the point of the error would help more. Is it possible to get
either? Looking over the code, it appears that the only places
that SSI could generate that error, it would cancel that
transaction with the hint "You might need to increase
max_pred_locks_per_transaction." and otherwise allow normal
processing.
no message hints. these errors are not generated by SSI code,
at least directly.
Right, that's because we were using HASH_ENTER instead of
HASH_ENTER_NULL. I've posted a patch which should correct that.
sure, with your patch it seems that they turned into different ones.
PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_MESSAGE_HINT: You might need to increase max_pred_locks_per_transaction.
PG_DIAG_SOURCE_FILE: predicate.c
PG_DIAG_SOURCE_LINE: 2049
PG_DIAG_SOURCE_FUNCTION: CreatePredicateLock
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could
grab more than its share and starve another which is staying below
its "maximum". I'll take a look at the possibility of adding a
warning or some such when an HTAB expands past its maximum size.
I see from your later post that you are running with this patch. Has
that reported anything yet?
i got nothing except the following one. (in the server log)
WARNING: hash table "ShmemIndex" has more entries than expected
DETAIL: The maximum was set to 32 on creation.
YAMAMOTO Takashi
Thanks,
-Kevin
YAMAMOTO Takashi wrote:
this psql session was the only activity to the server at this
point.
[no residual SIReadLock]
Right, that's because we were using HASH_ENTER instead of
HASH_ENTER_NULL. I've posted a patch which should correct that.
sure, with your patch it seems that they turned into different
ones.
PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_MESSAGE_HINT: You might need to increase
max_pred_locks_per_transaction.
PG_DIAG_SOURCE_FILE: predicate.c
PG_DIAG_SOURCE_LINE: 2049
PG_DIAG_SOURCE_FUNCTION: CreatePredicateLock
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could
grab more than its share and starve another which is staying
below its "maximum". I'll take a look at the possibility of
adding a warning or some such when an HTAB expands past its
maximum size.
I see from your later post that you are running with this patch.
Has that reported anything yet?
i got nothing except the following one. (in the server log)
WARNING: hash table "ShmemIndex" has more entries than expected
DETAIL: The maximum was set to 32 on creation.
That doesn't seem likely to be causing the problem, but maybe the
allocations for that should be bumped a bit.
These results suggest that there is some sort of a leak in the
cleanup of the PredicateLockTargetHash HTAB entries. Will look into
it.
Thanks again.
-Kevin
hi,
[no residual SIReadLock]
i read it as there are many (7057) SIReadLocks somehow leaked.
am i wrong?
YAMAMOTO Takashi
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
[no residual SIReadLock]
i read it as there are many (7057) SIReadLocks somehow leaked.
am i wrong?
No, I am. Could you send the full SELECT * of pg_locks when this is
manifest? (Probably best to do that off-list.)
-Kevin
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
hoge=# select locktype,count(*) from pg_locks group by locktype;
-[ RECORD 1 ]--------
locktype | virtualxid
count | 1
-[ RECORD 2 ]--------
locktype | relation
count | 1
-[ RECORD 3 ]--------
locktype | tuple
count | 7061
I've stared at the code for hours and have only come up with one
race condition which can cause this, although the window is so small
it's hard to believe that you would get this volume of orphaned
locks. I'll keep looking, but if you could try this to see if it
has a material impact, that would be great.
I am very sure this patch is needed and that it is safe. It moves a
LWLockAcquire statement up to cover the setup for the loop that it
already covers. It also includes a fix to a comment that got missed
when we switched from the pointer between lock targets to
duplicating the locks.
-Kevin
Attachments:
ssi-old-tuple-locks.patchtext/plain; name=ssi-old-tuple-locks.patchDownload
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 1755,1763 **** CoarserLockCovers(const PREDICATELOCKTARGETTAG *newtargettag)
}
/*
! * Check whether both the list of related predicate locks and the pointer to
! * a prior version of the row (if this is a tuple lock target) are empty for
! * a predicate lock target, and remove the target if they are.
*/
static void
RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
--- 1755,1762 ----
}
/*
! * Check whether the list of related predicate locks is empty for a
! * predicate lock target, and remove the target if it is.
*/
static void
RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
***************
*** 3120,3130 **** ClearOldPredicateLocks(void)
/*
* Loop through predicate locks on dummy transaction for summarized data.
*/
predlock = (PREDICATELOCK *)
SHMQueueNext(&OldCommittedSxact->predicateLocks,
&OldCommittedSxact->predicateLocks,
offsetof(PREDICATELOCK, xactLink));
- LWLockAcquire(SerializablePredicateLockListLock, LW_SHARED);
while (predlock)
{
PREDICATELOCK *nextpredlock;
--- 3119,3129 ----
/*
* Loop through predicate locks on dummy transaction for summarized data.
*/
+ LWLockAcquire(SerializablePredicateLockListLock, LW_SHARED);
predlock = (PREDICATELOCK *)
SHMQueueNext(&OldCommittedSxact->predicateLocks,
&OldCommittedSxact->predicateLocks,
offsetof(PREDICATELOCK, xactLink));
while (predlock)
{
PREDICATELOCK *nextpredlock;
On 31.03.2011 16:31, Kevin Grittner wrote:
I've stared at the code for hours and have only come up with one
race condition which can cause this, although the window is so small
it's hard to believe that you would get this volume of orphaned
locks. I'll keep looking, but if you could try this to see if it
has a material impact, that would be great.
I am very sure this patch is needed and that it is safe. It moves a
LWLockAcquire statement up to cover the setup for the loop that it
already covers. It also includes a fix to a comment that got missed
when we switched from the pointer between lock targets to
duplicating the locks.
Ok, committed.
Did we get anywhere with the sizing of the various shared memory
structures? Did we find the cause of the "out of shared memory" warnings?
Would it help if we just pre-allocated all the shared memory hash tables
and didn't allow them to grow? It's bizarre that the hash table that
requests the slack shared memory space first gets it, and it can't be
reused for anything else without a server restart. I feel it would be
better to not allow the tables to grow, so that you get consistent
behavior across restarts.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Did we get anywhere with the sizing of the various shared memory
structures? Did we find the cause of the "out of shared memory"
warnings?
The patch you just committed is related to that. Some tuple locks
for summarized transactions were not getting cleaned up. I found
one access to the list not protected by the appropriate LW lock,
which is what this patch fixed. I'm not satisfied that was the only
issue, though; I'm still looking.
Would it help if we just pre-allocated all the shared memory hash
tables and didn't allow them to grow?
I've been thinking that it might be wise.
It's bizarre that the hash table that requests the slack shared
memory space first gets it, and it can't be reused for anything
else without a server restart. I feel it would be better to not
allow the tables to grow, so that you get consistent behavior
across restarts.
Agreed. I think it was OK in prior releases because there was just
one HTAB in shared memory doing this. With multiple such tables, it
doesn't seem sane to allow unbounded lazy grabbing of the space this
way. The only thing I've been on the fence about is whether it
makes more sense to allocate it all up front or to continue to allow
incremental allocation but set a hard limit on the number of entries
allocated for each shared memory HTAB. Is there a performance-
related reason to choose one path or the other?
-Kevin
On Thu, Mar 31, 2011 at 11:06:30AM -0500, Kevin Grittner wrote:
The only thing I've been on the fence about is whether it
makes more sense to allocate it all up front or to continue to allow
incremental allocation but set a hard limit on the number of entries
allocated for each shared memory HTAB. Is there a performance-
related reason to choose one path or the other?
Seems like it would be marginally better to allocate it up front -- then
you don't have the cost of having to split buckets later as it grows.
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote:
On Thu, Mar 31, 2011 at 11:06:30AM -0500, Kevin Grittner wrote:
The only thing I've been on the fence about is whether it
makes more sense to allocate it all up front or to continue to allow
incremental allocation but set a hard limit on the number of entries
allocated for each shared memory HTAB. Is there a performance-
related reason to choose one path or the other?
Seems like it would be marginally better to allocate it up front --
then you don't have the cost of having to split buckets later as it
grows.
The attached patch should cover that.
-Kevin
Attachments:
htab-alloc.patchtext/plain; name=htab-alloc.patchDownload
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 281,295 **** InitLocks(void)
{
HASHCTL info;
int hash_flags;
! long init_table_size,
! max_table_size;
/*
* Compute init/max size to request for lock hashtables. Note these
* calculations must agree with LockShmemSize!
*/
max_table_size = NLOCKENTS();
- init_table_size = max_table_size / 2;
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
--- 281,293 ----
{
HASHCTL info;
int hash_flags;
! long max_table_size;
/*
* Compute init/max size to request for lock hashtables. Note these
* calculations must agree with LockShmemSize!
*/
max_table_size = NLOCKENTS();
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
***************
*** 303,316 **** InitLocks(void)
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
LockMethodLockHash = ShmemInitHash("LOCK hash",
! init_table_size,
max_table_size,
&info,
hash_flags);
/* Assume an average of 2 holders per lock */
max_table_size *= 2;
- init_table_size *= 2;
/*
* Allocate hash table for PROCLOCK structs. This stores
--- 301,313 ----
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
LockMethodLockHash = ShmemInitHash("LOCK hash",
! max_table_size,
max_table_size,
&info,
hash_flags);
/* Assume an average of 2 holders per lock */
max_table_size *= 2;
/*
* Allocate hash table for PROCLOCK structs. This stores
***************
*** 323,329 **** InitLocks(void)
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
! init_table_size,
max_table_size,
&info,
hash_flags);
--- 320,326 ----
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
! max_table_size,
max_table_size,
&info,
hash_flags);
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 959,966 **** InitPredicateLocks(void)
{
HASHCTL info;
int hash_flags;
! long init_table_size,
! max_table_size;
Size requestSize;
bool found;
--- 959,965 ----
{
HASHCTL info;
int hash_flags;
! long max_table_size;
Size requestSize;
bool found;
***************
*** 969,975 **** InitPredicateLocks(void)
* Note these calculations must agree with PredicateLockShmemSize!
*/
max_table_size = NPREDICATELOCKTARGETENTS();
- init_table_size = max_table_size / 2;
/*
* Allocate hash table for PREDICATELOCKTARGET structs. This stores
--- 968,973 ----
***************
*** 983,996 **** InitPredicateLocks(void)
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
! init_table_size,
max_table_size,
&info,
hash_flags);
/* Assume an average of 2 xacts per target */
max_table_size *= 2;
- init_table_size *= 2;
/*
* Reserve an entry in the hash table; we use it to make sure there's
--- 981,993 ----
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
! max_table_size,
max_table_size,
&info,
hash_flags);
/* Assume an average of 2 xacts per target */
max_table_size *= 2;
/*
* Reserve an entry in the hash table; we use it to make sure there's
***************
*** 1014,1020 **** InitPredicateLocks(void)
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
PredicateLockHash = ShmemInitHash("PREDICATELOCK hash",
! init_table_size,
max_table_size,
&info,
hash_flags);
--- 1011,1017 ----
hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
PredicateLockHash = ShmemInitHash("PREDICATELOCK hash",
! max_table_size,
max_table_size,
&info,
hash_flags);
On 31.03.2011 21:23, Kevin Grittner wrote:
Dan Ports<drkp@csail.mit.edu> wrote:
On Thu, Mar 31, 2011 at 11:06:30AM -0500, Kevin Grittner wrote:
The only thing I've been on the fence about is whether it
makes more sense to allocate it all up front or to continue to allow
incremental allocation but set a hard limit on the number of entries
allocated for each shared memory HTAB. Is there a performance-
related reason to choose one path or the other?
Seems like it would be marginally better to allocate it up front --
then you don't have the cost of having to split buckets later as it
grows.
The attached patch should cover that.
That's not enough. The hash tables can grow beyond the maximum size you
specify in ShmemInitHash. It's just a hint to size the directory within
the hash table.
We'll need to teach dynahash not to allocate any more entries after the
preallocation. A new HASH_NO_GROW flag to hash_create() seems like a
suitable interface.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
That's not enough. The hash tables can grow beyond the maximum
size you specify in ShmemInitHash. It's just a hint to size the
directory within the hash table.
We'll need to teach dynahash not to allocate any more entries
after the preallocation. A new HASH_NO_GROW flag to hash_create()
seems like a suitable interface.
OK. If we're doing that, is it worth taking a look at the "safety
margin" added to the size calculations, and try to make the
calculations more accurate?
Would you like me to code a patch for this?
There are a couple other patches which I think should be applied, if
you have time to deal with them.
There was a fix for an assertion failure here:
http://archives.postgresql.org/pgsql-bugs/2011-03/msg00352.php
It just rechecks some conditions after dropping a shared LW lock and
acquiring an exclusive LW lock. Without this recheck there is a
window for the other transaction involved in the conflict to also
detect a conflict and flag first, leading to the assertion.
There's another area I need to review near there, but that is
orthogonal.
There is a patch to improve out-of-shared-memory error handling and
reporting here:
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01170.php
This one is due to my earlier failure to spot the difference between
HASH_ENTER and HASH_ENTER_NULL. For a shared memory HTAB the
HASH_ENTER_NULL will return a null if unable to allocate the entry,
while HASH_ENTER will ereport ERROR with a generic message. This
patch leaves HASH_ENTER on the "can't happen" cases, but replaces
the ereport ERROR after it with an Assert because it's something
which should never happen. The other cases are changed to
HASH_ENTER_NULL so that the error message with the hint will be used
instead of the more generic message.
These patches are both in direct response to problems found during
testing by YAMAMOTO Takashi.
-Kevin
I think I see what is going on now. We are sometimes failing to set the
commitSeqNo correctly on the lock. In particular, if a lock assigned to
OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be
cleared.
The attached patch corrects this:
TransferPredicateLocksToNewTarget should initialize a new lock
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.
Also, CreatePredicateLock should initialize commitSeqNo to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)
I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.
Takashi, does this patch fix your problem with leaked SIReadLocks?
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Attachments:
patch-commitseqno.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index f6e49fe..d166da3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2067,7 +2067,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
SHMQueueInsertBefore(&(target->predicateLocks), &(lock->targetLink));
SHMQueueInsertBefore(&(sxact->predicateLocks),
&(lock->xactLink));
- lock->commitSeqNo = 0;
+ lock->commitSeqNo = InvalidSerCommitSeqNo;
}
LWLockRelease(partitionLock);
@@ -2500,6 +2500,7 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
SHM_QUEUE *predlocktargetlink;
PREDICATELOCK *nextpredlock;
PREDICATELOCK *newpredlock;
+ SerCommitSeqNo oldCommitSeqNo = oldpredlock->commitSeqNo;
predlocktargetlink = &(oldpredlock->targetLink);
nextpredlock = (PREDICATELOCK *)
@@ -2544,8 +2545,17 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
&(newpredlock->targetLink));
SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
&(newpredlock->xactLink));
- newpredlock->commitSeqNo = InvalidSerCommitSeqNo;
+ newpredlock->commitSeqNo = oldCommitSeqNo;
}
+ else
+ {
+ if (newpredlock->commitSeqNo < oldCommitSeqNo)
+ newpredlock->commitSeqNo = oldCommitSeqNo;
+ }
+
+ Assert(newpredlock->commitSeqNo != 0);
+ Assert((newpredlock->commitSeqNo == InvalidSerCommitSeqNo)
+ || (newpredlock->tag.myXact == OldCommittedSxact));
oldpredlock = nextpredlock;
}
@@ -3130,6 +3140,8 @@ ClearOldPredicateLocks(void)
offsetof(PREDICATELOCK, xactLink));
LWLockAcquire(SerializableXactHashLock, LW_SHARED);
+ Assert(predlock->commitSeqNo != 0);
+ Assert(predlock->commitSeqNo != InvalidSerCommitSeqNo);
canDoPartialCleanup = (predlock->commitSeqNo <= PredXact->CanPartialClearThrough);
LWLockRelease(SerializableXactHashLock);
@@ -3254,6 +3266,8 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
errhint("You might need to increase max_pred_locks_per_transaction.")));
if (found)
{
+ Assert(predlock->commitSeqNo != 0);
+ Assert(predlock->commitSeqNo != InvalidSerCommitSeqNo);
if (predlock->commitSeqNo < sxact->commitSeqNo)
predlock->commitSeqNo = sxact->commitSeqNo;
}
hi,
I think I see what is going on now. We are sometimes failing to set the
commitSeqNo correctly on the lock. In particular, if a lock assigned to
OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be
cleared.
The attached patch corrects this:
TransferPredicateLocksToNewTarget should initialize a new lock
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.
Also, CreatePredicateLock should initialize commitSeqNo to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)
I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.
Takashi, does this patch fix your problem with leaked SIReadLocks?
i'm currently running bf6848bc8c82e82f857d48185554bc3e6dcf1013 with this
patch applied. i haven't seen the symptom yet. i'll keep it running for
a while.
btw, i've noticed the following message in the server log. is it normal?
LOG: could not truncate directory "pg_serial": apparent wraparound
YAMAMOTO Takashi
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
LOG: could not truncate directory "pg_serial": apparent
wraparound
Did you get a warning with this text?:
memory for serializable conflict tracking is nearly exhausted
If not, there's some sort of cleanup bug to fix in the predicate
locking's use of SLRU. It may be benign, but we won't really know
until we find it. I'm investigating.
-Kevin
I wrote:
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
LOG: could not truncate directory "pg_serial": apparent
wraparound
there's some sort of cleanup bug to fix in the predicate
locking's use of SLRU. It may be benign, but we won't really know
until we find it. I'm investigating.
I'm pretty sure I found it. When the number of serializable
transactions which need to be tracked gets high enough to push
things to the SLRU summarization, and then drops back down, we
haven't been truncating the head page of the active SLRU region
because if we go back into SLRU summarization that saves us from
zeroing the page again. The problem is that if we don't go back
into SLRU summarization for a long time, we might wrap around to
where SLRU is upset that our head is chasing our tail. This seems
like a bigger problem than we were trying to solve by not truncating
the page.
The issue is complicated a little bit by the fact that the SLRU API
has you specify the *page* for the truncation point, but silently
ignores the request unless the page is in a segment which is past a
segment in use. So adding the number of pages per SLRU segment to
the head page position should do the right thing. But it's all
weird enough that I felt it needed a bit of commenting.
While I was there I noticed that we're doing the unnecessary
flushing (so people can glean information about the SLRU activity
from watching the disk files) right before truncating. I switched
the truncation to come before the flushing, since flushing pages to
a file and then deleting that file didn't seem productive.
Attached find a patch which modifies one line of code, switches the
order of two lines of code, and adds comments.
I will add this to the open items for 9.1. Thanks again to YAMAMOTO
Takashi for his rigorous testing.
-Kevin
Attachments:
ssi-slru-truncate-fix.patchtext/plain; name=ssi-slru-truncate-fix.patchDownload
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 920,945 **** CheckPointPredicate(void)
else
{
/*
! * The SLRU is no longer needed. Truncate everything but the last
! * page. We don't dare to touch the last page in case the SLRU is
! * taken back to use, and the new tail falls on the same page.
*/
! tailPage = oldSerXidControl->headPage;
oldSerXidControl->headPage = -1;
}
LWLockRelease(OldSerXidLock);
/*
* Flush dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
*/
SimpleLruFlush(OldSerXidSlruCtl, true);
-
- /* Truncate away pages that are no longer required */
- SimpleLruTruncate(OldSerXidSlruCtl, tailPage);
}
/*------------------------------------------------------------------------*/
--- 920,957 ----
else
{
/*
! * The SLRU is no longer needed. Truncate everything. If we try to
! * leave the head page around to avoid re-zeroing it, we might not
! * use the SLRU again until we're past the wrap-around point, which
! * makes SLRU unhappy.
! *
! * While the API asks you to specify truncation by page, it silently
! * ignores the request unless the specified page is in a segment
! * past some allocated portion of the SLRU. We don't care which
! * page in a later segment we hit, so just add the number of pages
! * per segment to the head page to land us *somewhere* in the next
! * segment.
*/
! tailPage = oldSerXidControl->headPage + SLRU_PAGES_PER_SEGMENT;
oldSerXidControl->headPage = -1;
}
LWLockRelease(OldSerXidLock);
+ /* Truncate away pages that are no longer required */
+ SimpleLruTruncate(OldSerXidSlruCtl, tailPage);
+
/*
* Flush dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
+ *
+ * We're doing this after the truncation to avoid writing pages right
+ * before deleting the file in which they sit, which would be completely
+ * pointless.
*/
SimpleLruFlush(OldSerXidSlruCtl, true);
}
/*------------------------------------------------------------------------*/
hi,
YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp> wrote:
LOG: could not truncate directory "pg_serial": apparent
wraparound
Did you get a warning with this text?:
memory for serializable conflict tracking is nearly exhausted
there is no such warning near the above "apparent wraparound" record.
not sure if it was far before the record as i've lost the older log files.
YAMAMOTO Takashi
If not, there's some sort of cleanup bug to fix in the predicate
locking's use of SLRU. It may be benign, but we won't really know
until we find it. I'm investigating.
-Kevin
hi,
hi,
I think I see what is going on now. We are sometimes failing to set the
commitSeqNo correctly on the lock. In particular, if a lock assigned to
OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be
cleared.
The attached patch corrects this:
TransferPredicateLocksToNewTarget should initialize a new lock
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.
Also, CreatePredicateLock should initialize commitSeqNo to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)
I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.
Takashi, does this patch fix your problem with leaked SIReadLocks?
i'm currently running bf6848bc8c82e82f857d48185554bc3e6dcf1013 with this
patch applied. i haven't seen the symptom yet. i'll keep it running for
a while.
i haven't seen the symptom since then. so i guess it was fixed by
your patch. thanks!
YAMAMOTO Takashi
btw, i've noticed the following message in the server log. is it normal?
LOG: could not truncate directory "pg_serial": apparent wraparound
YAMAMOTO Takashi
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
On 31.03.2011 22:06, Kevin Grittner wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
That's not enough. The hash tables can grow beyond the maximum
size you specify in ShmemInitHash. It's just a hint to size the
directory within the hash table.
We'll need to teach dynahash not to allocate any more entries
after the preallocation. A new HASH_NO_GROW flag to hash_create()
seems like a suitable interface.
OK. If we're doing that, is it worth taking a look at the "safety
margin" added to the size calculations, and try to make the
calculations more accurate?
Would you like me to code a patch for this?
I finally got around to look at this. Attached patch adds a
HASH_FIXED_SIZE flag, which disables the allocation of new entries after
the initial allocation. I believe we have consensus to make the
predicate lock hash tables fixed-size, so that there's no competition of
the slack shmem space between predicate lock structures and the regular
lock manager.
I also noticed that there's a few hash_search(HASH_ENTER) calls in
predicate.c followed by check for a NULL result. But with HASH_ENTER,
hash_search never returns NULL, it throws an "out of shared memory"
error internally. I changed those calls to use HASH_ENTER_NULL, so you
now get the intended error message with the hint to raise
max_pred_locks_per_transaction.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
fixed-size-shmem-3.patchtext/x-diff; name=fixed-size-shmem-3.patchDownload
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 401acdb..6ff41fc 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -959,17 +959,15 @@ InitPredicateLocks(void)
{
HASHCTL info;
int hash_flags;
- long init_table_size,
- max_table_size;
+ long max_table_size;
Size requestSize;
bool found;
/*
- * Compute init/max size to request for predicate lock target hashtable.
+ * Compute size of predicate lock target hashtable.
* Note these calculations must agree with PredicateLockShmemSize!
*/
max_table_size = NPREDICATELOCKTARGETENTS();
- init_table_size = max_table_size / 2;
/*
* Allocate hash table for PREDICATELOCKTARGET structs. This stores
@@ -980,17 +978,16 @@ InitPredicateLocks(void)
info.entrysize = sizeof(PREDICATELOCKTARGET);
info.hash = tag_hash;
info.num_partitions = NUM_PREDICATELOCK_PARTITIONS;
- hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION | HASH_FIXED_SIZE);
PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
- init_table_size,
+ max_table_size,
max_table_size,
&info,
hash_flags);
/* Assume an average of 2 xacts per target */
max_table_size *= 2;
- init_table_size *= 2;
/*
* Reserve an entry in the hash table; we use it to make sure there's
@@ -1011,18 +1008,17 @@ InitPredicateLocks(void)
info.entrysize = sizeof(PREDICATELOCK);
info.hash = predicatelock_hash;
info.num_partitions = NUM_PREDICATELOCK_PARTITIONS;
- hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_PARTITION | HASH_FIXED_SIZE);
PredicateLockHash = ShmemInitHash("PREDICATELOCK hash",
- init_table_size,
+ max_table_size,
max_table_size,
&info,
hash_flags);
/*
- * Compute init/max size to request for serializable transaction
- * hashtable. Note these calculations must agree with
- * PredicateLockShmemSize!
+ * Compute size for serializable transaction hashtable.
+ * Note these calculations must agree with PredicateLockShmemSize!
*/
max_table_size = (MaxBackends + max_prepared_xacts);
@@ -1093,7 +1089,7 @@ InitPredicateLocks(void)
info.keysize = sizeof(SERIALIZABLEXIDTAG);
info.entrysize = sizeof(SERIALIZABLEXID);
info.hash = tag_hash;
- hash_flags = (HASH_ELEM | HASH_FUNCTION);
+ hash_flags = (HASH_ELEM | HASH_FUNCTION | HASH_FIXED_SIZE);
SerializableXidHash = ShmemInitHash("SERIALIZABLEXID hash",
max_table_size,
@@ -2045,7 +2041,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
target = (PREDICATELOCKTARGET *)
hash_search_with_hash_value(PredicateLockTargetHash,
targettag, targettaghash,
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!target)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -2060,7 +2056,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
lock = (PREDICATELOCK *)
hash_search_with_hash_value(PredicateLockHash, &locktag,
PredicateLockHashCodeFromTargetHashCode(&locktag, targettaghash),
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!lock)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -3251,7 +3247,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
predlock = hash_search_with_hash_value(PredicateLockHash, &tag,
PredicateLockHashCodeFromTargetHashCode(&tag,
targettaghash),
- HASH_ENTER, &found);
+ HASH_ENTER_NULL, &found);
if (!predlock)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index f718785..d902729 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -160,6 +160,7 @@ struct HTAB
MemoryContext hcxt; /* memory context if default allocator used */
char *tabname; /* table name (for error messages) */
bool isshared; /* true if table is in shared memory */
+ bool isfixed; /* if true, don't enlarge */
/* freezing a shared table isn't allowed, so we can keep state here */
bool frozen; /* true = no more inserts allowed */
@@ -435,6 +436,8 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
errmsg("out of memory")));
}
+ if (flags & HASH_FIXED_SIZE)
+ hashp->isfixed = true;
return hashp;
}
@@ -1334,6 +1337,9 @@ element_alloc(HTAB *hashp, int nelem)
HASHELEMENT *prevElement;
int i;
+ if (hashp->isfixed)
+ return false;
+
/* Each element has a HASHELEMENT header plus user data. */
elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctlv->entrysize);
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bca34d2..ce54c0a 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -92,6 +92,7 @@ typedef struct HASHCTL
#define HASH_CONTEXT 0x200 /* Set memory allocation context */
#define HASH_COMPARE 0x400 /* Set user defined comparison function */
#define HASH_KEYCOPY 0x800 /* Set user defined key-copying function */
+#define HASH_FIXED_SIZE 0x1000 /* Initial size is a hard limit */
/* max_dsize value to indicate expansible directory */
On 11.04.2011 11:33, Heikki Linnakangas wrote:
I also noticed that there's a few hash_search(HASH_ENTER) calls in
predicate.c followed by check for a NULL result. But with HASH_ENTER,
hash_search never returns NULL, it throws an "out of shared memory"
error internally. I changed those calls to use HASH_ENTER_NULL, so you
now get the intended error message with the hint to raise
max_pred_locks_per_transaction.
Oops, those were already fixed. Never mind.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 11.04.2011 11:33, Heikki Linnakangas wrote:
On 31.03.2011 22:06, Kevin Grittner wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
That's not enough. The hash tables can grow beyond the maximum
size you specify in ShmemInitHash. It's just a hint to size the
directory within the hash table.
We'll need to teach dynahash not to allocate any more entries
after the preallocation. A new HASH_NO_GROW flag to hash_create()
seems like a suitable interface.
OK. If we're doing that, is it worth taking a look at the "safety
margin" added to the size calculations, and try to make the
calculations more accurate?
Would you like me to code a patch for this?
I finally got around to look at this. Attached patch adds a
HASH_FIXED_SIZE flag, which disables the allocation of new entries after
the initial allocation. I believe we have consensus to make the
predicate lock hash tables fixed-size, so that there's no competition of
the slack shmem space between predicate lock structures and the regular
lock manager.
Ok, committed that.
I left the safety margins in the size calculations alone for now.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 03.04.2011 09:16, Dan Ports wrote:
I think I see what is going on now. We are sometimes failing to set the
commitSeqNo correctly on the lock. In particular, if a lock assigned to
OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be
cleared.
The attached patch corrects this:
TransferPredicateLocksToNewTarget should initialize a new lock
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.
Also, CreatePredicateLock should initialize commitSeqNo to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)
I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.
Thanks, committed this.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I finally got around to look at this. Attached patch adds a
HASH_FIXED_SIZE flag, which disables the allocation of new entries
after the initial allocation. I believe we have consensus to make
the predicate lock hash tables fixed-size, so that there's no
competition of the slack shmem space between predicate lock
structures and the regular lock manager.
OK, I can see why you preferred this -- the existing exchange of
slack space with the HW lock tables remains unchanged this way, and
only the new tables for predicate locking have the stricter limits.
This makes it very unlikely to break current apps which might be
unknowingly relying on existing allocation behavior in the HW
locking area. Smart.
I hadn't picked up on your intent that the new flag would only be
used for the new tables, which is why it wasn't quite making sense
to me before.
Thanks!
-Kevin