Index corruption with CREATE INDEX CONCURRENTLY
Hello All,
While stress testing WARM, I encountered a data consistency problem. It
happens when an index is built concurrently. I initially thought it to be a
WARM-induced bug, and it took me a significant amount of investigation to
finally conclude that it's a bug in master. In fact, we tested all the way
back to the 9.2 release and the bug exists in all releases.
In my test setup, I keep a UNIQUE index on an "aid" column and then have
many other "aid[1-4]" columns on which indexes are built and dropped. The
script then updates a randomly selected row using one of the aid[1-4]
columns. These columns themselves are also updated during the tests to force
WARM updates, but within a narrow range of non-overlapping values, so we can
always find the unique row using some kind of range condition. The
reproduction case is much simpler and I'll explain it below.
In the reproduction scenario (attached), it happens fairly quickly, maybe
after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it
does not happen for you even after a few rounds. Just for the record, I
haven't checked past reports to correlate if the bug explains any of the
index corruption issues we might have received.
To give a short summary of how CIC works with HOT, it consists of three
phases:
1. Create index entry in the catalog, mark it not ready for inserts, commit
the transaction, start new transaction and then wait for all potential lock
holders on the table to end
2. Take an MVCC snapshot, build the index, mark the index ready for inserts,
commit and then once again wait for potential lock holders on the table to
end
3. Do a validation phase where we look at MVCC-visible tuples and add
missing entries to the index. We only look at the root line pointer of each
tuple while deciding whether a particular tuple already exists in the index
or not. IOW we look at HOT chains and not tuples themselves.
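As a rough catalog-level illustration of those phases (my addition, not part of the original report): after phase 1 the index's pg_index row exists with indisready = false, after phase 2 indisready becomes true, and only after the phase 3 validation does indisvalid become true. Something like the following, run against the repro table used later in this mail, shows the flags moving through the phases:

-- Illustrative only; the table name is taken from the attached repro script
SELECT i.indexrelid::regclass AS index_name,
       i.indisready,   -- set once phase 2 marks the index ready for inserts
       i.indisvalid    -- set only after phase 3 validation completes
FROM pg_index i
WHERE i.indrelid = 'cic_tab_accounts'::regclass;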
The interesting cases are phases 1 and 2. Phase 2 works on the premise
that every transaction started after we finished waiting for all existing
transactions will definitely see the new index. All these new transactions
then look at the columns used by the new index and consider them while
deciding the HOT-safety of updates. Now, for reasons which I don't fully
understand, there exists a path where a transaction starts after CIC waited
and took its snapshot at the start of the second phase, but still fails to
see the new index. Or maybe it sees the index, but fails to update its
rd_indexattr list to take into account the columns of the new index. That
leads to wrong decisions and we end up with a broken HOT chain, something
the CIC algorithm is not prepared to handle. This in turn leaves the new
index missing entries for the updated tuples, i.e. the index may have aid1 =
10, but the value in the heap could be aid1 = 11.
Based on my investigation so far and the evidence that I've collected, what
probably happens is that after a relcache invalidation arrives at the new
transaction, it recomputes the rd_indexattr but based on the old, cached
rd_indexlist. So the rd_indexattr does not include the new columns. Later
rd_indexlist is updated, but the rd_indexattr remains what it was.
There is definitely evidence of this sequence of events (collected after
sprinkling the code with elog messages), but it's not yet clear to me why it
affects only a small percentage of transactions. One possibility is that it
depends on the stage at which an invalidation is received and processed
by the backend. I find it a bit confusing that even though rd_indexattr is
tightly coupled to rd_indexlist, we don't invalidate or recompute them
together. Shouldn't we do that? For example, in the attached patch we
invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
that rd_indexvalid is 0). This patch fixes the problem for me (I've only
tried master), but I am not sure if this is a correct fix.
Finally, here is the reproduction scenario. Run "cic_bug.sh" and observe
the output. The script runs a continuous UPDATE on the table while running
DROP INDEX and CREATE INDEX CONCURRENTLY on the aid1 column in a loop. As
soon as we build the index, we run a validation query such that rows are
fetched using an index-only scan (IOS) on the new index and compared against
the rows obtained via a SeqScan on the table. If things are OK, the rows
obtained via both methods should match, but every so often you'll see a row
or two differ. I must say that IOS itself is not the problem, but I had to
set things up that way to ensure that the index access does not have to go
to the heap. Otherwise, it would just return whatever is in the heap and
things would look fine. Maybe some index consistency check utility would be
handy to check for the corruption, but I think the query does the job for now.
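For concreteness, here is a hedged sketch of that validation step (my wording, not the exact query from cic_bug.sh; it assumes the table has been vacuumed so the visibility map allows a true index-only scan, otherwise the index fetch falls back to the heap and masks the problem):

-- 1. Values of aid1 as seen through the new index.
SET enable_seqscan = off;
SET enable_bitmapscan = off;
CREATE TEMP TABLE via_index AS SELECT aid1 FROM cic_tab_accounts;

-- 2. Values of aid1 as seen by scanning the heap directly.
SET enable_seqscan = on;
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
CREATE TEMP TABLE via_heap AS SELECT aid1 FROM cic_tab_accounts;
RESET ALL;

-- 3. Any difference between the two means the index and the heap disagree.
(SELECT aid1 FROM via_index EXCEPT ALL SELECT aid1 FROM via_heap)
UNION ALL
(SELECT aid1 FROM via_heap EXCEPT ALL SELECT aid1 FROM via_index);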
Some additional server logs are attached too. I've filtered them to include
only three sessions: one (PID 73119) which is running CIC, a second (PID
73090) which processed the relcache invalidation correctly, and a third (PID
73091) which got it wrong. Specifically, look at line #77, where the snapshot
is obtained for the second phase:
2017-01-29 10:44:31.238 UTC [73119] [xid=0]LOG: CIC (pgb_a_aid1) building
first phase with snapshot (4670891:4670893)
At line 108, you'll see a new transaction with XID 4670908 still using the
old attribute list.
2017-01-29 10:44:31.889 UTC [73091] [xid=4670908]LOG: heap_update
(cic_tab_accounts), hot_attrs ((b 9))
The other session, which I included for comparison, sees the new index and
recomputes its rd_indexattr in time, and hence its update works as expected.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
invalidate_indexattr.patch
commit 1fc14744463594f361c76e04b17716574593da81
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Mon Jan 30 08:22:43 2017 +0530
Invalidate index attributes when index list is invalidated
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 4a053a3..728b388 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4355,6 +4355,13 @@ RelationGetIndexList(Relation relation)
return list_copy(relation->rd_indexlist);
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
+ /*
* We build the list we intend to return (in the caller's context) while
* doing the scan. After successfully completing the scan, we copy that
* list into the relcache entry. This avoids cache-context memory leakage
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
While stress testing WARM, I encountered a data consistency problem. [...]
I find it a bit confusing that even though rd_indexattr is tightly coupled
to rd_indexlist, we don't invalidate or recompute them together. Shouldn't
we do that?
During invalidation processing, we do destroy them together in
function RelationDestroyRelation(). So ideally, after invalidation
processing, both should be built again, but not sure why that is not
happening in this case.
For example, in the attached patch we
invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
that rd_indexvalid is 0).
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
I think setting directly to NULL will leak the memory.
This patch fixes the problem for me (I've only
tried master), but I am not sure if this is a correct fix.
I think it is difficult to say that the fix is correct unless there is a
clear explanation of what led to this problem. Another angle to
investigate is to find out when relation->rd_indexvalid is set to 0
for the particular session where you are seeing the problem. I could
see a few places where we set it to 0 and clear rd_indexlist,
but not rd_indexattr. I am not sure whether that has anything to do
with the problem you are seeing.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Based on my investigation so far and the evidence that I've collected,
what probably happens is that after a relcache invalidation arrives at the
new transaction, it recomputes the rd_indexattr but based on the old,
cached rd_indexlist. So the rd_indexattr does not include the new columns.
Later rd_indexlist is updated, but the rd_indexattr remains what it was.
I was kinda puzzled that this report did not get any attention, though it's
clearly a data corruption issue. Anyway, now I know why this is happening
and can reproduce it very easily via the debugger and two sessions.
The offending code is RelationGetIndexAttrBitmap(). Here is the sequence of
events:
Taking the same table as in the report:
CREATE TABLE cic_tab_accounts (
aid bigint UNIQUE,
abalance bigint,
aid1 bigint
);
1. Session S1 calls DROP INDEX pgb_a_aid1 and completes
2. Session S2 is in the process of rebuilding its rd_indexattr bitmap (because
of the previous relcache invalidation caused by the DROP INDEX). To be precise,
assume that the session has just finished rebuilding its index list at
4793 indexoidlist = RelationGetIndexList(relation);
Since DROP INDEX has just happened, that list contains only the remaining
unique index on aid.
3. S1 then starts CREATE INDEX CONCURRENTLY pgb_a_aid1 ON
cic_tab_accounts(aid1)
4. S1 creates the catalog entry for the new index, commits the phase 1
transaction, builds the MVCC snapshot for the second phase, and also finishes
the initial index build
5. S2 now proceeds. Notice that S2 had computed the index list based on
the old view, i.e. just one index.
The following comments in relcache.c are now significant:
4799 /*
4800 * Copy the rd_pkindex and rd_replidindex value computed by
4801 * RelationGetIndexList before proceeding. This is needed because a
4802 * relcache flush could occur inside index_open below, resetting the
4803 * fields managed by RelationGetIndexList. (The values we're computing
4804 * will still be valid, assuming that caller has a sufficient lock on
4805 * the relation.)
4806 */
That's precisely what goes wrong.
6. When index_open is called, the relcache invalidation generated by the first
transaction commit in CIC gets accepted and processed. As part of this
invalidation, rd_indexvalid is set to 0, which invalidates rd_indexlist too.
7. But S2 is currently using the index list obtained at step 2 above (which
has only the old index). It goes ahead and computes rd_indexattr based on just
the old index.
8. While the relcache invalidation processing at step 6 cleared rd_indexattr
(along with rd_indexvalid), it is set again at
4906 relation->rd_indexattr = bms_copy(indexattrs);
But what gets set at step 8 is based on the old view. There is no way
rd_indexattr will be invalidated until S2 receives another relcache
invalidation. This will happen when the CIC proceeds. But until then, every
update done by S2 will generate broken HOT chains, leading to data
corruption.
I can reproduce this entire scenario using gdb sessions. This also explains
why the patch I sent earlier helps to solve the problem.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee wrote:
I can reproduce this entire scenario using gdb sessions. This also explains
why the patch I sent earlier helps to solve the problem.
Ouch. Great detective work there.
I think it's quite possible that this bug explains some index errors,
such as primary keys (or unique indexes) that when recreated fail with
duplicate values: the first value inserted would get in via an UPDATE
during the CIC race window, and sometime later the second value is
inserted correctly. Because the first one is missing from the index,
the second one does not give rise to a dupe error upon insertion, but
they are of course both detected when the index is recreated.
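To make that symptom concrete, here is a hedged illustration (my example; the index name follows the default name of the unique constraint on the repro table, and the key value is made up): the second, duplicate insert succeeds because the first row never made it into the index, and the inconsistency only surfaces when the index is rebuilt.

-- Hypothetical: rebuilding the unique index trips over the duplicate
-- that slipped in during the CIC race window.
REINDEX INDEX cic_tab_accounts_aid_key;
-- ERROR:  could not create unique index "cic_tab_accounts_aid_key"
-- DETAIL:  Key (aid)=(42) is duplicated.
-- (hypothetical output)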
I'm going to study the bug a bit more, and put in some patch before the
upcoming minor tag on Monday.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 2, 2017 at 6:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
I think setting directly to NULL will leak the memory.
Good catch, thanks. Will fix.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 2, 2017 at 10:14 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I'm going to study the bug a bit more, and put in some patch before the
upcoming minor tag on Monday.
Looking at the history and some past discussions, it seems Tomas reported
a somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. Not exactly the same issue, but
related to relcache flush happening in index_open().
I wonder if we should just add recheck logic
in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
the end, we go back and start again from RelationGetIndexList(). Or is
there any code path that relies on finding the old information?
The following comment at the top of RelationGetIndexAttrBitmap() also seems
like a problem to me. CIC works with a lower strength lock and hence can
change the set of indexes even though the caller is holding RowExclusiveLock,
no? The comment seems to suggest otherwise.
4748 * Caller had better hold at least RowExclusiveLock on the target relation
4749 * to ensure that it has a stable set of indexes. This also makes it safe
4750 * (deadlock-free) for us to take locks on the relation's indexes.
I've attached a revised patch based on these thoughts. It looks a bit more
invasive than the earlier one-liner, but looks better to me.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
invalidate_indexattr_v2.patch
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..f06d345 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,11 @@ RelationGetIndexPredicate(Relation relation)
* Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
* we can include system attributes (e.g., OID) in the bitmap representation.
*
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * While all callers should at least RowExclusiveLock on the target relation,
+ * we still can't guarantee a stable set of indexes because CREATE INDEX
+ * CONCURRENTLY and DROP INDEX CONCURRENTLY can change set of indexes, without
+ * taking any conflicting lock. So we must be prepared to handle changes in the
+ * set of indexes and recompute bitmaps, when necessary.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
@@ -4787,6 +4789,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
if (!RelationGetForm(relation)->relhasindex)
return NULL;
+recheck:
/*
* Get cached list of index OIDs
*/
@@ -4798,15 +4801,16 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
/*
* Copy the rd_pkindex and rd_replidindex value computed by
- * RelationGetIndexList before proceeding. This is needed because a
- * relcache flush could occur inside index_open below, resetting the
- * fields managed by RelationGetIndexList. (The values we're computing
- * will still be valid, assuming that caller has a sufficient lock on
- * the relation.)
+ * RelationGetIndexList before proceeding. If a relcache flush could occur
+ * inside index_open below, resetting the fields managed by
+ * RelationGetIndexList, we compare the old and new values and recompute
+ * attribute maps again.
*/
relpkindex = relation->rd_pkindex;
relreplindex = relation->rd_replidindex;
+ Assert(relation->rd_indexvalid != 0);
+
/*
* For each index, add referenced attributes to indexattrs.
*
@@ -4882,6 +4886,30 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
list_free(indexoidlist);
+ /*
+ * If a relcache flush happened during index_open, we may have computed
+ * stale values of the bitmaps since we used the old information. Since all
+ * these bitmaps are closely tied with the index list as well as other
+ * information such as the primary index and the replica identity, we must
+ * recompute everything. Otherwise we could be end up with a wrongly cached
+ * values of these bitmaps.
+ *
+ * Note that a subsequent call to RelationGetIndexList() will reload the
+ * new list of indexes, but that does not invalidate the cached attribute
+ * bitmaps. So we handle that here itself.
+ */
+ if (relpkindex != relation->rd_pkindex ||
+ relreplindex != relation->rd_replidindex ||
+ relation->rd_indexvalid == 0)
+ {
+ bms_free(uindexattrs);
+ bms_free(pkindexattrs);
+ bms_free(idindexattrs);
+ bms_free(indexattrs);
+
+ goto recheck;
+ }
+
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
Pavan Deolasee wrote:
Looking at the history and some past discussions, it seems Tomas reported
somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. Not exactly the same issue, but
related to relcache flush happening in index_open().
I wonder if we should just add recheck logic
in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
the end, we go back and start again from RelationGetIndexList(). Or is
there any code path that relies on finding the old information?
Good find on that old report. It occurred to me to re-run Tomas' test
case with this second patch you proposed (the "ddl" test takes 11
minutes on my laptop), and lo and behold -- your "recheck" code enters
an infinite loop. Not good. I tried some more tricks that came to
mind, but I didn't get any good results. Pavan and I discussed it
further and he came up with the idea of returning the bitmapset we
compute, but if an invalidation occurs in index_open, simply do not
cache the bitmapsets so that next time this is called they are all
computed again. Patch attached.
This appears to have appeased both test_decoding's ddl test under
CLOBBER_CACHE_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug.
I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Here is a detailed reproducer for the CREATE INDEX CONCURRENTLY bug.
Line numbers are as of today's master; the comments are supposed to help
locate good spots in other branches. Given the use of gdb
breakpoints, there's no good way to integrate this with regression
tests, which is a pity because this stuff looks a little brittle to me,
and if it breaks it is hard to detect.
Prep:
DROP TABLE IF EXISTS testtab;
CREATE TABLE testtab (a int unique, b int, c int);
INSERT INTO testtab VALUES (1, 100, 10);
CREATE INDEX testindx ON testtab (b);
S1:
SELECT * FROM testtab;
UPDATE testtab SET b = b + 1 WHERE a = 1;
SELECT pg_backend_pid();
attach GDB to this session.
break relcache.c:4800
# right before entering the per-index loop
cont
S2:
SELECT pg_backend_pid();
DROP INDEX testindx;
attach GDB to this session
handle SIGUSR1 nostop
break indexcmds.c:666
# right after index_create
break indexcmds.c:790
# right before the second CommitTransactionCommand
cont
CREATE INDEX CONCURRENTLY testindx ON testtab (b);
-- this blocks in breakpoint 1. Leave it there.
S1:
UPDATE testtab SET b = b + 1 WHERE a = 1;
-- this blocks in breakpoint 1. Leave it there.
S2:
gdb -> cont
-- This blocks waiting for S1
S1:
gdb -> cont
-- S1 finishes the update. S2 awakens and blocks again in breakpoint 2.
S1:
-- Now run more UPDATEs:
UPDATE testtab SET b = b + 1 WHERE a = 1;
This produces a broken HOT chain.
This can be done several times; all these updates should be non-HOT
because of the index created by CIC, but they are marked HOT.
S2: "cont"
From this point onwards, updates are correct.
SELECT * FROM testtab;
SET enable_seqscan TO 0;
SET enable_bitmapscan TO 0;
SELECT * FROM testtab WHERE b between 100 and 110;
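As an optional extra check (not part of the recipe above; it assumes the pageinspect extension is available), you can look at the heap page directly and see that the updates were wrongly treated as HOT even though they changed the indexed column b:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- t_infomask2 bit 0x4000 is HEAP_HOT_UPDATED, bit 0x8000 is HEAP_ONLY_TUPLE.
-- With the bug, the updated row versions show these bits set despite the
-- update having modified the indexed column b.
SELECT lp,
       t_ctid,
       (t_infomask2 & 16384) <> 0 AS hot_updated,
       (t_infomask2 & 32768) <> 0 AS heap_only
FROM heap_page_items(get_raw_page('testtab', 0))
ORDER BY lp;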
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
invalidate_indexattr_v3.patch
commit 7054c21883154f67058d42180606e0c5428d2745
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Fri Feb 3 15:40:37 2017 -0300
CommitDate: Fri Feb 3 15:41:59 2017 -0300
Fix CREATE INDEX CONCURRENTLY relcache bug
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..003ccfa 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,11 @@ RelationGetIndexPredicate(Relation relation)
* Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
* we can include system attributes (e.g., OID) in the bitmap representation.
*
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * While all callers should at least RowExclusiveLock on the target relation,
+ * we still can't guarantee a stable set of indexes because CREATE INDEX
+ * CONCURRENTLY and DROP INDEX CONCURRENTLY can change set of indexes, without
+ * taking any conflicting lock. So we must be prepared to handle changes in the
+ * set of indexes and recompute bitmaps, when necessary.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
@@ -4763,7 +4765,6 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
Oid relpkindex;
Oid relreplindex;
ListCell *l;
- MemoryContext oldcxt;
/* Quick exit if we already computed the result. */
if (relation->rd_indexattr != NULL)
@@ -4807,6 +4808,8 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relpkindex = relation->rd_pkindex;
relreplindex = relation->rd_replidindex;
+ Assert(relation->rd_indexvalid != 0);
+
/*
* For each index, add referenced attributes to indexattrs.
*
@@ -4893,18 +4896,23 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_idattr = NULL;
/*
- * Now save copies of the bitmaps in the relcache entry. We intentionally
- * set rd_indexattr last, because that's the one that signals validity of
- * the values; if we run out of memory before making that copy, we won't
- * leave the relcache entry looking like the other ones are valid but
- * empty.
+ * Now save copies of the bitmaps in the relcache entry, but only if no
+ * invalidation occured in the meantime. We intentionally set rd_indexattr
+ * last, because that's the one that signals validity of the values; if we
+ * run out of memory before making that copy, we won't leave the relcache
+ * entry looking like the other ones are valid but empty.
*/
- oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
- relation->rd_keyattr = bms_copy(uindexattrs);
- relation->rd_pkattr = bms_copy(pkindexattrs);
- relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
- MemoryContextSwitchTo(oldcxt);
+ if (relation->rd_indexvalid != 0)
+ {
+ MemoryContext oldcxt;
+
+ oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+ relation->rd_keyattr = bms_copy(uindexattrs);
+ relation->rd_pkattr = bms_copy(pkindexattrs);
+ relation->rd_idattr = bms_copy(idindexattrs);
+ relation->rd_indexattr = bms_copy(indexattrs);
+ MemoryContextSwitchTo(oldcxt);
+ }
/* We return our original working copy for caller to play with */
switch (attrKind)
On Sat, Feb 4, 2017 at 12:12 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Pavan Deolasee wrote:
Looking at the history and some past discussions, it seems Tomas reported
a somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. [...]
Good find on that old report. [...] Pavan and I discussed it further and
he came up with the idea of returning the bitmapset we compute, but if an
invalidation occurs in index_open, simply do not cache the bitmapsets so
that next time this is called they are all computed again. Patch attached.
This appears to have appeased both test_decoding's ddl test under
CLOBBER_CACHE_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug.
I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Thanks for the detailed explanation. Using the steps mentioned in your mail
and Pavan's previous mail, I could reproduce the problem.
Regarding your fix:
+ * Now save copies of the bitmaps in the relcache entry, but only if no
+ * invalidation occured in the meantime. We intentionally set rd_indexattr
+ * last, because that's the one that signals validity of the values; if we
+ * run out of memory before making that copy, we won't leave the relcache
+ * entry looking like the other ones are valid but empty.
*/
- oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
- relation->rd_keyattr = bms_copy(uindexattrs);
- relation->rd_pkattr = bms_copy(pkindexattrs);
- relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
- MemoryContextSwitchTo(oldcxt);
+ if (relation->rd_indexvalid != 0)
+ {
+ MemoryContext oldcxt;
+
+ oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+ relation->rd_keyattr = bms_copy(uindexattrs);
+ relation->rd_pkattr = bms_copy(pkindexattrs);
+ relation->rd_idattr = bms_copy(idindexattrs);
+ relation->rd_indexattr = bms_copy(indexattrs);
+ MemoryContextSwitchTo(oldcxt);
+ }
If we do the above, then I think primary key attrs won't be returned,
because for those we are using the relation copy rather than the original
working copy of attrs. See the code below:
switch (attrKind)
{
..
case INDEX_ATTR_BITMAP_PRIMARY_KEY:
return bms_copy(relation->rd_pkattr);
Apart from the above, I think that after the proposed patch it will be quite
possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
computed using different index lists (consider a relcache flush happening
after the computation of hot_attrs or key_attrs), which was previously not
possible. I think the later code assumes that all three are computed using
the same index list. For example, see the code below:
HeapSatisfiesHOTandKeyUpdate()
{
..
Assert(bms_is_subset(id_attrs, key_attrs));
Assert(bms_is_subset(key_attrs, hot_attrs));
..
}
Also, the comments below in heap_update() indicate that we shouldn't have to
worry about a relcache flush between the computation of hot_attrs, key_attrs
and id_attrs.
heap_update()
{
..
* Note that we get a copy here, so we need not worry about relcache flush
* happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
..
}
Typo.
/occured/occurred
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
If we do above, then I think primary key attrs won't be returned
because for those we are using relation copy rather than an original
working copy of attrs. See code below:
switch (attrKind)
{
..
case INDEX_ATTR_BITMAP_PRIMARY_KEY:
return bms_copy(relation->rd_pkattr);
I don't think that would be a problem, since if relation->rd_indexattr is
NULL, primary key attrs will be recomputed and returned.
Apart from above, I think after the proposed patch, it will be quite
possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
computed using different index lists (consider relcache flush happens
after computation of hot_attrs or key_attrs) which was previously not
possible. I think in the later code we assume that all the three
should be computed using the same indexlist. For ex. see the below
code:
This seems like a real problem to me. I don't know what the consequences
are, but having the various attribute lists reflect different views of the
set of indexes definitely doesn't seem right.
The problem we are trying to fix is very clear by now. A relcache flush
clears rd_indexvalid and rd_indexattr together, but because of the race,
rd_indexattr is recomputed with the old information and gets cached again,
while rd_indexvalid (and rd_indexlist) remains unset. That leads to the
index corruption in CIC and may cause other issues too that we're not aware
of right now. We want the cached attribute lists to become invalid whenever
the index list is invalidated, and that's not happening right now.
So we have tried three approaches so far.
1. Simply reset rd_indexattr whenever we detect that rd_indexvalid has been
reset. This was the first patch. It's a very simple patch, has worked for me
with CLOBBER_CACHE_ALWAYS, and even fixes the CIC bug. The only downside, it
seems, is that we're invalidating cached attribute lists outside the normal
relcache invalidation path. It also works on the premise that
RelationGetIndexList() will be called every time
before RelationGetIndexAttrBitmap() is called, otherwise we could still end
up using stale cached copies. I wonder if cached plans can be a problem
here.
2. In the second patch, we tried to recompute the attribute lists if a
relcache flush happens in between and the index list is invalidated. We've
seen problems with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
3. We tried the third approach, where we don't cache the attribute lists if
we know that the index set has changed, and hope that the next caller gets
the correct info. But Amit rightly identified problems with that approach too.
So we either need to find a 4th approach or stay with the first patch,
unless we see a problem there too (cached plans, anything else). Or maybe
fix CREATE INDEX CONCURRENTLY so that at least the first phase, which adds
the index entry to the catalog, happens with a strong lock. This would lead
to momentary blocking of write queries, but protect us from the race. I'm
assuming this is the only code that can cause the problem, and I haven't
checked other potential hazards.
Ideas/suggestions?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for another three months. That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline. We do not have time to recover from any mistakes.
regards, tom lane
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for another three months.
Yes, I think the bug has existed all along and went unnoticed. One reason
could be that the race happens only when the new index turns HOT updates into
non-HOT updates. Another reason could be that we don't have checks in place
to catch these kinds of corruption. Having said that, now that we have
discovered the bug, many 2ndQuadrant customers have expressed
worry and want to know if the fix will be available in 9.6.2 and other
minor releases. Since the bug can lead to data corruption, the worry is
justified. Until we fix the bug, there will be a constant worry about using
CIC.
If we can have some kind of band-aid fix to plug the hole, that might be
enough as well. I tested my first patch (which will need some polishing)
and that works well AFAIK. I was worried about prepared queries and the like,
but that seems OK too: RelationGetIndexList() always gets called within
ExecInitModifyTable. The fix seems quite unlikely to cause any other side
effects.
Another possible band-aid is to throw another relcache invalidation in CIC,
like adding a dummy index_set_state_flags() within yet another transaction.
It seems ugly, but it should fix the problem for now and not have any impact
on the relcache mechanism whatsoever.
That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline. We do not have time to recover from any mistakes.
I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2017-02-05 7:54 GMT+01:00 Pavan Deolasee <pavan.deolasee@gmail.com>:
[...]
I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?
I agree with Pavan - a release with known important bug is not good idea.
Pavel
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
regards, tom lane
2017-02-05 18:51 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
There are a few questions:
1. Is it a critical bug?
2. If it is, is it a reason for a special minor release?
3. How much time is necessary to fix this bug?
These questions may be unimportant if there are security fixes in the next
release, which I cannot know.
Regards
Pavel
On 05/02/17 at 10:03, Michael Paquier wrote:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.
IMO any kind of corruption (heap or index) should be prioritized.
There have also been comments about this maybe being the cause of old
reports about index corruption.
I ask myself if it's a good idea to make a point release with a known
corruption bug in it.
Regards,
--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[ Having now read the whole thread, I'm prepared to weigh in ... ]
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
This seems like a real problem to me. I don't know what the consequences
are, but definitely having various attribute lists to have different view
of the set of indexes doesn't seem right.
For sure.
2. In the second patch, we tried to recompute attribute lists if a relcache
flush happens in between and index list is invalidated. We've seen problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could. The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I do not like any of the other patches proposed in this thread, because
they fail to guarantee delivering an up-to-date attribute bitmap to the
caller. I think we need a retry loop, and I think that it needs to look
like the attached.
(Note that the tests whether rd_pkindex and rd_replidindex changed might
be redundant; but I'm not totally sure of that, and they're cheap.)
regards, tom lane
Attachments:
invalidate_indexattr_v4.patch
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..b659994 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationGetFKeyList(Relation relation)
*** 4327,4335 ****
* it is the pg_class OID of a unique index on OID when the relation has one,
* and InvalidOid if there is no such index.
*
! * In exactly the same way, we update rd_replidindex, which is the pg_class
! * OID of an index to be used as the relation's replication identity index,
! * or InvalidOid if there is no such index.
*/
List *
RelationGetIndexList(Relation relation)
--- 4327,4336 ----
* it is the pg_class OID of a unique index on OID when the relation has one,
* and InvalidOid if there is no such index.
*
! * In exactly the same way, we update rd_pkindex, which is the OID of the
! * relation's primary key index if any, else InvalidOid; and rd_replidindex,
! * which is the pg_class OID of an index to be used as the relation's
! * replication identity index, or InvalidOid if there is no such index.
*/
List *
RelationGetIndexList(Relation relation)
*************** RelationGetIndexPredicate(Relation relat
*** 4746,4753 ****
* we can include system attributes (e.g., OID) in the bitmap representation.
*
* Caller had better hold at least RowExclusiveLock on the target relation
! * to ensure that it has a stable set of indexes. This also makes it safe
! * (deadlock-free) for us to take locks on the relation's indexes.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
--- 4747,4756 ----
* we can include system attributes (e.g., OID) in the bitmap representation.
*
* Caller had better hold at least RowExclusiveLock on the target relation
! * to ensure it is safe (deadlock-free) for us to take locks on the relation's
! * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY,
! * that lock level doesn't guarantee a stable set of indexes, so we have to
! * be prepared to retry here in case of a change in the set of indexes.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
*************** RelationGetIndexAttrBitmap(Relation rela
*** 4760,4765 ****
--- 4763,4769 ----
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
+ List *newindexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
*************** RelationGetIndexAttrBitmap(Relation rela
*** 4788,4795 ****
return NULL;
/*
! * Get cached list of index OIDs
*/
indexoidlist = RelationGetIndexList(relation);
/* Fall out if no indexes (but relhasindex was set) */
--- 4792,4800 ----
return NULL;
/*
! * Get cached list of index OIDs. If we have to start over, we do so here.
*/
+ restart:
indexoidlist = RelationGetIndexList(relation);
/* Fall out if no indexes (but relhasindex was set) */
*************** RelationGetIndexAttrBitmap(Relation rela
*** 4800,4808 ****
* Copy the rd_pkindex and rd_replidindex value computed by
* RelationGetIndexList before proceeding. This is needed because a
* relcache flush could occur inside index_open below, resetting the
! * fields managed by RelationGetIndexList. (The values we're computing
! * will still be valid, assuming that caller has a sufficient lock on
! * the relation.)
*/
relpkindex = relation->rd_pkindex;
relreplindex = relation->rd_replidindex;
--- 4805,4812 ----
* Copy the rd_pkindex and rd_replidindex value computed by
* RelationGetIndexList before proceeding. This is needed because a
* relcache flush could occur inside index_open below, resetting the
! * fields managed by RelationGetIndexList. We need to do the work with
! * stable values of these fields.
*/
relpkindex = relation->rd_pkindex;
relreplindex = relation->rd_replidindex;
*************** RelationGetIndexAttrBitmap(Relation rela
*** 4880,4886 ****
index_close(indexDesc, AccessShareLock);
}
! list_free(indexoidlist);
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
--- 4884,4915 ----
index_close(indexDesc, AccessShareLock);
}
! /*
! * During one of the index_opens in the above loop, we might have received
! * a relcache flush event on this relcache entry, which might have been
! * signaling a change in the rel's index list. If so, we'd better start
! * over to ensure we deliver up-to-date attribute bitmaps.
! */
! newindexoidlist = RelationGetIndexList(relation);
! if (equal(indexoidlist, newindexoidlist) &&
! relpkindex == relation->rd_pkindex &&
! relreplindex == relation->rd_replidindex)
! {
! /* Still the same index set, so proceed */
! list_free(newindexoidlist);
! list_free(indexoidlist);
! }
! else
! {
! /* Gotta do it over ... might as well not leak memory */
! list_free(newindexoidlist);
! list_free(indexoidlist);
! bms_free(uindexattrs);
! bms_free(pkindexattrs);
! bms_free(idindexattrs);
! bms_free(indexattrs);
! goto restart;
! }
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués <martin@2ndquadrant.com> wrote:
On 05/02/17 at 10:03, Michael Paquier wrote:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.
IMO any kind of corruption (heap or index) should be prioritized.
There's also been comments about maybe this being the cause of old
reports about index corruption.
I ask myself if it's a good idea to make a point release with a known
corruption bug in it.
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
Now, I do not object to patching this tomorrow before the wrap if the
various senior hackers who have studied this (Tom, Alvaro, Pavan,
Amit) are all in agreement on the way forward. But if there is any
significant chance that we don't fully understand this issue and that
our fix might cause bigger problems, it would be more prudent to wait.
I'm all in favor of releasing important bug fixes quickly, but
releasing a bug fix before we're sure we've fixed the bug correctly is
taking a good principle to an irrational extreme.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.
--
Peter Geoghegan
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
+1. I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there's enough other fixes in the release
imo.
- Andres
On 2017-02-05 15:14:59 -0500, Tom Lane wrote:
I do not like any of the other patches proposed in this thread, because
they fail to guarantee delivering an up-to-date attribute bitmap to the
caller. I think we need a retry loop, and I think that it needs to look
like the attached.
That looks like a reasonable approach, although I'm not sure it's the
best one.
(Note that the tests whether rd_pkindex and rd_replidindex changed might
be redundant; but I'm not totally sure of that, and they're cheap.)
I don't think they can, but I'm all for re-checking.
RelationGetIndexList(Relation relation)
@@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
 * we can include system attributes (e.g., OID) in the bitmap representation.
 *
 * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * to ensure it is safe (deadlock-free) for us to take locks on the relation's
+ * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY,
+ * that lock level doesn't guarantee a stable set of indexes, so we have to
+ * be prepared to retry here in case of a change in the set of indexes.
I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry? Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.
ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access. We could either do that by
having an additional bit of information about the validity of the
bitmaps (so we have invalid, building, valid - and only set valid at the
end of computing the bitmaps when still building and not invalid again),
or we simply move the bitmap computation into the normal relcache build.
BTW, the interactions of RelationSetIndexList with
RelationGetIndexAttrBitmap() look more than a bit fragile to me, even if
documented:
*
* We deliberately do not change rd_indexattr here: even when operating
* with a temporary partial index list, HOT-update decisions must be made
* correctly with respect to the full index set. It is up to the caller
* to ensure that a correct rd_indexattr set has been cached before first
* calling RelationSetIndexList; else a subsequent inquiry might cause a
* wrong rd_indexattr set to get computed and cached. Likewise, we do not
* touch rd_keyattr, rd_pkattr or rd_idattr.
Greetings,
Andres Freund
On 02/06/2017 01:11 AM, Peter Geoghegan wrote:
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think this kind of black-and-white thinking is very
helpful. Obviously, data corruption is bad. However, this bug has
(from what one can tell from this thread) been with us for over a
decade; it must necessarily be either low-probability or
low-severity, or somebody would've found it and fixed it before
now. Indeed, the discovery of this bug was driven by new feature
development, not a user report. It seems pretty clear that if we
try to patch this and get it wrong, the effects of our mistake
could easily be a lot more serious than the original bug.
+1. The fact that it wasn't driven by a user report convinces me
that this is the way to go.
+1 to not rushing fixes into releases. While I think we now finally
understand the mechanics of this bug, the fact that we came up with
three different fixes in this thread, only to discover issues with each
of them, warrants some caution.
OTOH I disagree with the notion that bugs that are not driven by user
reports are somehow less severe. Some data corruption bugs cause quite
visible breakage - segfaults, immediate crashes, etc. Those are pretty
clear bugs, and are reported by users.
Other data corruption bugs are much more subtle - for example this bug
may lead to incorrect results for some queries, failing to detect values
violating UNIQUE constraints, issues with foreign keys, etc.
It's damn impossible to notice incorrect query results that only affect
a tiny subset of the rows (e.g. rows updated when the CIC was running),
especially when the issue may go away after a while due to additional
non-HOT updates.
Regarding the other symptoms - I wonder how many strange 'duplicate
value' errors were misdiagnosed, wrongly attributed to a recent power
outage, etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
RelationGetIndexList(Relation relation)
@@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relation)
* we can include system attributes (e.g., OID) in the bitmap representation.
*
* Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * to ensure it is safe (deadlock-free) for us to take locks on the relation's
+ * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY,
+ * that lock level doesn't guarantee a stable set of indexes, so we have to
+ * be prepared to retry here in case of a change in the set of indexes.
I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry? Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.
ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access. We could either do that by
having an additional bit of information about the validity of the
bitmaps (so we have invalid, building, valid - and only set valid at the
end of computing the bitmaps when still building and not invalid again),
or we simply move the bitmap computation into the normal relcache build.
To show what I mean here's an *unpolished* and *barely tested* patch
implementing the first of my suggestions.
Alvaro, Pavan, I think this should address the issue as well?
- Andres
Attachment: WIP-relcache-bitmapsvalid.patch (text/x-patch)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560e46..9e94495e75 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,12 @@ RelationGetIndexPredicate(Relation relation)
* Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
* we can include system attributes (e.g., OID) in the bitmap representation.
*
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes. This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes. Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
@@ -4766,7 +4769,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
MemoryContext oldcxt;
/* Quick exit if we already computed the result. */
- if (relation->rd_indexattr != NULL)
+ if (relation->rd_bitmapsvalid == 2)
{
switch (attrKind)
{
@@ -4788,6 +4791,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
return NULL;
/*
+ * Signal that the attribute bitmaps are being built. If there's any
+ * relcache invalidations while building them, rd_bitmapsvalid will be
+ * reset to 0. In that case return the bitmaps, but don't mark them as
+ * valid.
+ */
+ relation->rd_bitmapsvalid = 1;
+
+ /*
* Get cached list of index OIDs
*/
indexoidlist = RelationGetIndexList(relation);
@@ -4892,13 +4903,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
- /*
- * Now save copies of the bitmaps in the relcache entry. We intentionally
- * set rd_indexattr last, because that's the one that signals validity of
- * the values; if we run out of memory before making that copy, we won't
- * leave the relcache entry looking like the other ones are valid but
- * empty.
- */
+ /* now save copies of the bitmaps in the relcache entry */
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4911,18 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_indexattr = bms_copy(indexattrs);
MemoryContextSwitchTo(oldcxt);
+ /*
+ * If there've been no invalidations while building the entry, mark the
+ * stored bitmaps as being valid. Need to do so after the copies above,
+ * as we could run out of memory while doing so.
+ *
+ * NB: No relcache accesses should happen inside this routine after this.
+ */
+ if (relation->rd_bitmapsvalid == 1)
+ {
+ relation->rd_bitmapsvalid = 2;
+ }
+
/* We return our original working copy for caller to play with */
switch (attrKind)
{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7cf56..4fbf6632a0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -137,6 +137,7 @@ typedef struct RelationData
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetIndexAttrBitmap: */
+ int rd_bitmapsvalid; /* 0 invalid, 1 building, 2 valid */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
OTOH I disagree with the notion that bugs that are not driven by user
reports are somehow less severe. Some data corruption bugs cause quite
visible breakage - segfaults, immediate crashes, etc. Those are pretty clear
bugs, and are reported by users.
I meant that I find the fact that there were no user reports in all
these years to be a good reason to not proceed for now in this
instance.
I wrote amcheck to detect the latter variety of bug, so clearly I
think that they are very serious.
--
Peter Geoghegan
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. In the second patch, we tried to recompute attribute lists if a
relcache
flush happens in between and index list is invalidated. We've seen
problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could.
Ah, ok. That explains why the retry logic as originally proposed was
causing infinite loop.
The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.
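For illustration, the shape of the approach being discussed is roughly the following. This is a simplified, hypothetical sketch rather than the committed code: compute_attr_bitmap() is a made-up placeholder for the loop over the relation's indexes, and the real function builds four bitmaps at once inside relcache.c.

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"
#include "utils/relcache.h"

/* Hypothetical helper: walk indexoidlist and OR each index's columns together. */
static Bitmapset *compute_attr_bitmap(Relation relation, List *indexoidlist);

/*
 * Simplified sketch of the "retry only if the index set actually changed"
 * idea discussed above.  Not the committed code.
 */
static Bitmapset *
get_index_attr_bitmap_sketch(Relation relation)
{
	List	   *indexoidlist;
	List	   *newindexoidlist;
	Bitmapset  *attrs;

restart:
	/* Remember which set of indexes the computation is based on. */
	indexoidlist = RelationGetIndexList(relation);

	/* Build the attribute bitmap from that index list. */
	attrs = compute_attr_bitmap(relation, indexoidlist);

	/*
	 * A relcache flush may have arrived while the indexes were being opened.
	 * Re-fetch the index list and compare: only if the set *actually*
	 * changed do we throw the work away and start over, so that
	 * CLOBBER_CACHE_ALWAYS cannot force an endless loop.
	 */
	newindexoidlist = RelationGetIndexList(relation);
	if (!equal(indexoidlist, newindexoidlist))
	{
		bms_free(attrs);
		list_free(newindexoidlist);
		list_free(indexoidlist);
		goto restart;
	}

	list_free(newindexoidlist);
	list_free(indexoidlist);
	return attrs;				/* now safe to cache in the relcache entry */
}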
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
RelationGetIndexList(Relation relation)
@@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relation)
* we can include system attributes (e.g., OID) in the bitmap representation.
*
* Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * to ensure it is safe (deadlock-free) for us to take locks on the relation's
+ * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY,
+ * that lock level doesn't guarantee a stable set of indexes, so we have to
+ * be prepared to retry here in case of a change in the set of indexes.
I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry? Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.
ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access. We could either do that by
having an additional bit of information about the validity of the
bitmaps (so we have invalid, building, valid - and only set valid at the
end of computing the bitmaps when still building and not invalid again),
or we simply move the bitmap computation into the normal relcache build.
To show what I mean here's an *unpolished* and *barely tested* patch
implementing the first of my suggestions.
+ * Signal that the attribute bitmaps are being built. If there's any
+ * relcache invalidations while building them, rd_bitmapsvalid will be
+ * reset to 0. In that case return the bitmaps, but don't mark them as
+ * valid.
+ */
I don't see in your patch that you are setting rd_bitmapsvalid to 0.
Also, I think this suffers from the same problem as the patch proposed
by Alvaro which is that different attrs (hot_attrs, key_attrs and
id_attrs) will get computed based on different index lists which can
cause a problem as mentioned in my mail upthread. However, I think
your general approach and idea that we have to set things up for
next time is spot on. I am not sure but I think your solution can be
extended to make it somewhat similar to what we do with rd_indexvalid
(basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily
forced) at relcache invalidation and then clean it up at transaction
end). I can try something along those lines.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.
I'm not sure that just because the bug wasn't reported by a user, makes it
any less critical. As Tomas pointed down thread, the nature of the bug is
such that the users may not discover it very easily, but that doesn't mean
it couldn't be affecting them all the time. We can now correlate many past
reports of index corruption to this bug, but we don't have evidence to
prove that. Lack of any good tool or built-in checks probably makes it even
harder.
The reason why I discovered this with WARM is because it now has a built-in
recheck logic, which was discarding some tuples returned by the index scan.
It took me lots of code review and inspection to finally conclude that this
must be an existing bug. Even then for lack of any utility, I could not
detect this easily with master. That doesn't mean I wasn't able to
reproduce, but detection was a problem.
If you look at the reproduction setup, one in every 10, if not 5, CREATE
INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
probability. And this is on a low end laptop, with just 5-10 concurrent
clients running. Probability of hitting the problem will be much higher on
a bigger machine, with many users (on a decent AWS machine, I would find
every third CIC to be corrupt). Moreover the setup is not doing anything
extraordinary. Just concurrent updates which change between HOT to non-HOT
and a CIC.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. In the second patch, we tried to recompute attribute lists if a
relcache
flush happens in between and index list is invalidated. We've seen
problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could.
Ah, ok. That explains why the retry logic as originally proposed was causing
infinite loop.
The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.
Don't you think it also has the same problem as mentioned by me in
Alvaro's patch and you also agreed on it? Do you see any need to fix
it locally in
RelationGetIndexAttrBitmap(), why can't we clear at transaction end?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I'm not sure that just because the bug wasn't reported by a user, makes it
any less critical. As Tomas pointed down thread, the nature of the bug is
such that the users may not discover it very easily, but that doesn't mean
it couldn't be affecting them all the time. We can now correlate many past
reports of index corruption to this bug, but we don't have evidence to prove
that. Lack of any good tool or built-in checks probably makes it even
harder.
I think that we need to make an automated checker tool a requirement
for very complicated development projects in the future. We're behind
here.
--
Peter Geoghegan
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote:
I don't see in your patch that you are setting rd_bitmapsvalid to 0.
IIRC a plain relcache rebuild should do that (note there's also no place
that directly resets rd_indexattrs).
Also, I think this suffers from the same problem as the patch proposed
by Alvaro which is that different attrs (hot_attrs, key_attrs and
id_attrs) will get computed based on different index lists which can
cause a problem as mentioned in my mail up thread.
I'm not convinced that that's a legitimate concern. If that's an issue
with CIC, then we have a lot bigger problems, because relcache entries
and such will be inconsistent anyway if you have non-exclusive locks
while changing catalog data. In the situations where that happens it
better be harmless (otherwise CIC is broken), and the next access will
recompute them.
I am not sure but I think your solution can be
extended to make it somewhat similar to what we do with rd_indexvalid
(basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily
forced) at rel cache invalidation and then clean it up transaction
end). I can try something on those lines.
Not sure I understand what you mean with "clean it up at transaction end"?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry?
We don't *have to* retry; the core of the fix is to not enter stale data
into the cache after we've already received an invalidation signal. The
current caller can survive with stale data; or if that's not true, C.I.C.
has got more fundamental problems than posited here. But it seems like a
generally bad idea for relcache to return data that it knows (or at least
has reason to believe) is stale.
Also, even if you're okay with return-stale-data-but-don't-cache-it,
I have little faith that invalidate_indexattr_v3.patch successfully
implements that. Consider the sequence: inval received during
RelationGetIndexAttrBitmap, we clear rd_indexvalid, something asks for
the relation OID list, we recalculate that and set rd_indexvalid, then
we reach the end of RelationGetIndexAttrBitmap and see that rd_indexvalid
is set. If that happened, we'd cache the bitmaps whether or not they were
really up to date. Now admittedly, it's a bit hard to think of a reason
why anything would ask for the index list of anything but a system catalog
in that sequence, so as long as you assume that we don't allow C.I.C. (or
more realistically, REINDEX CONCURRENTLY) on system catalogs, this failure
mode is unreachable. But I much prefer having a positive verification
that the index list is still what it was when we started.
regards, tom lane
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. In the second patch, we tried to recompute attribute lists if a
relcache
flush happens in between and index list is invalidated. We've seen
problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache
flush
happen whenever it possibly could.
Ah, ok. That explains why the retry logic as originally proposed was
causing
infinite loop.
The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.
Don't you think it also has the same problem as mentioned by me in
Alvaro's patch and you also agreed on it?
No, I don't think so. There can't be any more relcache flushes occurring
between the first RelationGetIndexAttrBitmap() and the subsequent
RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
approach was that we were not caching, yet returning stale information.
That implied the next call will again recompute a possibly different value.
The retry logic is trying to close a small yet important race
condition where index set invalidation happens, but we cache stale
information.
Do you see any need to fix
it locally in
RelationGetIndexAttrBitmap(), why can't we clear at transaction end?
We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
the race exists. I'm not saying there aren't other and better ways of
handling it, but we don't know yet (I haven't studied Andres's proposal).
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:
I agree with Pavan - a release with a known important bug is not a good
idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".+1. I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there's enough other fixes in the release
imo.
I'm a bit surprised with this position. What do we tell our users now that
we know this bug exists? They can't fully trust CIC and they don't have any
mechanism to confirm if the newly built index is corruption free or not.
I'm not suggesting that we should hastily refactor the code, but at the
very least some sort of band-aid fix which helps the situation, yet has
minimal side-effects, is warranted. I believe the proposed retry patch does
exactly that.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2017-02-05 21:49:57 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry?
We don't *have to* retry; the core of the fix is to not enter stale data
into the cache after we've already received an invalidation signal.
Right, the bug is that we do so without remembering that fact.
The current caller can survive with stale data; or if that's not true,
C.I.C. has got more fundamental problems than posited here.
Indeed.
But it seems like a generally bad idea for relcache to return data
that it knows (or at least has reason to believe) is stale.
I'm not convinced by this - this kind of staleness can only occur if
changes happen during the execution of the cache building. And the
information returned can be outdated by pretty much the moment you
return anyway. It's also how a number of the current caches
essentially already work.
Also, even if you're okay with return-stale-data-but-don't-cache-it,
I have little faith that invalidate_indexattr_v3.patch successfully
implements that.
I sent a prototype clarifying what I mean. It's not really like
invalidate_indexattr_v3.patch
Btw, it seems RelationGetIndexList() avoids a similar issue only due to
either luck or a lot of undocumented assumptions. The only reason that
setting relation->rd_indexlist / rd_indexvalid at the end doesn't cause
a similar issue seems to be that no invalidations appear to be processed
after the systable_beginscan(). And that in turn seems to not process
relcache invals after RegisterSnapshot(GetCatalogSnapshot(relid)).
Phew, that seems a bit fragile.
Greetings,
Andres Freund
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund <andres@anarazel.de> wrote:
+1. I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there's enough other fixes in the release
imo.
I'm a bit surprised with this position.
The point is that there's a nontrivial chance of a hasty fix introducing
worse problems than we fix.
Given the lack of consensus about exactly how to fix this, I'm feeling
like it's a good idea if whatever we come up with gets some time to age
awhile in git before we ship it.
Obviously, 2ndQ or EDB or any other distributor can choose to ship a patch
in their own builds if they're sufficiently comfortable with the particular
patch. That doesn't translate to there having to be a fix in the
community's wraps this week.
regards, tom lane
On 2017-02-05 22:34:34 -0500, Tom Lane wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
The point is that there's a nontrivial chance of a hasty fix introducing
worse problems than we fix.
Given the lack of consensus about exactly how to fix this, I'm feeling
like it's a good idea if whatever we come up with gets some time to age
awhile in git before we ship it.
Right. And I'm not even convinced that we really know the extent of the
bug; it seems fairly plausible that there's further incidences of this.
There's also the issue that the mechanics in the older backbranches are
different again, because of SnapshotNow.
I'm a bit surprised with this position. What do we tell our users now that
we know this bug exists?
That we're scheduling a bugfix for the next point release. I don't
think we can truthfully claim that there are no known corruption bugs in
any of the releases in the last few years.
Greetings,
Andres Freund
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. In the second patch, we tried to recompute attribute lists if a
relcache
flush happens in between and index list is invalidated. We've seen
problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could.
Ah, ok. That explains why the retry logic as originally proposed was causing
infinite loop.
The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.
Don't you think it also has the same problem as mentioned by me in
Alvaro's patch and you also agreed on it?
No, I don't think so. There can't be any more relcache flushes occurring
between the first RelationGetIndexAttrBitmap() and the subsequent
RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
approach was that we were not caching, yet returning stale information. That
implied the next call will again recompute a possibly different value. The
retry logic is trying to close a small yet important race condition
where index set invalidation happens, but we cache stale information.
Hmm. Consider that the first time relcache invalidation occurs while
computing id_attrs, so now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above.). However, the attrs computed for hot_* and key_* will
be using only one index, so this will lead to a situation where two of
the attrs (hot_attrs and key_attrs) are computed with one index and
id_attrs is computed with two indexes. I think this can lead to
Assertion in HeapSatisfiesHOTandKeyUpdate().
Do you see any need to fix
it locally in
RelationGetIndexAttrBitmap(), why can't we clear at transaction end?
We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
the race exists. I'm not saying there aren't other and better ways of
handling it, but we don't know yet (I haven't studied Andres's proposal).
Okay, please find attached patch which is an extension of Tom's and
Andres's patch and I think it fixes the above problem noted by me.
Basically, invalidate the bitmaps at transaction end rather than in
RelationGetIndexAttrBitmap(). I think it is okay for
RelationGetIndexAttrBitmap() to use stale information until
transaction end because all the updates in the current running
transaction will be consumed by the second phase of CIC.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment: invalidate_indexattr_v6.patch (application/octet-stream)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..d1543f1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3053,6 +3053,23 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
relation->rd_replidindex = InvalidOid;
relation->rd_indexvalid = 0;
}
+
+ /*
+ * Flush any temporary index bitmaps.
+ */
+ if (relation->rd_bitmapsvalid == 2)
+ {
+ bms_free(relation->rd_indexattr);
+ relation->rd_indexattr = NULL;
+ bms_free(relation->rd_keyattr);
+ relation->rd_keyattr = NULL;
+ bms_free(relation->rd_pkattr);
+ relation->rd_pkattr = NULL;
+ bms_free(relation->rd_idattr);
+ relation->rd_idattr = NULL;
+
+ relation->rd_bitmapsvalid = 0;
+ }
}
/*
@@ -3166,6 +3183,23 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
relation->rd_replidindex = InvalidOid;
relation->rd_indexvalid = 0;
}
+
+ /*
+ * Flush any temporary index bitmaps.
+ */
+ if (relation->rd_bitmapsvalid == 2)
+ {
+ bms_free(relation->rd_indexattr);
+ relation->rd_indexattr = NULL;
+ bms_free(relation->rd_keyattr);
+ relation->rd_keyattr = NULL;
+ bms_free(relation->rd_pkattr);
+ relation->rd_pkattr = NULL;
+ bms_free(relation->rd_idattr);
+ relation->rd_idattr = NULL;
+
+ relation->rd_bitmapsvalid = 0;
+ }
}
@@ -4745,9 +4779,12 @@ RelationGetIndexPredicate(Relation relation)
* Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
* we can include system attributes (e.g., OID) in the bitmap representation.
*
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes. This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes. Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
*
* The returned result is palloc'd in the caller's memory context and should
* be bms_free'd when not needed anymore.
@@ -4760,13 +4797,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
Bitmapset *pkindexattrs; /* columns in the primary index */
Bitmapset *idindexattrs; /* columns in the replica identity */
List *indexoidlist;
+ List *newindexoidlist;
Oid relpkindex;
Oid relreplindex;
ListCell *l;
MemoryContext oldcxt;
/* Quick exit if we already computed the result. */
- if (relation->rd_indexattr != NULL)
+ if (relation->rd_bitmapsvalid != 0)
{
switch (attrKind)
{
@@ -4880,8 +4918,6 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
index_close(indexDesc, AccessShareLock);
}
- list_free(indexoidlist);
-
/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_indexattr);
relation->rd_indexattr = NULL;
@@ -4892,13 +4928,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
bms_free(relation->rd_idattr);
relation->rd_idattr = NULL;
- /*
- * Now save copies of the bitmaps in the relcache entry. We intentionally
- * set rd_indexattr last, because that's the one that signals validity of
- * the values; if we run out of memory before making that copy, we won't
- * leave the relcache entry looking like the other ones are valid but
- * empty.
- */
+ /* now save copies of the bitmaps in the relcache entry */
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
relation->rd_keyattr = bms_copy(uindexattrs);
relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4936,48 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
relation->rd_indexattr = bms_copy(indexattrs);
MemoryContextSwitchTo(oldcxt);
+ /*
+ * During one of the index_opens in the above loop, we might have received
+ * a relcache flush event on this relcache entry, which might have been
+ * signaling a change in the rel's index list. If so, we'd better start
+ * over to ensure we deliver up-to-date attribute bitmaps.
+ */
+ newindexoidlist = RelationGetIndexList(relation);
+ if (equal(indexoidlist, newindexoidlist) &&
+ relpkindex == relation->rd_pkindex &&
+ relreplindex == relation->rd_replidindex)
+ {
+ /* Still the same index set, so proceed */
+ list_free(newindexoidlist);
+ list_free(indexoidlist);
+ }
+ else
+ {
+ /*
+ * Signal that the attribute bitmaps are being built. If there's any
+ * relcache invalidations while building them, rd_bitmapsvalid will be
+ * be set to 2. In that case return the bitmaps, but don't mark them as
+ * valid.
+ */
+ relation->rd_bitmapsvalid = 2;
+ /* Flag relation as needing eoxact cleanup (to reset the list) */
+ EOXactListAdd(relation);
+
+ /* Gotta do it over ... might as well not leak memory */
+ list_free(newindexoidlist);
+ list_free(indexoidlist);
+ }
+
+ /*
+ * If there've been no invalidations while building the entry, mark the
+ * stored bitmaps as being valid. Need to do so after the copies above,
+ * as we could run out of memory while doing so.
+ *
+ * NB: No relcache accesses should happen inside this routine after this.
+ */
+ if (relation->rd_bitmapsvalid != 2)
+ relation->rd_bitmapsvalid = 1;
+
/* We return our original working copy for caller to play with */
switch (attrKind)
{
@@ -5519,6 +5591,7 @@ load_relcache_init_file(bool shared)
else
rel->rd_refcnt = 0;
rel->rd_indexvalid = 0;
+ rel->rd_bitmapsvalid = 0;
rel->rd_fkeylist = NIL;
rel->rd_fkeyvalid = false;
rel->rd_indexlist = NIL;
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a617a7c..6442127 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -137,6 +137,8 @@ typedef struct RelationData
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetIndexAttrBitmap: */
+ int rd_bitmapsvalid; /* state of rd_indexattr: 0 = not valid, 1 =
+ * valid, 2 = temporarily forced */
Bitmapset *rd_indexattr; /* identifies columns used in indexes */
Bitmapset *rd_keyattr; /* cols that can be ref'd by foreign keys */
Bitmapset *rd_pkattr; /* cols included in primary key */
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hmm. Consider that the first time relcache invalidation occurs while
computing id_attrs, so now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above.).
I don't quite get that. Since rd_idattr must be already cached at that
point and we don't expect to process a relcache flush between successive
calls to RelationGetIndexAttrBitmap(), we should return a consistent copy
of rd_idattr. But maybe I am missing something.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok.
While it looks certain that the fix will miss this point release, FWIW I
ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as
expected. Hard to run all the tests, but a small subset of regression and
test_decoding seems ok.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hmm. Consider that the first time relcache invalidation occurs while
computing id_attrs, so now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above.).
I don't quite get that. Since rd_idattr must be already cached at that point
and we don't expect to process a relcache flush between successive calls to
RelationGetIndexAttrBitmap(), we should return a consistent copy of
rd_idattr.
Right, I was mistaken. However, I am not sure if there is a need to
perform retry logic in RelationGetIndexAttrBitmap(). It is safe to do
that at transaction end. I feel even though Tom's fix looks reliable
with respect to the problem reported in this thread, we should take
some time and evaluate different proposals and see which one is the
best way to move forward.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 6 Feb 2017, at 4:57, Peter Geoghegan <pg@bowt.ie> wrote:
I meant that I find the fact that there were no user reports in all
these years to be a good reason to not proceed for now in this
instance.
Well, we had some strange situations with indexes (see below for an example) but I couldn't even have guessed that CIC was the problem.
And it is really difficult to diagnose problems of this kind, because 1. you may see the problem several days after the last major change in the database and 2. you don't even know where to start investigating.
xdb314g/maildb M # show enable_indexscan ;
enable_indexscan
------------------
off
(1 row)
Time: 0.120 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
-[ RECORD 1 ]---+------------------------------
uid | 448300241
fid | 1
<...>
Time: 30398.637 ms
xdb314g/maildb M # set enable_indexscan to true;
SET
Time: 0.111 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
(0 rows)
Time: 0.312 ms
xdb314g/maildb M #
The row of course hasn’t been deleted.
--
May the force be with you…
https://simply.name
On 05/02/17 at 21:57, Tomas Vondra wrote:
+1 to not rushing fixes into releases. While I think we now finally
understand the mechanics of this bug, the fact that we came up with
three different fixes in this thread, only to discover issues with each
of them, warrants some caution.
I also agree with Robert on not rushing the patch. My point was about
whether we had to rush the release.
OTOH I disagree with the notion that bugs that are not driven by user
reports are somehow less severe. Some data corruption bugs cause quite
visible breakage - segfaults, immediate crashes, etc. Those are pretty
clear bugs, and are reported by users.
Other data corruption bugs are much more subtle - for example this bug
may lead to incorrect results to some queries, failing to detect values
violating UNIQUE constraints, issues with foreign keys, etc.
Just yesterday, after sending that mail, I was recalling a logical
replication setup we did on a customer's 9.3 server, which brought up
data inconsistencies on the primary key of one of the tables. The table
had duplicate values.
As Tomas says, it's subtle and hard to find unless you constantly run
index checks (query a sample of the data from the heap and from the
index and check they match). In our case, the customer was not aware of
the dups until we found them.
Regards,
--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
2. In the second patch, we tried to recompute attribute lists if a relcache
flush happens in between and index list is invalidated. We've seen problems
with that, especially it getting into an infinite loop with
CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could. The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
Ah, that's a nice and simple way to fix that patch! I see that Pavan
confirmed that it fixes the tests we saw failing too. It seems to me
that we should go with this one, and it looks unlikely that this causes
other problems.
BTW, while neither Pavan nor I sent this patch to -hackers, this
implementation is pretty much the same thing we did. Pavan deserves
credit as co-author in this commit.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote:
To show what I mean here's an *unpolished* and *barely tested* patch
implementing the first of my suggestions.
Alvaro, Pavan, I think this should address the issue as well?
Hmm, interesting idea. Maybe a patch along these lines is a good way to
make index-list cache less brittle going forward. However, I'm against
putting it in the stable branches -- and we should definitely not stall
an immediate fix in order to get this patch ready. IMO we should get
Tom's patch in tree for all branches very soon, and then after the
release you can finalize this one and put it to master.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Kapila <amit.kapila16@gmail.com> writes:
Hmm. Consider that the first time relcache invalidation occurs while
computing id_attrs, so now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above.). However, the attrs computed for hot_* and key_* will
be using only one index, so this will lead to a situation where two of
the attrs (hot_attrs and key_attrs) are computed with one index and
id_attrs is computed with two indexes. I think this can lead to
Assertion in HeapSatisfiesHOTandKeyUpdate().
That seems like something we'd be best off to fix by changing the
assertion. I doubt it's really going to be practical to guarantee that
those bitmapsets are always 100% consistent throughout a transaction.
Okay, please find attached patch which is an extension of Tom's and
Andres's patch and I think it fixes the above problem noted by me.
I don't like this patch at all. It only fixes the above issue if the
relcache inval arrives while RelationGetIndexAttrBitmap is executing.
If the inval arrives between two such calls, you still have the problem
of the second one delivering a bitmapset that might be inconsistent
with the first one.
To go in this direction, I think we would have to hot-wire
RelationGetIndexAttrBitmap so that once any bitmapset has been requested
in a transaction, we intentionally ignore all index set updates for the
remainder of the transaction. And that would very likely create more
problems than it solves (consider locally initiated DDL within the
transaction, for example).
Better to fix the callers so that they don't have the assumption you
refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
so that it returns all the sets needed by a given calling module at
once, which would allow us to guarantee they're consistent.
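As a purely illustrative sketch of what such an API could look like, under the assumption that all four bitmaps are computed from a single snapshot of the index list (none of these type or function names exist in the tree):

#include "nodes/bitmapset.h"
#include "utils/rel.h"

/* Hypothetical bundle of every bitmap a caller might need, built together. */
typedef struct IndexAttrBitmaps
{
	Bitmapset  *indexattrs;		/* columns used in any index (HOT checks) */
	Bitmapset  *keyattrs;		/* columns that can be referenced by foreign keys */
	Bitmapset  *pkattrs;		/* columns included in the primary key */
	Bitmapset  *idattrs;		/* columns in the replica identity index */
} IndexAttrBitmaps;

/*
 * Hypothetical variant of RelationGetIndexAttrBitmap(): compute all four
 * sets from one snapshot of the relation's index list, so a caller such as
 * heap_update() can never end up mixing bitmaps derived from different
 * index sets.
 */
extern IndexAttrBitmaps *RelationGetAllIndexAttrBitmaps(Relation relation);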
regards, tom lane
Tom Lane wrote:
Better to fix the callers so that they don't have the assumption you
refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
so that it returns all the sets needed by a given calling module at
once, which would allow us to guarantee they're consistent.
Note that my "interesting attrs" patch does away with these independent
bitmaps (which was last posted by Pavan as part of his WARM set). I
think we should fix just this bug now, and for the future look at that
other approach.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
Better to fix the callers so that they don't have the assumption you
refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
so that it returns all the sets needed by a given calling module at
once, which would allow us to guarantee they're consistent.
Note that my "interesting attrs" patch does away with these independent
bitmaps (which was last posted by Pavan as part of his WARM set). I
think we should fix just this bug now, and for the future look at that
other approach.
BTW, if there is a risk of the assertion failure that Amit posits,
it seems like it should have happened in the tests that Pavan was doing
originally. I'd sort of like to see a demonstration that it can actually
happen before we spend any great amount of time fixing it.
regards, tom lane
On Sun, Feb 5, 2017 at 9:42 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.
I'm not sure that just because the bug wasn't reported by a user, makes it
any less critical. As Tomas pointed down thread, the nature of the bug is
such that the users may not discover it very easily, but that doesn't mean
it couldn't be affecting them all the time. We can now correlate many past
reports of index corruption to this bug, but we don't have evidence to prove
that. Lack of any good tool or built-in checks probably makes it even
harder.
The reason why I discovered this with WARM is because it now has a built-in
recheck logic, which was discarding some tuples returned by the index scan.
It took me lots of code review and inspection to finally conclude that this
must be an existing bug. Even then for lack of any utility, I could not
detect this easily with master. That doesn't mean I wasn't able to
reproduce, but detection was a problem.
If you look at the reproduction setup, one in every 10, if not 5, CREATE
INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
probability. And this is on a low end laptop, with just 5-10 concurrent
clients running. Probability of hitting the problem will be much higher on a
bigger machine, with many users (on a decent AWS machine, I would find every
third CIC to be corrupt). Moreover the setup is not doing anything
extraordinary. Just concurrent updates which change between HOT to non-HOT
and a CIC.
Not that I am advocating that we should do a release just for this;
having a fix we believe in is certainly as important a factor, but
the idea that the bug has been around a long time and therefore is
less of an issue does seem wrong. We've certainly seen plenty of cases
over the years where bugs have existed in the code in seldom used code
paths, only to be exposed by new features or other code changes over
time. In general, we should be less worried about the age of a bug vs
our expectations that users are likely to hit that bug now, which does
seem high based on the above numbers.
In the meantime, it's certainly worth warning users, providing help on
how to determine if this is a likely problem for them, and possibly
rolling a patch ahead of upstream in cases where that's feasible.
Robert Treat
play: xzilla.net
work: omniti.com
After some discussion among the release team, we've concluded that the
best thing to do is to push Pavan's/my patch into today's releases.
This does not close the matter by any means: we should continue to
study whether there are related bugs or whether there's a more principled
way of fixing this bug. But that patch clearly makes things better,
and we shouldn't let worries about whether there are more bugs stop
us from providing some kind of fix to users.
I've made the push, and barring negative reports from the buildfarm,
it will be in today's releases.
regards, tom lane
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
After some discussion among the release team, we've concluded that the
best thing to do is to push Pavan's/my patch into today's releases.
This does not close the matter by any means: we should continue to
study whether there are related bugs or whether there's a more principled
way of fixing this bug. But that patch clearly makes things better,
and we shouldn't let worries about whether there are more bugs stop
us from providing some kind of fix to users.
I've made the push, and barring negative reports from the buildfarm,
it will be in today's releases.
Thank you for taking care of it. The buildfarm is looking green so far.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Hmm. Consider that the first time relcache invalidation occurs while
computing id_attrs, so now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above.). However, the attrs computed for hot_* and key_* will
be using only one index, so this will lead to a situation where two of
the attrs (hot_attrs and key_attrs) are computed with one index and
id_attrs is computed with two indexes. I think this can lead to
Assertion in HeapSatisfiesHOTandKeyUpdate().
That seems like something we'd be best off to fix by changing the
assertion.
The above-mentioned problem doesn't exist in your version of the patch
(which is now committed) because the index attrs are cached after
invalidation, as I mentioned in my followup mail yesterday. The
guarantee of safety is that we don't handle invalidation
between two consecutive calls to RelationGetIndexAttrBitmap() in
heap_update, but if we do handle one for some reason that is not
apparent to me, then I think there is some chance of hitting the
assertion.
I doubt it's really going to be practical to guarantee that
those bitmapsets are always 100% consistent throughout a transaction.
Agreed. As the code stands, I think it is only expected to be
Agreed. As the code stands, I think it is only expected to be
guaranteed across three consecutive calls to
RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in
HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then
probably we don't need a guarantee to be consistent in three
consecutive calls as well.
Okay, please find attached patch which is an extension of Tom's and
Andres's patch and I think it fixes the above problem noted by me.
I don't like this patch at all. It only fixes the above issue if the
relcache inval arrives while RelationGetIndexAttrBitmap is executing.
If the inval arrives between two such calls, you still have the problem
of the second one delivering a bitmapset that might be inconsistent
with the first one.
I think it won't happen between two consecutive calls to
RelationGetIndexAttrBitmap in heap_update. It can happen during a
multi-row update, but that should be fine. I think this version of the
patch only defers the creation of fresh bitmapsets wherever
possible.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Was just curious if anyone was able to come up with any sort of method to
test whether an index was corrupted by this bug, other than just waiting
for bad query results? We've used concurrent index rebuilding quite
extensively over the years to remove bloat from busy systems, but
reindexing the entire database "just in case" is unrealistic in many of our
cases.
Keith Fiske wrote:
Was just curious if anyone was able to come up with any sort of method to
test whether an index was corrupted by this bug, other than just waiting
for bad query results? We've used concurrent index rebuilding quite
extensively over the years to remove bloat from busy systems, but
reindexing the entire database "just in case" is unrealistic in many of our
cases.
As stated, if the CREATE INDEX operates on columns that were already
indexed beforehand (which is normally the case when you rebuild because of
bloat), then there is no chance of index corruption.
Scanning indexes+tables is just as load-intensive as rebuilding the
indexes anyway. You don't save any work. I suppose it can be a problem
if you have an index big enough that it doesn't fit on your remaining
free space (but in that case you have a pre-existing problem which you
should solve anyway).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 17, 2017 at 11:12 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Keith Fiske wrote:
Was just curious if anyone was able to come up with any sort of method to
test whether an index was corrupted by this bug, other than just waiting
for bad query results? We've used concurrent index rebuilding quite
extensively over the years to remove bloat from busy systems, but
reindexing the entire database "just in case" is unrealistic in many of our
cases.
As stated, if the CREATE INDEX operates on columns that were already
indexed beforehand (which is normally the case when you rebuild because of
bloat), then there is no chance of index corruption.
Scanning indexes+tables is just as load-intensive as rebuilding the
indexes anyway. You don't save any work. I suppose it can be a problem
if you have an index big enough that it doesn't fit on your remaining
free space (but in that case you have a pre-existing problem which you
should solve anyway).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
It's not the load I'm worried about, it's the locks that are required at
some point during the rebuild. Doing an index rebuild here and there isn't
a big deal, but trying to do it for an entire heavily loaded,
multi-terabyte database is hardly trivial. And I'd say doing a scan is far
less invasive than actually rebuilding the index since little to no writing
is actually being done.
I can understand if it's simply not possible, but if it is, I think that
in any case of data corruption, having some means to check for it so you
can be sure you're in the clear would be useful.
Keith
Keith Fiske wrote:
I can understand if it's simply not possible, but if it is, I think that
in any case of data corruption, having some means to check for it so you
can be sure you're in the clear would be useful.
Maybe it is possible. I just didn't try, since it didn't seem very
useful.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 17, 2017 at 8:23 AM, Keith Fiske <keith@omniti.com> wrote:
It's not the load I'm worried about, it's the locks that are required at
some point during the rebuild. Doing an index rebuild here and there isn't a
big deal, but trying to do it for an entire heavily loaded, multi-terabyte
database is hardly trivial. And I'd say doing a scan is far less invasive
than actually rebuilding the index since little to no writing is actually
being done.
Certainly, the checks that amcheck performs that don't require a
heavier lock (just a SELECT-style AccessShareLock) are vastly less
expensive than reindexing, and of course have only negligible potential
to block other operations. And, REINDEX certainly is a foot-gun,
lock-wise, which is why I try to avoid it.
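For reference, a minimal invocation of those lighter-weight checks looks
roughly like this (the index name is hypothetical; at the time of writing
amcheck is a separately installed extension, and it ships in contrib
starting with PostgreSQL 10):

    -- bt_index_check() verifies btree invariants while holding only an
    -- AccessShareLock on the index and the table it belongs to.
    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('t_aid1_idx');

That verifies the index's internal structure; as discussed below, it does
not by itself prove that the index agrees with the heap.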
The difference with a test that could detect this variety of
corruption is that it would need to visit the heap, which tends to
be much larger than any one index, or even all indexes. That would
probably need to be random I/O, too. It might be possible to mostly
not visit the heap, though -- I'm not sure offhand. I'd have to study
the problem in detail, which I have no time for at the moment.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
The difference with a test that could detect this variety of
corruption is that it would need to visit the heap, which tends to
be much larger than any one index, or even all indexes. That would
probably need to be random I/O, too. It might be possible to mostly
not visit the heap, though -- I'm not sure offhand. I'd have to study
the problem in detail, which I have no time for at the moment.
Unless I misunderstand this problem, the issue is that there might be some
broken HOT chains, that is, chains in which not all the heap tuples have
the same values for the index columns, and in particular the live entry at
the end of the chain doesn't agree with the index. It seems pretty
impossible to detect that by examining the index alone.
However, you might be able to find it without so much random I/O.
I'm envisioning a seqscan over the table, in which you simply look for
HOT chains in which the indexed columns aren't all the same. When you
find one, you'd have to do a pretty expensive index lookup to confirm
whether things are OK or not, but hopefully that'd be rare enough to not
be a big performance sink.
This seems like it'd be quite a different tool than amcheck, though.
Also, it would only find broken-HOT-chain corruption, which might be
a rare enough issue to not deserve a single-purpose tool.
regards, tom lane
I wrote:
However, you might be able to find it without so much random I/O.
I'm envisioning a seqscan over the table, in which you simply look for
HOT chains in which the indexed columns aren't all the same. When you
find one, you'd have to do a pretty expensive index lookup to confirm
whether things are OK or not, but hopefully that'd be rare enough to not
be a big performance sink.
Ah, nah, scratch that. If any post-index-build pruning had occurred on a
page, the evidence would be gone --- the non-matching older tuples would
be removed and what remained would look consistent. But it wouldn't
match the index. You might be able to find problems if you were willing
to do the expensive check on *every* HOT chain, but that seems none too
practical.
regards, tom lane
On Fri, Feb 17, 2017 at 9:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like it'd be quite a different tool than amcheck, though.
Also, it would only find broken-HOT-chain corruption, which might be
a rare enough issue to not deserve a single-purpose tool.
FWIW, my ambition for amcheck is that it will have checks for a large
variety of invariants that involve the heap, and related SLRU
structures such as MultiXacts. Though, that would probably necessitate
code written by other people who are subject-matter experts in areas
where I am not.
--
Peter Geoghegan
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
However, you might be able to find it without so much random I/O.
I'm envisioning a seqscan over the table, in which you simply look for
HOT chains in which the indexed columns aren't all the same. When you
find one, you'd have to do a pretty expensive index lookup to confirm
whether things are OK or not, but hopefully that'd be rare enough to not
be a big performance sink.
Ah, nah, scratch that. If any post-index-build pruning had occurred on a
page, the evidence would be gone --- the non-matching older tuples would
be removed and what remained would look consistent. But it wouldn't
match the index. You might be able to find problems if you were willing
to do the expensive check on *every* HOT chain, but that seems none too
practical.
If the goal is just to detect tuples that aren't indexed and should
be, what about scanning each index and building a TIDBitmap of all of
the index entries, and then scanning the heap for non-HOT tuples and
probing the TIDBitmap for each one? If you find a heap entry for
which you didn't find an index entry, go and recheck the index and see
if one got added concurrently. If not, whine.
One problem is that you'd need to make sure that the TIDBitmap didn't
lossify, but that could be prevented either by specifying a very large
maxbytes setting or by using something that serves the same function
instead of a TIDBitmap per se.
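Just to illustrate the shape of that at the SQL level, here is a crude
analogue (it forces plans with planner GUCs rather than building a real
TIDBitmap, and the table, column and index are made-up names) that flags
heap tuples the index never leads to:

    -- Use one snapshot for both scans so concurrent changes don't show
    -- up as false mismatches.
    BEGIN ISOLATION LEVEL REPEATABLE READ;

    -- TIDs of visible rows reached through the index on aid1 (verify
    -- with EXPLAIN that this really uses an index or bitmap scan).
    SET LOCAL enable_seqscan = off;
    CREATE TEMP TABLE via_index AS
    SELECT ctid::text AS tid FROM t WHERE aid1 IS NOT NULL;

    -- TIDs of visible rows according to a sequential scan of the heap.
    SET LOCAL enable_seqscan = on;
    SET LOCAL enable_indexscan = off;
    SET LOCAL enable_indexonlyscan = off;
    SET LOCAL enable_bitmapscan = off;
    CREATE TEMP TABLE via_heap AS
    SELECT ctid::text AS tid FROM t WHERE aid1 IS NOT NULL;

    -- Heap tuples the index never led us to: candidate missing entries.
    SELECT tid FROM via_heap EXCEPT SELECT tid FROM via_index;

    COMMIT;

(The ctid::text cast is only there because the tid type lacks the default
sort support that EXCEPT needs.)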
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ah, nah, scratch that. If any post-index-build pruning had occurred on a
page, the evidence would be gone --- the non-matching older tuples would
be removed and what remained would look consistent. But it wouldn't
match the index. You might be able to find problems if you were willing
to do the expensive check on *every* HOT chain, but that seems none too
practical.
If the goal is just to detect tuples that aren't indexed and should
be, what about scanning each index and building a TIDBitmap of all of
the index entries, and then scanning the heap for non-HOT tuples and
probing the TIDBitmap for each one? If you find a heap entry for
which you didn't find an index entry, go and recheck the index and see
if one got added concurrently. If not, whine.
This particular case of corruption results in a heap tuple getting indexed
by a wrong key (or to be precise, indexed by its old value). So the only
way to detect the corruption is to look at each index key and check if it
matches the corresponding heap tuple. We could write some kind of self-join
that uses a sequential scan on one side and an index-only scan on the other
(David Rowley had suggested something of that sort internally here), but we
can't guarantee an index-only scan on a table which is being concurrently
updated. So I'm not sure even that will catch every possible case.
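For what it's worth, a rough SQL-level sketch of that sort of cross-check
could look like the following (the table t, column aid1 and its btree index
are made-up names; whether the second query really gets an index-only scan
has to be confirmed with EXPLAIN, and per the caveat above a clean result
is not a guarantee):

    BEGIN ISOLATION LEVEL REPEATABLE READ;

    -- Heap side: per-key counts as seen by a sequential scan.
    SET LOCAL enable_indexscan = off;
    SET LOCAL enable_indexonlyscan = off;
    SET LOCAL enable_bitmapscan = off;
    CREATE TEMP TABLE heap_side AS
    SELECT aid1, count(*) AS cnt FROM t GROUP BY aid1;

    -- Index side: per-key counts as seen through the index, hoping for
    -- an index-only scan that returns the key values stored in the
    -- index itself.
    SET LOCAL enable_seqscan = off;
    SET LOCAL enable_indexscan = on;
    SET LOCAL enable_indexonlyscan = on;
    CREATE TEMP TABLE index_side AS
    SELECT aid1, count(*) AS cnt FROM t GROUP BY aid1;

    -- Keys whose counts differ between the two views deserve a look.
    SELECT aid1, h.cnt AS heap_cnt, i.cnt AS index_cnt
    FROM heap_side h
    FULL JOIN index_side i USING (aid1)
    WHERE h.cnt IS DISTINCT FROM i.cnt;

    COMMIT;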
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2017 at 3:52 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
This particular case of corruption results in a heap tuple getting indexed
by a wrong key (or to be precise, indexed by its old value). So the only way
to detect the corruption is to look at each index key and check if it
matches with the corresponding heap tuple. We could write some kind of self
join that can use a sequential scan and an index-only scan (David Rowley had
suggested something of that sort internally here), but we can't guarantee
index-only scan on a table which is being concurrently updated. So not sure
even that will catch every possible case.
Oh, so the problem isn't index entries that are altogether missing? I
guess I was confused.
You can certainly guarantee an index-only scan if you write the
validation code in C rather than using SQL. I think the issue is that
if the table is large enough that keeping a TID -> index value mapping
in a hash table is impractical, there's not going to be a really
efficient strategy for this. Ignoring the question of whether you use
the main executor for this or just roll your own code, your options
for a large table are (1) a multi-batch hash join, (2) a nested loop,
and (3) a merge join. (2) is easy to implement but will generate a
ton of random I/O if the table is not resident in RAM. (3) is most
suitable for very large tables but takes more work to code, and is
also likely to be a lot slower for small tables than a hash or
nestloop-based approach.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/19/17 5:27 AM, Robert Haas wrote:
(1) a multi-batch hash join, (2) a nested loop,
and (3) a merge join. (2) is easy to implement but will generate a
ton of random I/O if the table is not resident in RAM. (3) is most
suitable for very large tables but takes more work to code, and is
also likely to be a lot slower for small tables than a hash or
nestloop-based approach.
As I understand it, #3 is already in place for validate_index(). I think
you'd just need a different callback that checks the heap key.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)