Index corruption with CREATE INDEX CONCURRENTLY
Hello All,
While stress testing WARM, I encountered a data consistency problem. It
happens when index is built concurrently. What I thought to be a WARM
induced bug and it took me significant amount of investigation to finally
conclude that it's a bug in the master. In fact, we tested all the way up
to 9.2 release and the bug exists in all releases.
In my test set up, I keep a UNIQUE index on an "aid" column and then have
many other "aid[1-4]" columns, on which indexes are built and dropped. The
script then updates a randomly selected row using any of the aid[1-4]
columns. Even these columns are updated during the tests to force WARM
updates, but within a narrow range of non-overlapping values. So we can
always find the unique row using some kind of range condition. The
reproduction case is much simpler and I'll explain that below.
In the reproduction scenario (attached), it happens fairly quickly, may be
after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it
does not happen for you even after a few rounds. Just for the record, I
haven't checked past reports to correlate if the bug explains any of the
index corruption issues we might have received.
To give a short summary of how CIC works with HOT, it consists of three
phases:
1. Create index entry in the catalog, mark it not ready for inserts, commit
the transaction, start new transaction and then wait for all potential lock
holders on the table to end
2. Take a MVCC snapshot, build the index, mark index ready for inserts,
commit and then once again wait for potential lock holders on the table to
end
3. Do a validation phase where we look at MVCC-visible tuples and add
missing entries to the index. We only look at the root line pointer of each
tuple while deciding whether a particular tuple already exists in the index
or not. IOW we look at HOT chains and not tuples themselves.
The interesting case is phase 1 and 2. The phase 2 works on the premise
that every transaction started after we finished waiting for all existing
transactions will definitely see the new index. All these new transactions
then look at the columns used by the new index and consider them while
deciding HOT-safety of updates. Now for reasons which I don't fully
understand, there exists a path where a transaction starts after CIC waited
and took its snapshot at the start of the second phase, but still fails to
see the new index. Or may be it sees the index, but fails to update its
rd_indexattr list to take into account the columns of the new index. That
leads to wrong decisions and we end up with a broken HOT chain, something
the CIC algorithm is not prepared to handle. This in turn leads the new
index missing entries for the update tuples i.e. the index may have aid1 =
10, but the value in the heap could be aid1 = 11.
Based on my investigation so far and the evidence that I've collected, what
probably happens is that after a relcache invalidation arrives at the new
transaction, it recomputes the rd_indexattr but based on the old, cached
rd_indexlist. So the rd_indexattr does not include the new columns. Later
rd_indexlist is updated, but the rd_indexattr remains what it was.
There is definitely an evidence of this sequence of events (collected after
sprinkling code with elog messages), but it's not yet clear to me why it
affects only a small percentage of transactions. One possibility is that it
probably depends on at what stage an invalidation is received and processed
by the backend. I find it a bit confusing that even though rd_indexattr is
tightly coupled to the rd_indexlist, we don't invalidate or recompute them
together. Shouldn't we do that? For example, in the attached patch we
invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
that rd_indexvalid is 0). This patch fixes the problem for me (I've only
tried master), but I am not sure if this is a correct fix.
Finally, here is the reproduction scenario. Run "cic_bug.sh" and observe
the output. The script runs a continuous UPDATE on the table, while running
DROP INDEX and CREATE INDEX CONCURRENTLY on aid1 column in a loop. As soon
as we build the index, we run a validation query such that rows are fetched
using IOS on the new index and compare them against the rows obtained via a
SeqScan on the table. If things are OK, the rows obtained via both methods
should match. But every often you'll see a row or two differ. I must say
that IOS itself is not a problem, but I'd to setup things that way to
ensure that index access does not have to go to the heap. Otherwise, it
will just return whatever is there in the heap and things will look fine.
May be some index consistency check utility could be handy to check the
corruption. But I think the query should just do the job for now.
Some additional server logs are attached too. I've filtered them to include
only three sessions. One (PID 73119) which is running CIC, second (PID
73090) which processed relcache invalidation correctly and third (PID
73091) which got it wrong. Specifically, look at line #77 where snapshot is
obtained for the second phase:
2017-01-29 10:44:31.238 UTC [73119] [xid=0]LOG: CIC (pgb_a_aid1) building
first phase with snapshot (4670891:4670893)
At line 108, you'll see a new transaction with XID 4670908 still using the
old attribute list.
2017-01-29 10:44:31.889 UTC [73091] [xid=4670908]LOG: heap_update
(cic_tab_accounts), hot_attrs ((b 9))
Other session, which I included for comparison, sees the new index and
recomputes its rd_indexattr in time and hence that update works as expected.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Hello All,
While stress testing WARM, I encountered a data consistency problem. It
happens when index is built concurrently. What I thought to be a WARM
induced bug and it took me significant amount of investigation to finally
conclude that it's a bug in the master. In fact, we tested all the way up to
9.2 release and the bug exists in all releases.In my test set up, I keep a UNIQUE index on an "aid" column and then have
many other "aid[1-4]" columns, on which indexes are built and dropped. The
script then updates a randomly selected row using any of the aid[1-4]
columns. Even these columns are updated during the tests to force WARM
updates, but within a narrow range of non-overlapping values. So we can
always find the unique row using some kind of range condition. The
reproduction case is much simpler and I'll explain that below.In the reproduction scenario (attached), it happens fairly quickly, may be
after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it
does not happen for you even after a few rounds. Just for the record, I
haven't checked past reports to correlate if the bug explains any of the
index corruption issues we might have received.To give a short summary of how CIC works with HOT, it consists of three
phases:
1. Create index entry in the catalog, mark it not ready for inserts, commit
the transaction, start new transaction and then wait for all potential lock
holders on the table to end
2. Take a MVCC snapshot, build the index, mark index ready for inserts,
commit and then once again wait for potential lock holders on the table to
end
3. Do a validation phase where we look at MVCC-visible tuples and add
missing entries to the index. We only look at the root line pointer of each
tuple while deciding whether a particular tuple already exists in the index
or not. IOW we look at HOT chains and not tuples themselves.The interesting case is phase 1 and 2. The phase 2 works on the premise that
every transaction started after we finished waiting for all existing
transactions will definitely see the new index. All these new transactions
then look at the columns used by the new index and consider them while
deciding HOT-safety of updates. Now for reasons which I don't fully
understand, there exists a path where a transaction starts after CIC waited
and took its snapshot at the start of the second phase, but still fails to
see the new index. Or may be it sees the index, but fails to update its
rd_indexattr list to take into account the columns of the new index. That
leads to wrong decisions and we end up with a broken HOT chain, something
the CIC algorithm is not prepared to handle. This in turn leads the new
index missing entries for the update tuples i.e. the index may have aid1 =
10, but the value in the heap could be aid1 = 11.Based on my investigation so far and the evidence that I've collected, what
probably happens is that after a relcache invalidation arrives at the new
transaction, it recomputes the rd_indexattr but based on the old, cached
rd_indexlist. So the rd_indexattr does not include the new columns. Later
rd_indexlist is updated, but the rd_indexattr remains what it was.There is definitely an evidence of this sequence of events (collected after
sprinkling code with elog messages), but it's not yet clear to me why it
affects only a small percentage of transactions. One possibility is that it
probably depends on at what stage an invalidation is received and processed
by the backend. I find it a bit confusing that even though rd_indexattr is
tightly coupled to the rd_indexlist, we don't invalidate or recompute them
together. Shouldn't we do that?
During invalidation processing, we do destroy them together in
function RelationDestroyRelation(). So ideally, after invalidation
processing, both should be built again, but not sure why that is not
happening in this case.
For example, in the attached patch we
invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
that rd_indexvalid is 0).
/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */
+ relation->rd_indexattr = NULL;
+
I think setting directly to NULL will leak the memory.
This patch fixes the problem for me (I've only
tried master), but I am not sure if this is a correct fix.
I think it is difficult to say that fix is correct unless there is a
clear explanation of what led to this problem. Another angle to
investigate is to find out when relation->rd_indexvalid is set to 0
for the particular session where you are seeing the problem. I could
see few places where we are setting it to 0 and clearing rd_indexlist,
but not rd_indexattr. I am not sure whether that has anything to do
with the problem you are seeing.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Based on my investigation so far and the evidence that I've collected,
what probably happens is that after a relcache invalidation arrives at the
new transaction, it recomputes the rd_indexattr but based on the old,
cached rd_indexlist. So the rd_indexattr does not include the new columns.
Later rd_indexlist is updated, but the rd_indexattr remains what it was.
I was kinda puzzled this report did not get any attention, though it's
clearly a data corruption issue. Anyways, now I know why this is happening
and can reproduce this very easily via debugger and two sessions.
The offending code is RelationGetIndexAttrBitmap(). Here is the sequence of
events:
Taking the same table as in the report:
CREATE TABLE cic_tab_accounts (
aid bigint UNIQUE,
abalance bigint,
aid1 bigint
);
1. Session S1 calls DROP INDEX pgb_a_aid1 and completes
2. Session S2 is in process of rebuilding its rd_indexattr bitmap (because
of previous relcache invalidation caused by DROP INDEX). To be premise,
assume that the session has finished rebuilding its index list. Since DROP
INDEX has just happened,
4793 indexoidlist = RelationGetIndexList(relation);
3. S1 then starts CREATE INDEX CONCURRENTLY pgb_a_aid1 ON
cic_tab_accounts(aid1)
4. S1 creates catalog entry for the new index, commits phase 1 transaction
and builds MVCC snapshot for the second phase and also finishes the initial
index build
5. S2 now proceeds. Now notice that S2 had computed the index list based on
the old view i.e. just one index.
The following comments in relcahce.c are now significant:
4799 /*
4800 * Copy the rd_pkindex and rd_replidindex value computed by
4801 * RelationGetIndexList before proceeding. This is needed because
a
4802 * relcache flush could occur inside index_open below, resetting
the
4803 * fields managed by RelationGetIndexList. (The values we're
computing
4804 * will still be valid, assuming that caller has a sufficient lock
on
4805 * the relation.)
4806 */
That's what precisely goes wrong.
6. When index_open is called, relcache invalidation generated by the first
transaction commit in CIC gets accepted and processed. As part of this
invalidation, rd_indexvalid is set to 0, which invalidates rd_indexlist too
7. But S2 is currently using index list obtained at step 2 above (which has
only old index). It goes ahead and computed rd_indexattr based on just the
old index.
8. While relcahce invalidation processing at step 6 cleared rd_indexattr
(along with rd_indexvalid), it is again set at
4906 relation->rd_indexattr = bms_copy(indexattrs);
But what gets set at step 8 is based on the old view. There is no way
rd_indexattr will be invalidated until S2 receives another relcache
invalidation. This will happen when the CIC proceeds. But until then, every
update done by S2 will generate broken HOT chains, leading to data
corruption.
I can reproduce this entire scenario using gdb sessions. This also explains
why the patch I sent earlier helps to solve the problem.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee wrote:
I can reproduce this entire scenario using gdb sessions. This also explains
why the patch I sent earlier helps to solve the problem.
Ouch. Great detective work there.
I think it's quite possible that this bug explains some index errors,
such as primary keys (or unique indexes) that when recreated fail with
duplicate values: the first value inserted would get in via an UPDATE
during the CIC race window, and sometime later the second value is
inserted correctly. Because the first one is missing from the index,
the second one does not give rise to a dupe error upon insertion, but
they are of course both detected when the index is recreated.
I'm going to study the bug a bit more, and put in some patch before the
upcoming minor tag on Monday.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 2, 2017 at 6:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
/* + * If the index list was invalidated, we better also invalidate the index + * attribute list (which should automatically invalidate other attributes + * such as primary key and replica identity) + */+ relation->rd_indexattr = NULL;
+I think setting directly to NULL will leak the memory.
Good catch, thanks. Will fix.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 2, 2017 at 10:14 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I'm going to study the bug a bit more, and put in some patch before the
upcoming minor tag on Monday.
Looking at the history and some past discussions, it seems Tomas reported
somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. Not exactly the same issue, but
related to relcache flush happening in index_open().
I wonder if we should just add a recheck logic
in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
the end, we go back and start again from RelationGetIndexList(). Or is
there any code path that relies on finding the old information?
The following comment at the top of RelationGetIndexAttrBitmap() also seems
like a problem to me. CIC works with lower strength lock and hence it can
change the set of indexes even though caller is holding the
RowExclusiveLock, no? The comment seems to suggest otherwise.
4748 * Caller had better hold at least RowExclusiveLock on the target
relation
4749 * to ensure that it has a stable set of indexes. This also makes it
safe
4750 * (deadlock-free) for us to take locks on the relation's indexes.
I've attached a revised patch based on these thoughts. It looks a bit more
invasive than the earlier one-liner, but looks better to me.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
invalidate_indexattr_v2.patchapplication/octet-stream; name=invalidate_indexattr_v2.patchDownload+36-8
Pavan Deolasee wrote:
Looking at the history and some past discussions, it seems Tomas reported
somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. Not exactly the same issue, but
related to relcache flush happening in index_open().I wonder if we should just add a recheck logic
in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
the end, we go back and start again from RelationGetIndexList(). Or is
there any code path that relies on finding the old information?
Good find on that old report. It occured to me to re-run Tomas' test
case with this second patch you proposed (the "ddl" test takes 11
minutes in my laptop), and lo and behold -- your "recheck" code enters
an infinite loop. Not good. I tried some more tricks that came to
mind, but I didn't get any good results. Pavan and I discussed it
further and he came up with the idea of returning the bitmapset we
compute, but if an invalidation occurs in index_open, simply do not
cache the bitmapsets so that next time this is called they are all
computed again. Patch attached.
This appears to have appeased both test_decoding's ddl test under
CACHE_CLOBBER_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug.
I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Here is a detailed reproducer for the CREATE INDEX CONCURRENTLY bug.
Line numbers are as of today's master; comments are supposed to help
locating good spots in other branches. Given the use of gdb
breakpoints, there's no good way to integrate this with regression
tests, which is a pity because this stuff looks a little brittle to me,
and if it breaks it is hard to detect.
Prep:
DROP TABLE IF EXISTS testtab;
CREATE TABLE testtab (a int unique, b int, c int);
INSERT INTO testtab VALUES (1, 100, 10);
CREATE INDEX testindx ON testtab (b);
S1:
SELECT * FROM testtab;
UPDATE testtab SET b = b + 1 WHERE a = 1;
SELECT pg_backend_pid();
attach GDB to this session.
break relcache.c:4800
# right before entering the per-index loop
cont
S2:
SELECT pg_backend_pid();
DROP INDEX testindx;
attach GDB to this session
handle SIGUSR1 nostop
break indexcmds.c:666
# right after index_create
break indexcmds.c:790
# right before the second CommitTransactionCommand
cont
CREATE INDEX CONCURRENTLY testindx ON testtab (b);
-- this blocks in breakpoint 1. Leave it there.
S1:
UPDATE testtab SET b = b + 1 WHERE a = 1;
-- this blocks in breakpoint 1. Leave it there.
S2:
gdb -> cont
-- This blocks waiting for S1
S1:
gdb -> cont
-- S1 finishes the update. S2 awakens and blocks again in breakpoint 2.
S1:
-- Now run more UPDATEs:
UPDATE testtab SET b = b + 1 WHERE a = 1;
This produces a broken HOT chain.
This can be done several times; all these updates should be non-HOT
because of the index created by CIC, but they are marked HOT.
S2: "cont"
From this point onwards, update are correct.
SELECT * FROM testtab;
SET enable_seqscan TO 0;
SET enable_bitmapscan TO 0;
SELECT * FROM testtab WHERE b between 100 and 110;
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
invalidate_indexattr_v3.patchtext/plain; charset=us-asciiDownload+23-15
On Sat, Feb 4, 2017 at 12:12 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Pavan Deolasee wrote:
Looking at the history and some past discussions, it seems Tomas reported
somewhat similar problem and Andres proposed a patch here
/messages/by-id/20140514155204.GE23943@awork2.anarazel.de
which got committed via b23b0f5588d2e2. Not exactly the same issue, but
related to relcache flush happening in index_open().I wonder if we should just add a recheck logic
in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
the end, we go back and start again from RelationGetIndexList(). Or is
there any code path that relies on finding the old information?Good find on that old report. It occured to me to re-run Tomas' test
case with this second patch you proposed (the "ddl" test takes 11
minutes in my laptop), and lo and behold -- your "recheck" code enters
an infinite loop. Not good. I tried some more tricks that came to
mind, but I didn't get any good results. Pavan and I discussed it
further and he came up with the idea of returning the bitmapset we
compute, but if an invalidation occurs in index_open, simply do not
cache the bitmapsets so that next time this is called they are all
computed again. Patch attached.This appears to have appeased both test_decoding's ddl test under
CACHE_CLOBBER_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug.I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Thanks for detailed explanation, using steps mentioned in your mail
and Pavan's previous mail steps, I could reproduce the problem.
Regarding your fix:
+ * Now save copies of the bitmaps in the relcache entry, but only if no
+ * invalidation occured in the meantime. We intentionally set rd_indexattr
+ * last, because that's the one that signals validity of the values; if we
+ * run out of memory before making that copy, we won't leave the relcache
+ * entry looking like the other ones are valid but empty.
*/
- oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
- relation->rd_keyattr = bms_copy(uindexattrs);
- relation->rd_pkattr = bms_copy(pkindexattrs);
- relation->rd_idattr = bms_copy(idindexattrs);
- relation->rd_indexattr = bms_copy(indexattrs);
- MemoryContextSwitchTo(oldcxt);
+ if (relation->rd_indexvalid != 0)
+ {
+ MemoryContext oldcxt;
+
+ oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+ relation->rd_keyattr = bms_copy(uindexattrs);
+ relation->rd_pkattr = bms_copy(pkindexattrs);
+ relation->rd_idattr = bms_copy(idindexattrs);
+ relation->rd_indexattr = bms_copy(indexattrs);
+ MemoryContextSwitchTo(oldcxt);
+ }
If we do above, then I think primary key attrs won't be returned
because for those we are using relation copy rather than an original
working copy of attrs. See code below:
switch (attrKind)
{
..
case INDEX_ATTR_BITMAP_PRIMARY_KEY:
return bms_copy(relation->rd_pkattr);
Apart from above, I think after the proposed patch, it will be quite
possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
computed using different index lists (consider relcache flush happens
after computation of hot_attrs or key_attrs) which was previously not
possible. I think in the later code we assume that all the three
should be computed using the same indexlist. For ex. see the below
code:
HeapSatisfiesHOTandKeyUpdate()
{
..
Assert(bms_is_subset(id_attrs, key_attrs));
Assert(bms_is_subset(key_attrs, hot_attrs));
..
}
Also below comments in heap_update indicate that we shouldn't worry
about relcache flush between the computation of hot_attrs, key_attrs
and id_attrs.
heap_update()
{
..
* Note that we get a copy here, so we need not worry about relcache flush
* happening midway through.
*/
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation,
INDEX_ATTR_BITMAP_IDENTITY_KEY);
..
}
Typo.
/occured/occurred
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
If we do above, then I think primary key attrs won't be returned
because for those we are using relation copy rather than an original
working copy of attrs. See code below:switch (attrKind)
{
..
case INDEX_ATTR_BITMAP_PRIMARY_KEY:
return bms_copy(relation->rd_pkattr);
I don't think that would a problem since if relation_rd_indexattr is NULL,
primary key attrs will be recomputed and returned.
Apart from above, I think after the proposed patch, it will be quite
possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
computed using different index lists (consider relcache flush happens
after computation of hot_attrs or key_attrs) which was previously not
possible. I think in the later code we assume that all the three
should be computed using the same indexlist. For ex. see the below
code:
This seems like a real problem to me. I don't know what the consequences
are, but definitely having various attribute lists to have different view
of the set of indexes doesn't seem right.
The problem we are trying to fix is very clear by now. Relcache flush
clears rd_indexvalid and rd_indexattr together, but because of the race,
rd_indexattr is recomputed with the old information and gets cached again,
while rd_indexvalid (and rd_indexlist) remains unset. That leads to the
index corruption in CIC and may cause other issues too that we're not aware
right now. We want cached attribute lists to become invalid whenever index
list is invalidated and that's not happening right now.
So we have tried three approaches so far.
1. Simply reset rd_indexattr whenever we detect rd_indexvalid has been
reset. This was the first patch. It's a very simple patch and has worked
for me with CACHE_CLOBBER_ALWAYS and even fixes the CIC bug. The only
downside it seems that we're invalidating cached attribute lists outside
the normal relcache invalidation path. It also works on the premise that
RelationGetIndexList() will be called every time
before RelationGetIndexAttrBitmap() is called, otherwise we could still end
up using stale cached copies. I wonder if cached plans can be a problem
here.
2. In the second patch, we tried to recompute attribute lists if a relcache
flush happens in between and index list is invalidated. We've seen problems
with that, especially it getting into an infinite loop with
CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
3. We tried the third approach where we don't cache attribute lists if know
that index set has changed and hope that the next caller gets the correct
info. But Amit rightly identified problems with that approach too.
So we either need to find a 4th approach or stay with the first patch
unless we see a problem there too (cached plans, anything else). Or may be
fix CREATE INDEX CONCURRENTLY so that at least the first phase which add
the index entry to the catalog happens with a strong lock. This will lead
to momentary blocking of write queries, but protect us from the race. I'm
assuming this is the only code that can cause the problem, and I haven't
checked other potential hazards.
Ideas/suggestions?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for another three months. That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline. We do not have time to recover from any mistakes.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for another three months.
Yes, I think bug existed ever since and went unnoticed. One reason could be
that the race happens only when the new index turns HOT updates into
non-HOT updates. Another reason could be that we don't have checks in place
to catch these kinds of corruption. Having said that, since we have
discovered the bug, at least many 2ndQuadrant customers have expressed
worry and want to know if the fix will be available in 9.6.2 and other
minor releases. Since the bug can lead to data corruption, the worry is
justified. Until we fix the bug, there will be a constant worry about using
CIC.
If we can have some kind of band-aid fix to plug in the hole, that might be
enough as well. I tested my first patch (which will need some polishing)
and that works well AFAIK. I was worried about prepared queries and all,
but that seems ok too. RelationGetIndexList() always get called within
ExecInitModifyTable. The fix seems quite unlikely to cause any other side
effects.
Another possible band-aid is to throw another relcache invalidation in CIC.
Like adding a dummy index_set_state_flags() within yet another transaction.
Seems ugly, but should fix the problem for now and not cause any impact on
relcache mechanism whatsoever.
That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline. We do not have time to recover from any mistakes.
I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2017-02-05 7:54 GMT+01:00 Pavan Deolasee <pavan.deolasee@gmail.com>:
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for another three months.Yes, I think bug existed ever since and went unnoticed. One reason could
be that the race happens only when the new index turns HOT updates into
non-HOT updates. Another reason could be that we don't have checks in place
to catch these kinds of corruption. Having said that, since we have
discovered the bug, at least many 2ndQuadrant customers have expressed
worry and want to know if the fix will be available in 9.6.2 and other
minor releases. Since the bug can lead to data corruption, the worry is
justified. Until we fix the bug, there will be a constant worry about using
CIC.If we can have some kind of band-aid fix to plug in the hole, that might
be enough as well. I tested my first patch (which will need some polishing)
and that works well AFAIK. I was worried about prepared queries and all,
but that seems ok too. RelationGetIndexList() always get called within
ExecInitModifyTable. The fix seems quite unlikely to cause any other side
effects.Another possible band-aid is to throw another relcache invalidation in
CIC. Like adding a dummy index_set_state_flags() within yet another
transaction. Seems ugly, but should fix the problem for now and not cause
any impact on relcache mechanism whatsoever.That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline. We do not have time to recover from any mistakes.I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?
I agree with Pavan - a release with known important bug is not good idea.
Pavel
Show quoted text
Thanks,
Pavan--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-02-05 18:51 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:
I agree with Pavan - a release with known important bug is not good
idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
There are a questions
1. is it critical bug?
2. if is it, will be reason for special minor release?
3. how much time is necessary for fixing of this bug?
These questions can be unimportant if there is some security fixes in next
release, what I cannot to know.
Regards
Pavel
Show quoted text
regards, tom lane
El 05/02/17 a las 10:03, Michael Paquier escribió:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.
IMO any kind of corruption (heap or index) should be prioritized.
There's also been comments about maybe this being the cause of old
reports about index corruption.
I ask myself if it's a good idea to make a point release with a know
corruption bug in it.
Regards,
--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[ Having now read the whole thread, I'm prepared to weigh in ... ]
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
This seems like a real problem to me. I don't know what the consequences
are, but definitely having various attribute lists to have different view
of the set of indexes doesn't seem right.
For sure.
2. In the second patch, we tried to recompute attribute lists if a relcache
flush happens in between and index list is invalidated. We've seen problems
with that, especially it getting into an infinite loop with
CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
flushes keep happening.
Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could. The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
I do not like any of the other patches proposed in this thread, because
they fail to guarantee delivering an up-to-date attribute bitmap to the
caller. I think we need a retry loop, and I think that it needs to look
like the attached.
(Note that the tests whether rd_pkindex and rd_replidindex changed might
be redundant; but I'm not totally sure of that, and they're cheap.)
regards, tom lane
Attachments:
invalidate_indexattr_v4.patchtext/x-diff; charset=us-ascii; name=invalidate_indexattr_v4.patchDownload+49-47
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués <martin@2ndquadrant.com> wrote:
El 05/02/17 a las 10:03, Michael Paquier escribió:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.IMO any kind of corruption (heap or index) should be prioritized.
There's also been comments about maybe this being the cause of old
reports about index corruption.I ask myself if it's a good idea to make a point release with a know
corruption bug in it.
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
Now, I do not object to patching this tomorrow before the wrap if the
various senior hackers who have studied this (Tom, Alvaro, Pavan,
Amit) are all in agreement on the way forward. But if there is any
significant chance that we don't fully understand this issue and that
our fix might cause bigger problems, it would be more prudent to wait.
I'm all in favor of releasing important bug fixes quickly, but
releasing a bug fix before we're sure we've fixed the bug correctly is
taking a good principle to an irrational extreme.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad. However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now. Indeed, the discovery of
this bug was driven by new feature development, not a user report. It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.
+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle? I have to think the answer for
this one is "probably not".
+1. I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there's enough other fixes in the release
imo.
- Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers