BRIN index vacuum versus transaction snapshots
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
*** /home/kgrittn/pg/master/src/test/regress/expected/brin.out 2015-07-21 10:21:58.798619111 -0500
--- /home/kgrittn/pg/master/src/test/regress/results/brin.out 2015-07-21 14:00:25.169320059 -0500
***************
*** 405,409 ****
--- 405,410 ----
box(point(odd, even), point(thousand, twothousand))
FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
VACUUM brintest; -- force a summarization cycle in brinidx
+ ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
UPDATE brintest SET int8col = int8col * int4col;
UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
The stacktrace is:
#0 brinsummarize (index=0x7f1fe0eca6b8, heapRel=0x7f1fe0ed64f8, numSummarized=0x30e1da8, numExisting=0x30e1da8) at brin.c:1080
#1 0x00000000004683d3 in brinvacuumcleanup (fcinfo=0x7ffd41b95c98) at brin.c:731
#2 0x0000000000a69b28 in FunctionCall2Coll (flinfo=0x7ffd41b96078, collation=0, arg1=140725706121624, arg2=0) at fmgr.c:1323
#3 0x00000000004f7b60 in index_vacuum_cleanup (info=0x7ffd41b96198, stats=0x0) at indexam.c:717
#4 0x00000000006e1004 in lazy_cleanup_index (indrel=0x7f1fe0eca6b8, stats=0x0, vacrelstats=0x30e0bb0) at vacuumlazy.c:1397
#5 0x00000000006df637 in lazy_scan_heap (onerel=0x7f1fe0ed64f8, vacrelstats=0x30e0bb0, Irel=0x30e0ca8, nindexes=1, scan_all=0 '\000') at vacuumlazy.c:1108
#6 0x00000000006dda3b in lazy_vacuum_rel (onerel=0x7f1fe0ed64f8, options=1, params=0x7ffd41b96798, bstrategy=0x30d9a38) at vacuumlazy.c:244
#7 0x00000000006dc18d in vacuum_rel (relid=30220, relation=0x2f8d1a8, options=1, params=0x7ffd41b96798) at vacuum.c:1372
#8 0x00000000006db711 in vacuum (options=1, relation=0x2f8d1a8, relid=0, params=0x7ffd41b96798, va_cols=0x0, bstrategy=0x30d9a38, isTopLevel=1 '\001') at vacuum.c:293
#9 0x00000000006db31d in ExecVacuum (vacstmt=0x2f8d200, isTopLevel=1 '\001') at vacuum.c:121
#10 0x00000000008bef36 in standard_ProcessUtility (parsetree=0x2f8d200, queryString=0x2f8c788 "VACUUM brintest;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at utility.c:654
#11 0x00000000008be69e in ProcessUtility (parsetree=0x2f8d200, queryString=0x2f8c788 "VACUUM brintest;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at utility.c:335
#12 0x00000000008be1a7 in PortalRunUtility (portal=0x2f36b18, utilityStmt=0x2f8d200, isTopLevel=1 '\001', dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:1187
#13 0x00000000008bd1bd in PortalRunMulti (portal=0x2f36b18, isTopLevel=1 '\001', dest=0x2f8d588, altdest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:1318
#14 0x00000000008bc80d in PortalRun (portal=0x2f36b18, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2f8d588, altdest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:816
#15 0x00000000008b7edf in exec_simple_query (query_string=0x2f8c788 "VACUUM brintest;") at postgres.c:1104
#16 0x00000000008b720c in PostgresMain (argc=1, argv=0x2f1d450, dbname=0x2f1d2b0 "regression", username=0x2f1d290 "kgrittn") at postgres.c:4025
#17 0x000000000081ab99 in BackendRun (port=0x2f3d610) at postmaster.c:4183
#18 0x000000000081a17a in BackendStartup (port=0x2f3d610) at postmaster.c:3859
#19 0x0000000000816753 in ServerLoop () at postmaster.c:1618
#20 0x0000000000813d4a in PostmasterMain (argc=3, argv=0x2f1c460) at postmaster.c:1263
#21 0x000000000074ec36 in main (argc=3, argv=0x2f1c460) at main.c:223
Note that the function mentioned in the error message is not
anywhere in the stack trace, and that there was not any explicit
transaction started -- just a VACUUM command for the table, without
any BEGIN/COMMIT.
This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Kevin Grittner <kgrittn@ymail.com> wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
+ ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
Ah, the very next commit fixed this while I was running the tests.
Nothing to see here, folks; move along....
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 21, 2015 at 3:10 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Kevin Grittner <kgrittn@ymail.com> wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
+ ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
Ah, the very next commit fixed this while I was running the tests.
Nothing to see here, folks; move along....
Still looks broken from here.
Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Jul 21, 2015 at 3:10 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Kevin Grittner <kgrittn@ymail.com> wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
+ ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
Ah, the very next commit fixed this while I was running the tests.
Still looks broken from here.
You are right; I rushed my test to see whether Tom's patch had any
impact, and neglected to set default_transaction_isolation after
installing the new build. :-( The problem still reliably occurs,
and the error message still references a function name that was not
part of the problem.
I have added this to the "PostgreSQL 9.5 Open Items" Wiki page.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
So there's some code that's specifically intended to handle this case:
/*
* If using transaction-snapshot mode, it would be possible
* for another transaction to insert a tuple that's not
* visible to our snapshot if we have already acquired one,
* when in snapshot-isolation mode; therefore, disallow this
* from running in such a transaction unless a snapshot hasn't
* been acquired yet.
*
* This code is called by VACUUM and
* brin_summarize_new_values. Have the error message mention
* the latter because VACUUM cannot run in a transaction and
* thus cannot cause this issue.
*/
if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
However, this comment is full of it because a snapshot is obtained even
for VACUUM. So the fact that brin_summarize_new_values() is mentioned
as being in trouble is just a very minor issue: fixing that would just
be a matter of passing a flag down from the caller into
summarize_range() to indicate whether it's vacuum or the function. The
real problem is that VACUUM should be allowed to run without error.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
I am tempted to say that we should just disallow running vacuum on a
table containing a BRIN index in a transaction-snapshot transaction.
It is possible to silence the problem by checking for vacuum flags, but
(without testing) I think there will be a problem because the snapshot
is acquired too early and it is possible for concurrent transactions to
insert tuples in the table that the summarizing scan will not see, which
will cause the index to become effectively corrupt.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
I am tempted to say that we should just disallow running vacuum on a
table containing a BRIN index in a transaction-snapshot transaction.
Huh? We don't allow vacuum inside a (user started) transaction now.
What new restriction are you proposing?
regards, tom lane
On 30 July 2015 at 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
I am tempted to say that we should just disallow running vacuum on a
table containing a BRIN index in a transaction-snapshot transaction.
Huh? We don't allow vacuum inside a (user started) transaction now.
What new restriction are you proposing?
Forgive me, but I believe there is a confusion here. Nobody is changing
VACUUM.
The code comment mentioned as "full of it" by Kevin appears to be accurate
and appropriate.
The code is called by VACUUM and brin_summarize_new_values(). It can't be
VACUUM, as you say and as the comment already says, so it must be the
function brin_summarize_new_values().
The test assumes that the default_transaction_isolation is read committed
and so the test fails at other levels. If anything, it is the test that is
at fault.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
I am tempted to say that we should just disallow running vacuum on
a table containing a BRIN index in a transaction-snapshot
transaction.
You do realize that the VACUUM command was not run inside an
explicit transaction, right? The only change was that in the
postgresql.conf file I set default_transaction_isolation =
'repeatable read'. That was my first step before confirming that
nothing was broken for 'serializable'; but if it doesn't pass in
'repeatable read' it's not going to pass in serializable. Anyway,
for those using serializable transactions to handle race conditions
so that they don't need to use explicit LOCK TABLE statements or
SELECT FOR UPDATE clauses, setting that as the default in
postgresql.conf is standard procedure. Are you suggesting that
those users should need to run VACUUM commands bracketed with
adjustments to that GUC, like this?
set default_transaction_isolation = 'read committed';
vacuum [...];
reset default_transaction_isolation;
What about autovacuum?
It is possible to silence the problem by checking for vacuum
flags, but (without testing) I think there will be a problem
because the snapshot is acquired too early and it is possible for
concurrent transactions to insert tuples in the table that the
summarizing scan will not see, which will cause the index to
become effectively corrupt.
If that's true, it sounds like vacuum of BRIN indexes might be
broken in a pretty fundamental way. Shouldn't it be using a
snapshot that lets it see everything past the global xmin
threshold, and let the *users* of the index ignore those tuples
that are not visible to *them*? From what you're saying here it
sounds like the BRIN index is failing to include in its bounds any
tuples not visible to some arbitrary snapshot?
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> wrote:
On 30 July 2015 at 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Kevin Grittner wrote:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:
I am tempted to say that we should just disallow running vacuum
on a table containing a BRIN index in a transaction-snapshot
transaction.
Huh? We don't allow vacuum inside a (user started) transaction
now. What new restriction are you proposing?
Forgive me, but I believe there is a confusion here. Nobody is
changing VACUUM.
The code comment mentioned as "full of it" by Kevin appears to be
accurate and appropriate.
The code is called by VACUUM and brin_summarize_new_values(). It
can't be VACUUM,
Reread the thread. The error is being hit on a VACUUM command
(which is not, of course in any explicit transaction).
as you say and as the comment already says, so it must be the
function brin_summarize_new_values().
It is, in fact, not brin_summarize_new_values() -- as shown by the
stack trace when the error is thrown. This is why it is such a bad
idea to have any function assume that it knows everywhere it is
being called from, and under what conditions each caller will call
it. (That's distinct from a function having a documented API about
invariants which must be true when calling the function.) A simple
thinko like forgetting that even a VACUUM command starts a
transaction if it is not already in an explicit transaction can
cause confusing problems when the function makes a bad guess at who
"must have" called it. Let the function name itself if it is going
to name something (in a user-facing error message) that is only
visible to those with access to the source code. Better yet, write
an error message that will mean something to those users who don't
have the source code handy.
The test assumes that the default_transaction_isolation is read
committed and so the test fails at other levels. If anything, it
is the test that is at fault.
Yes, the test being wrong is a bigger problem than an arcane and
misleading error message when it causes a failure.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31 July 2015 at 15:48, Kevin Grittner <kgrittn@ymail.com> wrote:
Reread the thread.
I stand corrected.
The error is being hit on a VACUUM command
(which is not, of course in any explicit transaction).
Seems like VACUUM only ever needs a read-committed snapshot. Whether it
takes a read-committed or a repeatable-read snapshot makes no difference.
Does brin_summarize_new_values() need to track what it reads for
serializable transactions?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Let me explain a bit more what's the mechanism at play.
When scanning a block range to summarize it, we use an MVCC snapshot.
Any tuples that are inserted by transactions that appear as in-progress
to that snapshot will not be seen by that range scan; therefore we need
another mechanism to include their values in the summary tuple.
This mechanism is a "placeholder tuple." We insert this tuple into the
index; any process that tries to insert into the range will update the
placeholder tuple. All insertions that are not visible according to the
MVCC scan must update the placeholder tuple.
So there are two events that are time-critical regarding each other: the
insertion of the placeholder tuple, and the acquisition of the snapshot
for the MVCC scan. All transactions in progress for the snapshot *must*
see the placeholder tuple; therefore, we make sure we insert the
placeholder tuple first, and acquire the snapshot later.
This is all regardless of a transaction being a user-declared
transaction block, or an implicit transaction opened by some command;
regardless of vacuum being user-invoked or autovacuum. The description
of the above is critical for correctness in all those cases.
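The ordering argument above can be condensed into a toy model. This is not backend code: `captured()`, the integer timestamps, and the simplifying assumption that the inserting transaction commits immediately are all invented for illustration.

```python
def captured(insert_t, placeholder_t, snapshot_t):
    """A concurrent insert reaches the summary if either the range scan sees
    it (it committed at or before the snapshot), or the inserter found the
    placeholder tuple already in place and updated it."""
    return insert_t <= snapshot_t or insert_t >= placeholder_t

# Correct order: placeholder inserted at t=1, snapshot acquired at t=2.
# Every insert time is covered: <= 2 is seen by the scan, >= 1 updates
# the placeholder, and the two intervals overlap.
assert all(captured(t, placeholder_t=1, snapshot_t=2) for t in range(10))

# Wrong order: snapshot at t=1, placeholder at t=2. An insert in the gap
# between them is invisible to the scan and never saw the placeholder.
assert not captured(1.5, placeholder_t=2, snapshot_t=1)
```

The key property is simply that the placeholder timestamp precedes the snapshot timestamp, so the "seen by scan" and "updated placeholder" windows leave no gap between them.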
I assumed (wrongly) that vacuum would not acquire a snapshot. (For one
thing, we already rely for correctness on lazy vacuum not holding xmin
back, so this is not entirely without sense.) The problem is that
PortalRunUtility does it unconditionally. This is not a problem in
read-committed mode, because we simply freshen the snapshot each time
and all is well. In transaction-snapshot mode, taking another snapshot
is a no-op and thus the whole thing is broken. The code is right to
raise an error. (I don't really give a damn whether the error message
is 100% correct or not. Fixing that part is trivial.)
Even assuming that PortalRunUtility would not grab a snapshot at start,
things would still be broken if vacuum processed more than one range in
transaction-snapshot isolation mode: insert placeholder tuple, grab
snapshot for the first range, scan range: all is well up to this point.
Then we need to process the second range: insert placeholder tuple, grab
snapshot ... which reuses the previous snapshot, older than the
placeholder tuple. Oops.
I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.
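Under the same toy-model assumptions as before (invented names, integer timestamps, inserters committing immediately), the two-range failure and the GetLatestSnapshot() fix look like this:

```python
# Toy timeline for summarizing two ranges in one VACUUM. In
# transaction-snapshot mode, GetTransactionSnapshot() keeps returning the
# snapshot taken for range 1; GetLatestSnapshot() would take a fresh one.

def captured(insert_t, placeholder_t, snapshot_t):
    return insert_t <= snapshot_t or insert_t >= placeholder_t

events = {
    "placeholder_1": 1,   # placeholder tuple for range 1
    "snapshot_1":    2,   # snapshot acquired for range 1
    "insert_r2":     3,   # concurrent insert into range 2, commits at once
    "placeholder_2": 4,   # placeholder tuple for range 2
}

# Reused transaction snapshot: the insert at t=3 is invisible to the t=2
# snapshot, and predates the t=4 placeholder, so it never reaches the
# range 2 summary. Oops.
assert not captured(events["insert_r2"], events["placeholder_2"],
                    snapshot_t=events["snapshot_1"])

# With GetLatestSnapshot(), range 2 gets a fresh snapshot (say t=5) that
# postdates the insert, so the scan picks it up.
assert captured(events["insert_r2"], events["placeholder_2"], snapshot_t=5)
```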
Untested patch attached.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
brin-snapshot.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 360b26e..9883c36 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -951,9 +951,15 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
* Execute the partial heap scan covering the heap blocks in the specified
* page range, summarizing the heap tuples in it. This scan stops just
* short of brinbuildCallback creating the new index entry.
+ *
+ * Note that it is critical we use the "latest snapshot" mode of
+ * IndexBuildHeapRangeScan here: otherwise, in transaction-snapshot
+ * isolation mode the snapshot used would be earlier than the insertion of
+ * the placeholder tuple, leading to values inserted concurrently not
+ * being considered in the summary info of the block range.
*/
state->bs_currRangeStart = heapBlk;
- IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false,
+ IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
heapBlk, state->bs_pagesPerRange,
brinbuildCallback, (void *) state);
@@ -1066,28 +1072,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
* additional precautions to avoid losing the additional
* tuples; see comments in summarize_range. Set the
* concurrent flag, which causes IndexBuildHeapRangeScan to
- * use a snapshot other than SnapshotAny, and silences
+ * use LatestSnapshot rather than SnapshotAny, and silences
* warnings emitted there.
*/
indexInfo->ii_Concurrent = true;
-
- /*
- * If using transaction-snapshot mode, it would be possible
- * for another transaction to insert a tuple that's not
- * visible to our snapshot if we have already acquired one,
- * when in snapshot-isolation mode; therefore, disallow this
- * from running in such a transaction unless a snapshot hasn't
- * been acquired yet.
- *
- * This code is called by VACUUM and
- * brin_summarize_new_values. Have the error message mention
- * the latter because VACUUM cannot run in a transaction and
- * thus cannot cause this issue.
- */
- if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
- errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
}
summarize_range(indexInfo, state, heapRel, heapBlk);
@@ -1111,7 +1099,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
/* free resources */
brinRevmapTerminate(revmap);
if (state)
+ {
terminate_brin_buildstate(state);
+ pfree(indexInfo);
+ }
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..2390e36 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation,
{
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
indexInfo, allow_sync,
+ false,
0, InvalidBlockNumber,
callback, callback_state);
}
@@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation,
* number of blocks are scanned. Scan to end-of-rel can be signalled by
* passing InvalidBlockNumber as numblocks. Note that restricting the range
* to scan cannot be done when requesting syncscan.
+ *
+ * If use_latest_snapshot, the snapshot used is not the transaction snapshot,
+ * but obtained using GetLatestSnapshot(). (Note this only affects scans
+ * marked ii_Concurrent).
*/
double
IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
+ bool use_latest_snapshot,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
@@ -2234,7 +2240,9 @@ IndexBuildHeapRangeScan(Relation heapRelation,
*/
if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ snapshot = use_latest_snapshot ?
+ GetLatestSnapshot() : GetTransactionSnapshot();
+ snapshot = RegisterSnapshot(snapshot);
OldestXmin = InvalidTransactionId; /* not used */
}
else
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 8b3b28d..4ad5ac3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -105,6 +105,7 @@ extern double IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
+ bool use_latest_snapshot,
BlockNumber start_blockno,
BlockNumber end_blockno,
IndexBuildCallback callback,
I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.
Actually I think there's another problem: if a transaction starts and
inserts a tuple into the page range, then goes to sleep, and then
another session does the summarization of the page range, session 1 is
seen as "in progress" by session 2 (so the scan does not see the new
tuple), but the placeholder tuple was not modified either because it was
inserted later than the snapshot. So the update is lost.
I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.
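The lost update and the CREATE INDEX CONCURRENTLY-style wait can be sketched in the same toy-model style (invented names; the commit times are illustrative). Here visibility depends on whether the inserting transaction has committed, not just on timestamps:

```python
def captured(insert_t, commit_t, placeholder_t, snapshot_t):
    # The scan only sees tuples whose transaction committed before the
    # snapshot; in-progress tuples (commit_t is None) are invisible.
    seen_by_scan = commit_t is not None and commit_t <= snapshot_t
    updated_placeholder = insert_t >= placeholder_t
    return seen_by_scan or updated_placeholder

# Session 1 inserts at t=1 but is still open (commit_t=None) when session 2
# inserts the placeholder (t=2) and takes even the *latest* snapshot (t=3):
# the scan misses the tuple and session 1 never saw the placeholder.
assert not captured(insert_t=1, commit_t=None, placeholder_t=2, snapshot_t=3)

# Fix: after inserting the placeholder, wait for all transactions that
# predate it to finish, as CREATE INDEX CONCURRENTLY does. By the time the
# snapshot is taken (t=5), session 1 has committed (say t=4) and is seen.
assert captured(insert_t=1, commit_t=4, placeholder_t=2, snapshot_t=5)
```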
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.
Actually I think there's another problem: if a transaction starts and
inserts a tuple into the page range, then goes to sleep, and then
another session does the summarization of the page range, session 1 is
seen as "in progress" by session 2 (so the scan does not see the new
tuple), but the placeholder tuple was not modified either because it was
inserted later than the snapshot. So the update is lost.
I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.
Please excuse my naiveté on the topic, but could you explain (or
point me to the documented explanation) of why we don't scan using
a non-MVCC snapshot and build the page range based on all non-dead
tuples? I understand that the range being scanned would need to be
locked, but we're OK with doing that for creation of other indexes.
(There is no mention of snapshots or locks in the BRIN README....)
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.
Actually I think there's another problem: if a transaction starts and
inserts a tuple into the page range, then goes to sleep, and then
another session does the summarization of the page range, session 1 is
seen as "in progress" by session 2 (so the scan does not see the new
tuple), but the placeholder tuple was not modified either because it was
inserted later than the snapshot. So the update is lost.
I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.
Please excuse my naiveté on the topic, but could you explain (or
point me to the documented explanation) of why we don't scan using
a non-MVCC snapshot and build the page range based on all non-dead
tuples?
Because I don't know of any mechanism to lock insertion into the block
range while the scan takes place. If you can suggest something
workable, that might be better than what I'm proposing.
Note that in my proposal, we would have to wait for all snapshots to go
away *for every page range*, which seems very troublesome. A better
answer might be to insert all placeholder tuples first, then wait for
concurrent snapshots to go away, then do each scan. But this is a
larger rework of code.
I understand that the range being scanned would need to be
locked, but we're OK with doing that for creation of other indexes.
That might be so, but this is not index creation; it's not acceptable to
lock the table during vacuuming. Do we have anything with finer
granularity than locking the entire table? (I don't think holding
buffer lock on all the pages in the range is acceptable.)
(There is no mention of snapshots or locks in the BRIN README....)
Um.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Untested patch attached.
That fixes the installcheck failure on my machine.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.
Actually I think there's another problem: if a transaction starts and
inserts a tuple into the page range, then goes to sleep, and then
another session does the summarization of the page range, session 1 is
seen as "in progress" by session 2 (so the scan does not see the new
tuple), but the placeholder tuple was not modified either because it was
inserted later than the snapshot. So the update is lost.
I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.
That's gonna be really slow, though, right? Even if you rework things
so that vacuum inserts all the placeholder tuples first, then waits,
then does all the summarization, that could easily turn a vacuum that
would have finished in a second into one that instead takes multiple
hours. During that time an AV worker is pinned down, and all sorts of
badness can ensue.
Maybe I'm misunderstanding, but this whole thing seems like a pretty
serious problem for BRIN. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.
That's gonna be really slow, though, right? Even if you rework things
so that vacuum inserts all the placeholder tuples first, then waits,
then does all the summarization, that could easily turn a vacuum that
would have finished in a second into one that instead takes multiple
hours. During that time an AV worker is pinned down, and all sorts of
badness can ensue.
Yeah, it is bad and I was concerned about that too. Thankfully I found
another way to solve it, which is to forgo usage of MVCC here and
instead use SnapshotAny. There's already a mode in
IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
to do what we need. I'm going to propose a patch along those lines
shortly.
Maybe I'm misunderstanding, but this whole thing seems like a pretty
serious problem for BRIN. :-(
With this new approach it shouldn't be.
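A sketch of why a SnapshotAny-style scan closes the hole, under the same toy-model assumptions as earlier (invented names, not the backend's scan code): BRIN summaries may legitimately be wider than strictly necessary, so including every non-removed tuple is safe even where an MVCC scan would have missed an in-progress insert.

```python
def mvcc_scan(tuples, snapshot_t):
    # Sees only tuples committed at or before the snapshot; an in-progress
    # insert (commit_t is None) is invisible.
    return [t for t in tuples
            if t["commit_t"] is not None and t["commit_t"] <= snapshot_t]

def snapshot_any_scan(tuples):
    # Sees every tuple that has not been removed, committed or not.
    return [t for t in tuples if not t["dead"]]

tuples = [
    {"val": 10, "commit_t": 1,    "dead": False},   # old committed tuple
    {"val": 99, "commit_t": None, "dead": False},   # in-progress insert
]

# The MVCC scan misses the in-progress tuple; SnapshotAny includes it,
# so the summary's bounds cover val=99 no matter when it commits.
assert [t["val"] for t in mvcc_scan(tuples, snapshot_t=2)] == [10]
assert [t["val"] for t in snapshot_any_scan(tuples)] == [10, 99]
```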
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
Thankfully I found
another way to solve it, which is to forgo usage of MVCC here and
instead use SnapshotAny. There's already a mode in
IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
to do what we need. I'm going to propose a patch along those lines
shortly.
Here's the promised patch. It includes a new isolation test, which fails
with the old code in two ways: with VACUUM, it raises the error
message Kevin didn't like; with brin_summarize_new_values, the
inserted tuple is not included in the summary, which is wrong. This
test leaves the pageinspect extension installed in the isolationtest
database, but that seems fine to me. (No other isolation test
installs an extension.)
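The test's check step uses pageinspect to look at the index contents directly; roughly, the setup and check look like this (a sketch; brinidx and the page number match the attached spec):

```sql
-- Inspect the BRIN index contents directly, as the s2check step does.
-- Assumes the brin_iso table and brinidx index from the spec's setup.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM brin_page_items(get_raw_page('brinidx', 2),
                              'brinidx'::regclass);
```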
The error message Kevin complained about is gone, as is some related
commentary. This is replaced by tweaks in IndexBuildHeapRangeScan that
know to do a SnapshotAny scan without being noisy about in-progress
insertions/deletions.
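For reference, the error being removed could be hit like this under the old code (a sketch; it assumes a BRIN index named brinidx, as in the regression test):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;  -- any query here acquires the transaction snapshot
SELECT brin_summarize_new_values('brinidx'::regclass);
-- old code:     ERROR:  brin_summarize_new_values() cannot run in a
--               transaction that has already obtained a snapshot
-- patched code: summarization proceeds, scanning with SnapshotAny
COMMIT;
```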
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: brin-snapshot-2.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 360b26e..de32cca 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -951,9 +951,13 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
* Execute the partial heap scan covering the heap blocks in the specified
* page range, summarizing the heap tuples in it. This scan stops just
* short of brinbuildCallback creating the new index entry.
+ *
+ * Note that it is critical we use the "any visible" mode of
+ * IndexBuildHeapRangeScan here: otherwise, we would miss tuples inserted
+ * by transactions that are still in progress, among other corner cases.
*/
state->bs_currRangeStart = heapBlk;
- IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false,
+ IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
heapBlk, state->bs_pagesPerRange,
brinbuildCallback, (void *) state);
@@ -1058,36 +1062,6 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
state = initialize_brin_buildstate(index, revmap,
pagesPerRange);
indexInfo = BuildIndexInfo(index);
-
- /*
- * We only have ShareUpdateExclusiveLock on the table, and
- * therefore other sessions may insert tuples into the range
- * we're going to scan. This is okay, because we take
- * additional precautions to avoid losing the additional
- * tuples; see comments in summarize_range. Set the
- * concurrent flag, which causes IndexBuildHeapRangeScan to
- * use a snapshot other than SnapshotAny, and silences
- * warnings emitted there.
- */
- indexInfo->ii_Concurrent = true;
-
- /*
- * If using transaction-snapshot mode, it would be possible
- * for another transaction to insert a tuple that's not
- * visible to our snapshot if we have already acquired one,
- * when in snapshot-isolation mode; therefore, disallow this
- * from running in such a transaction unless a snapshot hasn't
- * been acquired yet.
- *
- * This code is called by VACUUM and
- * brin_summarize_new_values. Have the error message mention
- * the latter because VACUUM cannot run in a transaction and
- * thus cannot cause this issue.
- */
- if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
- errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
}
summarize_range(indexInfo, state, heapRel, heapBlk);
@@ -1111,7 +1085,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
/* free resources */
brinRevmapTerminate(revmap);
if (state)
+ {
terminate_brin_buildstate(state);
+ pfree(indexInfo);
+ }
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..e59b163 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation,
{
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
indexInfo, allow_sync,
+ false,
0, InvalidBlockNumber,
callback, callback_state);
}
@@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation,
* number of blocks are scanned. Scan to end-of-rel can be signalled by
* passing InvalidBlockNumber as numblocks. Note that restricting the range
* to scan cannot be done when requesting syncscan.
+ *
+ * When "anyvisible" mode is requested, all tuples visible to any transaction
+ * are considered, including those inserted or deleted by transactions that are
+ * still in progress.
*/
double
IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
+ bool anyvisible,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
@@ -2210,6 +2216,12 @@ IndexBuildHeapRangeScan(Relation heapRelation,
indexInfo->ii_ExclusionOps != NULL);
/*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+
+ /*
* Need an EState for evaluation of index expressions and partial-index
* predicates. Also a slot to hold the current tuple.
*/
@@ -2236,6 +2248,9 @@ IndexBuildHeapRangeScan(Relation heapRelation,
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
OldestXmin = InvalidTransactionId; /* not used */
+
+ /* "any visible" mode is not compatible with this */
+ Assert(!anyvisible);
}
else
{
@@ -2364,6 +2379,17 @@ IndexBuildHeapRangeScan(Relation heapRelation,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
+ * In "anyvisible" mode, this tuple is visible and we don't
+ * need any further checks.
+ */
+ if (anyvisible)
+ {
+ indexIt = true;
+ tupleIsAlive = true;
+ break;
+ }
+
+ /*
* Since caller should hold ShareLock or better, normally
* the only way to see this is if it was inserted earlier
* in our own transaction. However, it can happen in
@@ -2409,8 +2435,16 @@ IndexBuildHeapRangeScan(Relation heapRelation,
/*
* As with INSERT_IN_PROGRESS case, this is unexpected
- * unless it's our own deletion or a system catalog.
+ * unless it's our own deletion or a system catalog;
+ * but in anyvisible mode, this tuple is visible.
*/
+ if (anyvisible)
+ {
+ indexIt = true;
+ tupleIsAlive = false;
+ break;
+ }
+
xwait = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
if (!TransactionIdIsCurrentTransactionId(xwait))
{
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 8b3b28d..a29174b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -105,6 +105,7 @@ extern double IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
+ bool anyvisible,
BlockNumber start_blockno,
BlockNumber end_blockno,
IndexBuildCallback callback,
diff --git a/src/test/isolation/expected/brin-1.out b/src/test/isolation/expected/brin-1.out
new file mode 100644
index 0000000..ddb90f4
--- /dev/null
+++ b/src/test/isolation/expected/brin-1.out
@@ -0,0 +1,39 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2check s1b s2b s1i s2summ s1c s2c s2check
+step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
+itemoffset blknum attnum allnulls hasnulls placeholder value
+
+1 0 1 f f f {1 .. 1}
+step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s2b: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1;
+?column?
+
+1
+step s1i: INSERT INTO brin_iso VALUES (1000);
+step s2summ: SELECT brin_summarize_new_values('brinidx'::regclass);
+brin_summarize_new_values
+
+1
+step s1c: COMMIT;
+step s2c: COMMIT;
+step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
+itemoffset blknum attnum allnulls hasnulls placeholder value
+
+1 0 1 f f f {1 .. 1}
+2 1 1 f f f {1 .. 1000}
+
+starting permutation: s2check s1b s1i s2vacuum s1c s2check
+step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
+itemoffset blknum attnum allnulls hasnulls placeholder value
+
+1 0 1 f f f {1 .. 1}
+step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1i: INSERT INTO brin_iso VALUES (1000);
+step s2vacuum: VACUUM brin_iso;
+step s1c: COMMIT;
+step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
+itemoffset blknum attnum allnulls hasnulls placeholder value
+
+1 0 1 f f f {1 .. 1}
+2 1 1 f f f {1 .. 1000}
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index c0ed637..5852c6e 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -36,6 +36,7 @@ test: skip-locked
test: skip-locked-2
test: skip-locked-3
test: skip-locked-4
+test: brin-1
test: drop-index-concurrently-1
test: alter-table-1
test: alter-table-2
diff --git a/src/test/isolation/specs/brin-1.spec b/src/test/isolation/specs/brin-1.spec
new file mode 100644
index 0000000..19ac18a
--- /dev/null
+++ b/src/test/isolation/specs/brin-1.spec
@@ -0,0 +1,44 @@
+# This test verifies that values inserted in transactions still in progress
+# are considered during concurrent range summarization (either using the
+# brin_summarize_new_values function or regular VACUUM).
+
+setup
+{
+ CREATE TABLE brin_iso (
+ value int
+ ) WITH (fillfactor=10);
+ CREATE INDEX brinidx ON brin_iso USING brin (value) WITH (pages_per_range=1);
+ -- this fills the first page
+ DO $$
+ DECLARE curtid tid;
+ BEGIN
+ LOOP
+ INSERT INTO brin_iso VALUES (1) RETURNING ctid INTO curtid;
+ EXIT WHEN curtid > tid '(1, 0)';
+ END LOOP;
+ END;
+ $$;
+ CREATE EXTENSION IF NOT EXISTS pageinspect;
+}
+
+teardown
+{
+ DROP TABLE brin_iso;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1i" { INSERT INTO brin_iso VALUES (1000); }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; }
+step "s2summ" { SELECT brin_summarize_new_values('brinidx'::regclass); }
+step "s2c" { COMMIT; }
+
+step "s2vacuum" { VACUUM brin_iso; }
+
+step "s2check" { SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass); }
+
+permutation "s2check" "s1b" "s2b" "s1i" "s2summ" "s1c" "s2c" "s2check"
+permutation "s2check" "s1b" "s1i" "s2vacuum" "s1c" "s2check"
Alvaro Herrera wrote:
Here's the promised patch.
Pushed to master and 9.5. Thanks for reporting!
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services