pgsql: Fix BRIN to use SnapshotAny during summarization

Started by Alvaro Herreraover 10 years ago10 messages
#1Alvaro Herrera
alvherre@alvh.no-ip.org

Fix BRIN to use SnapshotAny during summarization

For correctness of summarization results, it is critical that the
snapshot used during the summarization scan is able to see all tuples
that are live to all transactions -- including tuples inserted or
deleted by in-progress transactions. Otherwise, it would be possible
for a transaction to insert a tuple, then idle for a long time while a
concurrent transaction executes summarization of the range: this would
result in the inserted value not being considered in the summary.
Previously we were trying to use a MVCC snapshot in conjunction with
adding a "placeholder" tuple in the index: the snapshot would see all
committed tuples, and the placeholder tuple would catch insertions by
any new inserters. The hole is that prior insertions by transactions
that are still in progress by the time the MVCC snapshot was taken were
ignored.

Kevin Grittner reported this as a bogus error message during vacuum with
default transaction isolation mode set to repeatable read (because the
error report mentioned a function name not being invoked during), but
the problem is larger than that.

To fix, tweak IndexBuildHeapRangeScan to have a new mode that behaves
the way we need using SnapshotAny visibility rules. This change
simplifies the BRIN code a bit, mainly by removing large comments that
were mistaken. Instead, rely on the SnapshotAny semantics to provide
what it needs. (The business about a placeholder tuple needs to remain:
that covers the case that a transaction inserts a a tuple in a page that
summarization already scanned.)

Discussion: /messages/by-id/20150731175700.GX2441@postgresql.org

In passing, remove a couple of unused declarations from brin.h and
reword a comment to be proper English. This part submitted by Kevin
Grittner.

Backpatch to 9.5, where BRIN was introduced.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/2834855cb9fde734ce12f59694522c10bf0c0205

Modified Files
--------------
src/backend/access/brin/brin.c | 41 +++++++----------------------
src/backend/catalog/index.c | 36 +++++++++++++++++++++++++-
src/include/access/brin.h | 2 --
src/include/catalog/index.h | 1 +
src/test/isolation/expected/brin-1.out | 39 ++++++++++++++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
src/test/isolation/specs/brin-1.spec | 44 ++++++++++++++++++++++++++++++++
7 files changed, 129 insertions(+), 35 deletions(-)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#1)
Re: pgsql: Fix BRIN to use SnapshotAny during summarization

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Fix BRIN to use SnapshotAny during summarization

This patch added an isolation test that fails unless contrib/pageinspect
has been built and installed. I do not find that acceptable. It causes
"make check-world" to fail ... and no, installing the extension during
make check-world isn't going to make me happier.

I don't really think we need this isolation test at all, but if we do,
please fix it to not rely on any extensions. Perhaps looking at
pg_relation_size or some such would do? Or you could just issue
a query that should use the index, and see if it finds the rows it
ought to.

regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: pgsql: Fix BRIN to use SnapshotAny during summarization

... btw, I believe the reason for the failure here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&amp;dt=2015-08-06%2011%3A52%3A17

is that brin_page_items() is unsafe. It's accessing
state->bdesc->bd_tupdesc, which is pointing at the tupdesc of an index
Relation that it no longer has open; not only directly, but via
brin_deform_tuple(). All you need is a relcache flush to be accessing
garbage. I haven't dug down thoroughly, but I'd not be surprised if there
were also some dereferences of bd_index, which is equally a dangling
pointer.

regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#4Christoph Berg
myon@debian.org
In reply to: Tom Lane (#2)
Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Re: Tom Lane 2015-08-07 <928.1438900846@sss.pgh.pa.us>

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Fix BRIN to use SnapshotAny during summarization

This patch added an isolation test that fails unless contrib/pageinspect
has been built and installed. I do not find that acceptable. It causes
"make check-world" to fail ... and no, installing the extension during
make check-world isn't going to make me happier.

I don't really think we need this isolation test at all, but if we do,
please fix it to not rely on any extensions. Perhaps looking at
pg_relation_size or some such would do? Or you could just issue
a query that should use the index, and see if it finds the rows it
ought to.

Hi,

this breaks the Debian package builds as well because we run
check-world as a build step.

Any chance for a fix/workaround so the nightly master/head builds will
succeed again?

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christoph Berg (#4)
Re: Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Christoph Berg <myon@debian.org> writes:

Re: Tom Lane 2015-08-07 <928.1438900846@sss.pgh.pa.us>

I don't really think we need this isolation test at all, but if we do,
please fix it to not rely on any extensions. Perhaps looking at
pg_relation_size or some such would do? Or you could just issue
a query that should use the index, and see if it finds the rows it
ought to.

this breaks the Debian package builds as well because we run
check-world as a build step.
Any chance for a fix/workaround so the nightly master/head builds will
succeed again?

I was waiting for Alvaro to deal with this, but perhaps he's on summer
vacation or something. I will remove the isolation test until he has
time to address it more fully.

However, we did learn something valuable from the fact that all the
-DCLOBBER_CACHE_ALWAYS critters failed on it: per my earlier message,
brin_page_items() is unsafe against a relcache flush on the index.
I'll put that on the 9.5 open items list.

(If I were tasked with fixing it, I'd be tempted to rewrite it to do
all the work in one call and return a tuplestore; the alternative
seems to be to try to keep the index open across multiple calls,
which would be a mess.)

regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#6Christoph Berg
myon@debian.org
In reply to: Tom Lane (#5)
Re: Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Re: Tom Lane 2015-08-10 <2713.1439215712@sss.pgh.pa.us>

I was waiting for Alvaro to deal with this, but perhaps he's on summer
vacation or something. I will remove the isolation test until he has
time to address it more fully.

Thanks, the build worked now.

Wouldn't a possible fix be to promote this/some module to core? It
doesn't have any SQL- or OS-side dependencies.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christoph Berg (#6)
Re: Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Christoph Berg <myon@debian.org> writes:

Thanks, the build worked now.

Wouldn't a possible fix be to promote this/some module to core? It
doesn't have any SQL- or OS-side dependencies.

Meh. I think we'd agreed in another nearby thread that pageinspect
is exactly the sort of thing we don't want in core. Even if it were
in core, I'm dubious that it's a good way to implement the desired
testing here: the risks of platform-dependent results (endianness,
varying numbers of tuples per page, etc etc) seem far too high.

regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#8Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

On 2015-08-10 11:19:56 -0400, Tom Lane wrote:

Meh. I think we'd agreed in another nearby thread that pageinspect
is exactly the sort of thing we don't want in core.

I think we agreed that it shouldn't be included in an initdb'ed
database, but that doesn't preclude pageinspect being an extension in
src/extension or something.

Even if it were
in core, I'm dubious that it's a good way to implement the desired
testing here: the risks of platform-dependent results (endianness,
varying numbers of tuples per page, etc etc) seem far too high.

That on the other hand I can agree with being a danger.

Greetings,

Andres Freund

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: [COMMITTERS] Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Tom Lane wrote:

However, we did learn something valuable from the fact that all the
-DCLOBBER_CACHE_ALWAYS critters failed on it: per my earlier message,
brin_page_items() is unsafe against a relcache flush on the index.
I'll put that on the 9.5 open items list.

(If I were tasked with fixing it, I'd be tempted to rewrite it to do
all the work in one call and return a tuplestore; the alternative
seems to be to try to keep the index open across multiple calls,
which would be a mess.)

Here's a patch doing that.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

brin-page-items.patchtext/x-diff; charset=us-asciiDownload
commit 41eba17795e9cf9b0c23e05def5d33b0dbdd4807
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Wed Aug 12 15:41:15 2015 -0300
CommitDate: Wed Aug 12 15:41:15 2015 -0300

    fix brin_page_items to use a tuplestore

diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index 7adcfa8..f695b18 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -37,18 +37,6 @@ typedef struct brin_column_state
 	FmgrInfo	outputFn[FLEXIBLE_ARRAY_MEMBER];
 } brin_column_state;
 
-typedef struct brin_page_state
-{
-	BrinDesc   *bdesc;
-	Page		page;
-	OffsetNumber offset;
-	bool		unusedItem;
-	bool		done;
-	AttrNumber	attno;
-	BrinMemTuple *dtup;
-	brin_column_state *columns[FLEXIBLE_ARRAY_MEMBER];
-} brin_page_state;
-
 
 static Page verify_brin_page(bytea *raw_page, uint16 type,
 				 const char *strtype);
@@ -119,89 +107,90 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
 Datum
 brin_page_items(PG_FUNCTION_ARGS)
 {
-	brin_page_state *state;
-	FuncCallContext *fctx;
+	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
+	Oid			indexRelid = PG_GETARG_OID(1);
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	MemoryContext oldcontext;
+	Tuplestorestate *tupstore;
+	Relation	indexRel;
+	Page		page;
+	OffsetNumber offset;
+	AttrNumber	attno;
+	bool		unusedItem;
+	BrinDesc   *bdesc;
+	BrinMemTuple *dtup;
+	brin_column_state **columns;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to use raw page functions"))));
 
-	if (SRF_IS_FIRSTCALL())
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize) ||
+		rsinfo->expectedDesc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/* Build tuplestore to hold the result rows */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	indexRel = index_open(indexRelid, AccessShareLock);
+
+	/* minimally verify the page we got */
+	page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
+
+	bdesc = brin_build_desc(indexRel);
+	offset = FirstOffsetNumber;
+	unusedItem = false;
+	dtup = NULL;
+
+	/*
+	 * Initialize output functions for all indexed datatypes; simplifies
+	 * calling them later.
+	 */
+	columns = palloc(sizeof(brin_column_state *) * RelationGetDescr(indexRel)->natts);
+	for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++)
 	{
-		bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
-		Oid			indexRelid = PG_GETARG_OID(1);
-		Page		page;
-		TupleDesc	tupdesc;
-		MemoryContext mctx;
-		Relation	indexRel;
-		AttrNumber	attno;
+		Oid			output;
+		bool		isVarlena;
+		BrinOpcInfo *opcinfo;
+		int			i;
+		brin_column_state *column;
 
-		/* minimally verify the page we got */
-		page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
+		opcinfo = bdesc->bd_info[attno - 1];
+		column = palloc(offsetof(brin_column_state, outputFn) +
+						sizeof(FmgrInfo) * opcinfo->oi_nstored);
 
-		/* create a function context for cross-call persistence */
-		fctx = SRF_FIRSTCALL_INIT();
-
-		/* switch to memory context appropriate for multiple function calls */
-		mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
-
-		/* Build a tuple descriptor for our result type */
-		if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
-			elog(ERROR, "return type must be a row type");
-
-		indexRel = index_open(indexRelid, AccessShareLock);
-
-		state = palloc(offsetof(brin_page_state, columns) +
-			  sizeof(brin_column_state) * RelationGetDescr(indexRel)->natts);
-
-		state->bdesc = brin_build_desc(indexRel);
-		state->page = page;
-		state->offset = FirstOffsetNumber;
-		state->unusedItem = false;
-		state->done = false;
-		state->dtup = NULL;
-
-		/*
-		 * Initialize output functions for all indexed datatypes; simplifies
-		 * calling them later.
-		 */
-		for (attno = 1; attno <= state->bdesc->bd_tupdesc->natts; attno++)
+		column->nstored = opcinfo->oi_nstored;
+		for (i = 0; i < opcinfo->oi_nstored; i++)
 		{
-			Oid			output;
-			bool		isVarlena;
-			BrinOpcInfo *opcinfo;
-			int			i;
-			brin_column_state *column;
-
-			opcinfo = state->bdesc->bd_info[attno - 1];
-			column = palloc(offsetof(brin_column_state, outputFn) +
-							sizeof(FmgrInfo) * opcinfo->oi_nstored);
-
-			column->nstored = opcinfo->oi_nstored;
-			for (i = 0; i < opcinfo->oi_nstored; i++)
-			{
-				getTypeOutputInfo(opcinfo->oi_typcache[i]->type_id, &output, &isVarlena);
-				fmgr_info(output, &column->outputFn[i]);
-			}
-
-			state->columns[attno - 1] = column;
+			getTypeOutputInfo(opcinfo->oi_typcache[i]->type_id, &output, &isVarlena);
+			fmgr_info(output, &column->outputFn[i]);
 		}
 
-		index_close(indexRel, AccessShareLock);
-
-		fctx->user_fctx = state;
-		fctx->tuple_desc = BlessTupleDesc(tupdesc);
-
-		MemoryContextSwitchTo(mctx);
+		columns[attno - 1] = column;
 	}
 
-	fctx = SRF_PERCALL_SETUP();
-	state = fctx->user_fctx;
-
-	if (!state->done)
+	for (;;)
 	{
-		HeapTuple	result;
 		Datum		values[7];
 		bool		nulls[7];
 
@@ -211,39 +200,32 @@ brin_page_items(PG_FUNCTION_ARGS)
 		 * signal for obtaining and decoding the next one.  If that's not the
 		 * case, we output the next attribute.
 		 */
-		if (state->dtup == NULL)
+		if (dtup == NULL)
 		{
 			BrinTuple  *tup;
-			MemoryContext mctx;
 			ItemId		itemId;
 
-			/* deformed tuple must live across calls */
-			mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
-
 			/* verify item status: if there's no data, we can't decode */
-			itemId = PageGetItemId(state->page, state->offset);
+			itemId = PageGetItemId(page, offset);
 			if (ItemIdIsUsed(itemId))
 			{
-				tup = (BrinTuple *) PageGetItem(state->page,
-												PageGetItemId(state->page,
-															  state->offset));
-				state->dtup = brin_deform_tuple(state->bdesc, tup);
-				state->attno = 1;
-				state->unusedItem = false;
+				tup = (BrinTuple *)
+					PageGetItem(page, PageGetItemId(page, offset));
+				dtup = brin_deform_tuple(bdesc, tup);
+				attno = 1;
+				unusedItem = false;
 			}
 			else
-				state->unusedItem = true;
-
-			MemoryContextSwitchTo(mctx);
+				unusedItem = true;
 		}
 		else
-			state->attno++;
+			attno++;
 
 		MemSet(nulls, 0, sizeof(nulls));
 
-		if (state->unusedItem)
+		if (unusedItem)
 		{
-			values[0] = UInt16GetDatum(state->offset);
+			values[0] = UInt16GetDatum(offset);
 			nulls[1] = true;
 			nulls[2] = true;
 			nulls[3] = true;
@@ -253,17 +235,17 @@ brin_page_items(PG_FUNCTION_ARGS)
 		}
 		else
 		{
-			int			att = state->attno - 1;
+			int			att = attno - 1;
 
-			values[0] = UInt16GetDatum(state->offset);
-			values[1] = UInt32GetDatum(state->dtup->bt_blkno);
-			values[2] = UInt16GetDatum(state->attno);
-			values[3] = BoolGetDatum(state->dtup->bt_columns[att].bv_allnulls);
-			values[4] = BoolGetDatum(state->dtup->bt_columns[att].bv_hasnulls);
-			values[5] = BoolGetDatum(state->dtup->bt_placeholder);
-			if (!state->dtup->bt_columns[att].bv_allnulls)
+			values[0] = UInt16GetDatum(offset);
+			values[1] = UInt32GetDatum(dtup->bt_blkno);
+			values[2] = UInt16GetDatum(attno);
+			values[3] = BoolGetDatum(dtup->bt_columns[att].bv_allnulls);
+			values[4] = BoolGetDatum(dtup->bt_columns[att].bv_hasnulls);
+			values[5] = BoolGetDatum(dtup->bt_placeholder);
+			if (!dtup->bt_columns[att].bv_allnulls)
 			{
-				BrinValues *bvalues = &state->dtup->bt_columns[att];
+				BrinValues *bvalues = &dtup->bt_columns[att];
 				StringInfoData s;
 				bool		first;
 				int			i;
@@ -272,14 +254,14 @@ brin_page_items(PG_FUNCTION_ARGS)
 				appendStringInfoChar(&s, '{');
 
 				first = true;
-				for (i = 0; i < state->columns[att]->nstored; i++)
+				for (i = 0; i < columns[att]->nstored; i++)
 				{
 					char	   *val;
 
 					if (!first)
 						appendStringInfoString(&s, " .. ");
 					first = false;
-					val = OutputFunctionCall(&state->columns[att]->outputFn[i],
+					val = OutputFunctionCall(&columns[att]->outputFn[i],
 											 bvalues->bv_values[i]);
 					appendStringInfoString(&s, val);
 					pfree(val);
@@ -295,35 +277,35 @@ brin_page_items(PG_FUNCTION_ARGS)
 			}
 		}
 
-		result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 
 		/*
 		 * If the item was unused, jump straight to the next one; otherwise,
 		 * the only cleanup needed here is to set our signal to go to the next
 		 * tuple in the following iteration, by freeing the current one.
 		 */
-		if (state->unusedItem)
-			state->offset = OffsetNumberNext(state->offset);
-		else if (state->attno >= state->bdesc->bd_tupdesc->natts)
+		if (unusedItem)
+			offset = OffsetNumberNext(offset);
+		else if (attno >= bdesc->bd_tupdesc->natts)
 		{
-			pfree(state->dtup);
-			state->dtup = NULL;
-			state->offset = OffsetNumberNext(state->offset);
+			pfree(dtup);
+			dtup = NULL;
+			offset = OffsetNumberNext(offset);
 		}
 
 		/*
-		 * If we're beyond the end of the page, set flag to end the function
-		 * in the following iteration.
+		 * If we're beyond the end of the page, we're done.
 		 */
-		if (state->offset > PageGetMaxOffsetNumber(state->page))
-			state->done = true;
-
-		SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+		if (offset > PageGetMaxOffsetNumber(page))
+			break;
 	}
 
-	brin_free_desc(state->bdesc);
+	/* clean up and return the tuplestore */
+	brin_free_desc(bdesc);
+	tuplestore_donestoring(tupstore);
+	index_close(indexRel, AccessShareLock);
 
-	SRF_RETURN_DONE(fctx);
+	return (Datum) 0;
 }
 
 Datum
#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
Re: Re: [COMMITTERS] Using contrib modules in check (Re: pgsql: Fix BRIN to use SnapshotAny during summarization)

Alvaro Herrera wrote:

Tom Lane wrote:

(If I were tasked with fixing it, I'd be tempted to rewrite it to do
all the work in one call and return a tuplestore; the alternative
seems to be to try to keep the index open across multiple calls,
which would be a mess.)

Here's a patch doing that.

Pushed, thanks.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers