[proposal] de-TOAST'ing using a iterator

Started by Binguo Baoover 6 years ago34 messages
#1Binguo Bao
djydewang@gmail.com
1 attachment(s)

Hi hackers!
This proposal aims to provide the ability to de-TOAST a fully TOAST'd and
compressed field using an iterator and then update the appropriate parts of
the code to use the iterator where possible instead of de-TOAST'ing and
de-compressing the entire value. Examples where this can be helpful include
using position() from the beginning of the value, or doing a pattern or
substring match.

de-TOAST iterator overview:
1. The caller requests the slice of the attribute value from the de-TOAST
iterator.
2. The de-TOAST iterator checks if there is a slice available in the output
buffer, if there is, return the result directly,
otherwise goto the step3.
3. The de-TOAST iterator checks if there is the slice available in the
input buffer, if there is, goto step44. Otherwise,
call fetch_datum_iterator to fetch datums from disk to input buffer.
4. If the data in the input buffer is compressed, extract some data from
the input buffer to the output buffer until the caller's
needs are met.

I've implemented the prototype and apply it to the position() function to
test performance.
Test tables:
-----------------------------------------------------------------------------------------------------
create table detoast_c (id serial primary key,
a text
);
insert into detoast_c (a) select
repeat('1234567890-=abcdefghijklmnopqrstuvwxyz', 1000000)||'321' as a from
generate_series(1,100);

create table detoast_u (id serial primary key,
a text
);
alter table detoast_u alter a set storage external;
insert into detoast_u (a) select
repeat('1234567890-=abcdefghijklmnopqrstuvwxyz', 1000000)||'321' as a from
generate_series(1,100);
**************************************************************************************
-----------------------------------------------------------------------------------------------------
query |
master (ms) | patch (ms) |
-----------------------------------------------------------------------------------------------------
select position('123' in a) from detoast_c; | 4054.838 |
1440.735 |
-----------------------------------------------------------------------------------------------------
select position('321' in a) from detoast_c; | 25549.270 |
27696.245 |
-----------------------------------------------------------------------------------------------------
select position('123' in a) from detoast_u; | 8116.996 |
1386.802 |
-----------------------------------------------------------------------------------------------------
select position('321' in a) from detoast_u | 28442.116 |
27672.319 |
-----------------------------------------------------------------------------------------------------
**************************************************************************************
It can be seen that the iterator greatly improves the efficiency of partial
de-TOAST when it has almost no degradation in full de-TOAST efficiency.
Next, I will continue to study how to apply iterators to more queries
and improve iterator efficiency, such as using macros instead of function
calls.

The patch is also available on github[1]https://github.com/djydewang/postgres/pull/1/files.
Any suggestions or comments would be much appreciated:)

Best regards, Binguo Bao.

[1]: https://github.com/djydewang/postgres/pull/1/files

Attachments:

0001-de-TOASTing-using-a-iterator.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator.patchDownload
From b071bf9801f45d5ff48422b44bd5042ae19ea20c Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 488 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  48 ++--
 src/include/access/tuptoaster.h      |  92 +++++++
 3 files changed, 612 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..92dc87a 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									 DetoastIterator iter, int32 length);
 
 
 /* ----------
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	free_toast_buffer(iter->source);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->buf->limit - iter->buf->position >= length || iter->done)
+	{
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	if (iter->fetch_datum_iterator != NULL)
+	{
+		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
+		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+		while(buf->limit - buf->position < length && !fetch_iter->done) {
+			fetch_datum_iterate(fetch_iter);
+		}
+		if (iter->compressed) {
+			return pglz_decompress_iterate(buf, iter->buf, iter, length);
+		}
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2555,345 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	FetchDatumIterator iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator, and return slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+static int32
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	if (iter == NULL)
+	{
+		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->done)
+	{
+		return -1;
+	}
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return -1;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+	return chunksize;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree(buf->buf);
+	pfree(buf);
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest. Returns the number of available bytes
+ *		decompressed in the destination buffer.
+ * ----------
+ */
+static int32
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter, int32 need_len)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	sp = (const unsigned char *) source->position;
+	srcend = ((const unsigned char *) source->limit);
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *)dest->capacity;
+
+	while (sp < srcend && dp < destend && ((char *)dp - dest->position) < need_len)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc != 0) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		if (ctrlc < 8) {
+			iter->ctrlc = ctrlc;
+			iter->ctrl = ctrl;
+		}
+		else
+		{
+			iter->ctrlc = 0;
+		}
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+	return dest->limit - dest->position;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..e7fab58 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -122,10 +122,10 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
-static bool text_position_next(TextPositionState *state);
-static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
+static bool text_position_next(TextPositionState *state, DetoastIterator iter);
+static char *text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter);
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
@@ -1092,10 +1092,19 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1122,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,15 +1131,15 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
-	if (!text_position_next(&state))
+	if (!text_position_next(&state, iter))
 		result = 0;
 	else
 		result = text_position_get_match_pos(&state);
 	text_position_cleanup(&state);
+	free_detoast_iterator(iter);
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1274,7 +1283,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
  * is found.
  */
 static bool
-text_position_next(TextPositionState *state)
+text_position_next(TextPositionState *state, DetoastIterator iter)
 {
 	int			needle_len = state->len2;
 	char	   *start_ptr;
@@ -1290,7 +1299,7 @@ text_position_next(TextPositionState *state)
 		start_ptr = state->str1;
 
 retry:
-	matchptr = text_position_next_internal(start_ptr, state);
+	matchptr = text_position_next_internal(start_ptr, state, iter);
 
 	if (!matchptr)
 		return false;
@@ -1338,7 +1347,7 @@ retry:
  * match starting at 'start_ptr', or NULL if no match is found.
  */
 static char *
-text_position_next_internal(char *start_ptr, TextPositionState *state)
+text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter)
 {
 	int			haystack_len = state->len1;
 	int			needle_len = state->len2;
@@ -1358,6 +1367,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1371,6 +1383,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr + needle_len - 1;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
+
 			/* Match the needle scanning *backward* */
 			const char *nptr;
 			const char *p;
@@ -4231,7 +4247,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
 
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* When the from_sub_text is not found, there is nothing to do. */
 	if (!found)
@@ -4256,7 +4272,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 		start_ptr = curr_ptr + from_sub_text_len;
 
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			curr_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4594,7 +4610,7 @@ split_text(PG_FUNCTION_ARGS)
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* special case if fldsep not found at all */
 	if (!found)
@@ -4612,7 +4628,7 @@ split_text(PG_FUNCTION_ARGS)
 	{
 		/* identify bounds of next field */
 		start_ptr = end_ptr + fldsep_len;
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			end_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4766,7 +4782,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 
 			CHECK_FOR_INTERRUPTS();
 
-			found = text_position_next(&state);
+			found = text_position_next(&state, NULL);
 			if (!found)
 			{
 				/* fetch last field */
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..e60815d 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,98 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	ToastBuffer			*source;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;			/* iterator exhausted? */
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
-- 
2.7.4

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Binguo Bao (#1)
Re: [proposal] de-TOAST'ing using a iterator

On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao <djydewang@gmail.com> wrote:

Hi hackers!
This proposal aims to provide the ability to de-TOAST a fully TOAST'd and compressed field using an iterator and then update the appropriate parts of the code to use the iterator where possible instead of de-TOAST'ing and de-compressing the entire value. Examples where this can be helpful include using position() from the beginning of the value, or doing a pattern or substring match.

de-TOAST iterator overview:
1. The caller requests the slice of the attribute value from the de-TOAST iterator.
2. The de-TOAST iterator checks if there is a slice available in the output buffer, if there is, return the result directly,
otherwise goto the step3.
3. The de-TOAST iterator checks if there is the slice available in the input buffer, if there is, goto step44. Otherwise,
call fetch_datum_iterator to fetch datums from disk to input buffer.
4. If the data in the input buffer is compressed, extract some data from the input buffer to the output buffer until the caller's
needs are met.

I've implemented the prototype and apply it to the position() function to test performance.

Hi Binguo,

Interesting work, and nice performance improvements so far. Just by
the way, the patch currently generates warnings:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719

--
Thomas Munro
https://enterprisedb.com

#3Binguo Bao
djydewang@gmail.com
In reply to: Thomas Munro (#2)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Hi Thomas,
I've fixed the warnings.

Thomas Munro <thomas.munro@gmail.com> 于2019年7月5日周五 下午12:21写道:

Show quoted text

On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao <djydewang@gmail.com> wrote:

Hi hackers!
This proposal aims to provide the ability to de-TOAST a fully TOAST'd

and compressed field using an iterator and then update the appropriate
parts of the code to use the iterator where possible instead of
de-TOAST'ing and de-compressing the entire value. Examples where this can
be helpful include using position() from the beginning of the value, or
doing a pattern or substring match.

de-TOAST iterator overview:
1. The caller requests the slice of the attribute value from the

de-TOAST iterator.

2. The de-TOAST iterator checks if there is a slice available in the

output buffer, if there is, return the result directly,

otherwise goto the step3.
3. The de-TOAST iterator checks if there is the slice available in the

input buffer, if there is, goto step44. Otherwise,

call fetch_datum_iterator to fetch datums from disk to input buffer.
4. If the data in the input buffer is compressed, extract some data from

the input buffer to the output buffer until the caller's

needs are met.

I've implemented the prototype and apply it to the position() function

to test performance.

Hi Binguo,

Interesting work, and nice performance improvements so far. Just by
the way, the patch currently generates warnings:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-de-TOASTing-using-a-iterator-2.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-2.patchDownload
From 0ac012867be04b409bfc4e67306fd9f3e5b4785a Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 489 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  48 ++--
 src/include/access/tuptoaster.h      |  92 +++++++
 3 files changed, 613 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..07bb69e 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									 DetoastIterator iter, int32 length);
 
 
 /* ----------
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	free_toast_buffer(iter->source);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->buf->limit - iter->buf->position >= length || iter->done)
+	{
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	if (iter->fetch_datum_iterator != NULL)
+	{
+		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
+		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+		while(buf->limit - buf->position < length && !fetch_iter->done) {
+			fetch_datum_iterate(fetch_iter);
+		}
+		if (iter->compressed) {
+			return pglz_decompress_iterate(buf, iter->buf, iter, length);
+		}
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2555,346 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator, and return slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+static int32
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	if (iter == NULL)
+	{
+		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->done)
+	{
+		return -1;
+	}
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return -1;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+	return chunksize;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest. Returns the number of available bytes
+ *		decompressed in the destination buffer.
+ * ----------
+ */
+static int32
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter, int32 need_len)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	sp = (const unsigned char *) source->position;
+	srcend = ((const unsigned char *) source->limit);
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *)dest->capacity;
+
+	while (sp < srcend && dp < destend && ((char *)dp - dest->position) < need_len)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc != 0) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		if (ctrlc < 8) {
+			iter->ctrlc = ctrlc;
+			iter->ctrl = ctrl;
+		}
+		else
+		{
+			iter->ctrlc = 0;
+		}
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+	return dest->limit - dest->position;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..e7fab58 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -122,10 +122,10 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
-static bool text_position_next(TextPositionState *state);
-static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
+static bool text_position_next(TextPositionState *state, DetoastIterator iter);
+static char *text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter);
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
@@ -1092,10 +1092,19 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1122,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,15 +1131,15 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
-	if (!text_position_next(&state))
+	if (!text_position_next(&state, iter))
 		result = 0;
 	else
 		result = text_position_get_match_pos(&state);
 	text_position_cleanup(&state);
+	free_detoast_iterator(iter);
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1274,7 +1283,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
  * is found.
  */
 static bool
-text_position_next(TextPositionState *state)
+text_position_next(TextPositionState *state, DetoastIterator iter)
 {
 	int			needle_len = state->len2;
 	char	   *start_ptr;
@@ -1290,7 +1299,7 @@ text_position_next(TextPositionState *state)
 		start_ptr = state->str1;
 
 retry:
-	matchptr = text_position_next_internal(start_ptr, state);
+	matchptr = text_position_next_internal(start_ptr, state, iter);
 
 	if (!matchptr)
 		return false;
@@ -1338,7 +1347,7 @@ retry:
  * match starting at 'start_ptr', or NULL if no match is found.
  */
 static char *
-text_position_next_internal(char *start_ptr, TextPositionState *state)
+text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter)
 {
 	int			haystack_len = state->len1;
 	int			needle_len = state->len2;
@@ -1358,6 +1367,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1371,6 +1383,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr + needle_len - 1;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
+
 			/* Match the needle scanning *backward* */
 			const char *nptr;
 			const char *p;
@@ -4231,7 +4247,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
 
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* When the from_sub_text is not found, there is nothing to do. */
 	if (!found)
@@ -4256,7 +4272,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 		start_ptr = curr_ptr + from_sub_text_len;
 
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			curr_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4594,7 +4610,7 @@ split_text(PG_FUNCTION_ARGS)
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* special case if fldsep not found at all */
 	if (!found)
@@ -4612,7 +4628,7 @@ split_text(PG_FUNCTION_ARGS)
 	{
 		/* identify bounds of next field */
 		start_ptr = end_ptr + fldsep_len;
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			end_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4766,7 +4782,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 
 			CHECK_FOR_INTERRUPTS();
 
-			found = text_position_next(&state);
+			found = text_position_next(&state, NULL);
 			if (!found)
 			{
 				/* fetch last field */
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..e60815d 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,98 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	ToastBuffer			*source;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;			/* iterator exhausted? */
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
-- 
2.7.4

#4Binguo Bao
djydewang@gmail.com
In reply to: Binguo Bao (#3)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

This is the patch that fix warnings.

Best Regards,
Binguo Bao

Binguo Bao <djydewang@gmail.com> 于2019年7月10日周三 下午10:18写道:

Show quoted text

Hi Thomas,
I've fixed the warnings.

Thomas Munro <thomas.munro@gmail.com> 于2019年7月5日周五 下午12:21写道:

On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao <djydewang@gmail.com> wrote:

Hi hackers!
This proposal aims to provide the ability to de-TOAST a fully TOAST'd

and compressed field using an iterator and then update the appropriate
parts of the code to use the iterator where possible instead of
de-TOAST'ing and de-compressing the entire value. Examples where this can
be helpful include using position() from the beginning of the value, or
doing a pattern or substring match.

de-TOAST iterator overview:
1. The caller requests the slice of the attribute value from the

de-TOAST iterator.

2. The de-TOAST iterator checks if there is a slice available in the

output buffer, if there is, return the result directly,

otherwise goto the step3.
3. The de-TOAST iterator checks if there is the slice available in the

input buffer, if there is, goto step44. Otherwise,

call fetch_datum_iterator to fetch datums from disk to input buffer.
4. If the data in the input buffer is compressed, extract some data

from the input buffer to the output buffer until the caller's

needs are met.

I've implemented the prototype and apply it to the position() function

to test performance.

Hi Binguo,

Interesting work, and nice performance improvements so far. Just by
the way, the patch currently generates warnings:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-de-TOASTing-using-a-iterator-3.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-3.patchDownload
From ea2eb65049e2da984f0a2c3475f7f4f92adc98ab Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 491 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  48 ++--
 src/include/access/tuptoaster.h      |  92 +++++++
 3 files changed, 615 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..8f7faf6 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									 DetoastIterator iter, int32 length);
 
 
 /* ----------
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	free_toast_buffer(iter->source);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->buf->limit - iter->buf->position >= length || iter->done)
+	{
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	if (iter->fetch_datum_iterator != NULL)
+	{
+		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
+		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+		while(buf->limit - buf->position < length && !fetch_iter->done) {
+			fetch_datum_iterate(fetch_iter);
+		}
+		if (iter->compressed) {
+			return pglz_decompress_iterate(buf, iter->buf, iter, length);
+		}
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2555,348 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator, and return slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+static int32
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	if (iter == NULL)
+	{
+		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->done)
+	{
+		return -1;
+	}
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return -1;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+	return chunksize;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest. Returns the number of available bytes
+ *		decompressed in the destination buffer.
+ * ----------
+ */
+static int32
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter, int32 need_len)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	sp = (const unsigned char *) source->position;
+	srcend = ((const unsigned char *) source->limit);
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *)dest->capacity;
+
+	while (sp < srcend && dp < destend && ((char *)dp - dest->position) < need_len)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc != 0) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		if (ctrlc < 8) {
+			iter->ctrlc = ctrlc;
+			iter->ctrl = ctrl;
+		}
+		else
+		{
+			iter->ctrlc = 0;
+		}
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+	return dest->limit - dest->position;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..e7fab58 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -122,10 +122,10 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
-static bool text_position_next(TextPositionState *state);
-static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
+static bool text_position_next(TextPositionState *state, DetoastIterator iter);
+static char *text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter);
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
@@ -1092,10 +1092,19 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1122,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,15 +1131,15 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
-	if (!text_position_next(&state))
+	if (!text_position_next(&state, iter))
 		result = 0;
 	else
 		result = text_position_get_match_pos(&state);
 	text_position_cleanup(&state);
+	free_detoast_iterator(iter);
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1274,7 +1283,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
  * is found.
  */
 static bool
-text_position_next(TextPositionState *state)
+text_position_next(TextPositionState *state, DetoastIterator iter)
 {
 	int			needle_len = state->len2;
 	char	   *start_ptr;
@@ -1290,7 +1299,7 @@ text_position_next(TextPositionState *state)
 		start_ptr = state->str1;
 
 retry:
-	matchptr = text_position_next_internal(start_ptr, state);
+	matchptr = text_position_next_internal(start_ptr, state, iter);
 
 	if (!matchptr)
 		return false;
@@ -1338,7 +1347,7 @@ retry:
  * match starting at 'start_ptr', or NULL if no match is found.
  */
 static char *
-text_position_next_internal(char *start_ptr, TextPositionState *state)
+text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter)
 {
 	int			haystack_len = state->len1;
 	int			needle_len = state->len2;
@@ -1358,6 +1367,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1371,6 +1383,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr + needle_len - 1;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
+
 			/* Match the needle scanning *backward* */
 			const char *nptr;
 			const char *p;
@@ -4231,7 +4247,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
 
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* When the from_sub_text is not found, there is nothing to do. */
 	if (!found)
@@ -4256,7 +4272,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 		start_ptr = curr_ptr + from_sub_text_len;
 
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			curr_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4594,7 +4610,7 @@ split_text(PG_FUNCTION_ARGS)
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* special case if fldsep not found at all */
 	if (!found)
@@ -4612,7 +4628,7 @@ split_text(PG_FUNCTION_ARGS)
 	{
 		/* identify bounds of next field */
 		start_ptr = end_ptr + fldsep_len;
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			end_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4766,7 +4782,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 
 			CHECK_FOR_INTERRUPTS();
 
-			found = text_position_next(&state);
+			found = text_position_next(&state, NULL);
 			if (!found)
 			{
 				/* fetch last field */
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..e60815d 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,98 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	ToastBuffer			*source;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;			/* iterator exhausted? */
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
-- 
2.7.4

#5Binguo Bao
djydewang@gmail.com
In reply to: Binguo Bao (#4)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

I have set the local build configuration to be the same as on the CI. This
patch should be correct.

Best regards,
Binguo Bao

Binguo Bao <djydewang@gmail.com> 于2019年7月11日周四 上午12:39写道:

Show quoted text

This is the patch that fix warnings.

Best Regards,
Binguo Bao

Binguo Bao <djydewang@gmail.com> 于2019年7月10日周三 下午10:18写道:

Hi Thomas,
I've fixed the warnings.

Thomas Munro <thomas.munro@gmail.com> 于2019年7月5日周五 下午12:21写道:

On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao <djydewang@gmail.com> wrote:

Hi hackers!
This proposal aims to provide the ability to de-TOAST a fully TOAST'd

and compressed field using an iterator and then update the appropriate
parts of the code to use the iterator where possible instead of
de-TOAST'ing and de-compressing the entire value. Examples where this can
be helpful include using position() from the beginning of the value, or
doing a pattern or substring match.

de-TOAST iterator overview:
1. The caller requests the slice of the attribute value from the

de-TOAST iterator.

2. The de-TOAST iterator checks if there is a slice available in the

output buffer, if there is, return the result directly,

otherwise goto the step3.
3. The de-TOAST iterator checks if there is the slice available in the

input buffer, if there is, goto step44. Otherwise,

call fetch_datum_iterator to fetch datums from disk to input

buffer.

4. If the data in the input buffer is compressed, extract some data

from the input buffer to the output buffer until the caller's

needs are met.

I've implemented the prototype and apply it to the position() function

to test performance.

Hi Binguo,

Interesting work, and nice performance improvements so far. Just by
the way, the patch currently generates warnings:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-de-TOASTing-using-a-iterator-4.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-4.patchDownload
From 556e7945ee6e9c6ceba723a2fc5c5f1d60a9f412 Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 491 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  49 ++--
 src/include/access/tuptoaster.h      |  92 +++++++
 3 files changed, 616 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..8f7faf6 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									 DetoastIterator iter, int32 length);
 
 
 /* ----------
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	free_toast_buffer(iter->source);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->buf->limit - iter->buf->position >= length || iter->done)
+	{
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	if (iter->fetch_datum_iterator != NULL)
+	{
+		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
+		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+		while(buf->limit - buf->position < length && !fetch_iter->done) {
+			fetch_datum_iterate(fetch_iter);
+		}
+		if (iter->compressed) {
+			return pglz_decompress_iterate(buf, iter->buf, iter, length);
+		}
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2555,348 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator, and return slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+static int32
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	if (iter == NULL)
+	{
+		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->done)
+	{
+		return -1;
+	}
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return -1;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+	return chunksize;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest. Returns the number of available bytes
+ *		decompressed in the destination buffer.
+ * ----------
+ */
+static int32
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter, int32 need_len)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	sp = (const unsigned char *) source->position;
+	srcend = ((const unsigned char *) source->limit);
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *)dest->capacity;
+
+	while (sp < srcend && dp < destend && ((char *)dp - dest->position) < need_len)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc != 0) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		if (ctrlc < 8) {
+			iter->ctrlc = ctrlc;
+			iter->ctrl = ctrl;
+		}
+		else
+		{
+			iter->ctrlc = 0;
+		}
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+	return dest->limit - dest->position;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..72e10a3 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -122,10 +122,10 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
-static bool text_position_next(TextPositionState *state);
-static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
+static bool text_position_next(TextPositionState *state, DetoastIterator iter);
+static char *text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter);
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
@@ -1092,10 +1092,20 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1123,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,15 +1132,15 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
-	if (!text_position_next(&state))
+	if (!text_position_next(&state, iter))
 		result = 0;
 	else
 		result = text_position_get_match_pos(&state);
 	text_position_cleanup(&state);
+	free_detoast_iterator(iter);
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1274,7 +1284,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
  * is found.
  */
 static bool
-text_position_next(TextPositionState *state)
+text_position_next(TextPositionState *state, DetoastIterator iter)
 {
 	int			needle_len = state->len2;
 	char	   *start_ptr;
@@ -1290,7 +1300,7 @@ text_position_next(TextPositionState *state)
 		start_ptr = state->str1;
 
 retry:
-	matchptr = text_position_next_internal(start_ptr, state);
+	matchptr = text_position_next_internal(start_ptr, state, iter);
 
 	if (!matchptr)
 		return false;
@@ -1338,7 +1348,7 @@ retry:
  * match starting at 'start_ptr', or NULL if no match is found.
  */
 static char *
-text_position_next_internal(char *start_ptr, TextPositionState *state)
+text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter)
 {
 	int			haystack_len = state->len1;
 	int			needle_len = state->len2;
@@ -1358,6 +1368,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1388,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -4231,7 +4248,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
 
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* When the from_sub_text is not found, there is nothing to do. */
 	if (!found)
@@ -4256,7 +4273,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 		start_ptr = curr_ptr + from_sub_text_len;
 
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			curr_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4594,7 +4611,7 @@ split_text(PG_FUNCTION_ARGS)
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* special case if fldsep not found at all */
 	if (!found)
@@ -4612,7 +4629,7 @@ split_text(PG_FUNCTION_ARGS)
 	{
 		/* identify bounds of next field */
 		start_ptr = end_ptr + fldsep_len;
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			end_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4766,7 +4783,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 
 			CHECK_FOR_INTERRUPTS();
 
-			found = text_position_next(&state);
+			found = text_position_next(&state, NULL);
 			if (!found)
 			{
 				/* fetch last field */
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..e60815d 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,98 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	ToastBuffer			*source;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;			/* iterator exhausted? */
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
-- 
2.7.4

#6John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#5)
Re: [proposal] de-TOAST'ing using a iterator

On Wed, Jun 19, 2019 at 8:51 PM Binguo Bao <djydewang@gmail.com> wrote:

[v4 patch]

Hi Binguo,

I can verify I get no warnings with the v4 patch. I've done some
additional performance testing. First, to sum up your results:

insert into detoast_c (a) select repeat('1234567890-=abcdefghijklmnopqrstuvwxyz', 1000000)||'321' as a from generate_series(1,100);

When the search pattern was at the beginning, the patch was several
times faster , and when the pattern was at the end, it was 3% slower
when uncompressed and 9% slower when compressed.

First, I'd like to advocate for caution when using synthetic
benchmarks involving compression. Consider this test:

insert into detoast_c (a)
select
'abc'||
repeat(
(SELECT string_agg(md5(chr(i)), '')
FROM generate_series(1,127) i)
, 10000)
||'xyz'
from generate_series(1,100);

The results for the uncompressed case were not much different then
your test. However, in the compressed case the iterator doesn't buy us
much with beginning searches since full decompression is already fast:

master patch
comp. beg. 869ms 837ms
comp. end 14100ms 16100ms
uncomp. beg. 6360ms 800ms
uncomp. end 21100ms 21400ms

and with compression it's 14% slower searching to the end. This is
pretty contrived, but I include it for demonstration.

To test something hopefully a bit more realistic, I loaded 100 records
each containing the 1995 CIA fact book (~3MB of ascii) with a pattern
string put at the beginning and end. For the end search, I used a
longer needle to speed up the consumption of text, hoping to put more
stress on the detoasting algorithms, for example:

select max(position('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
in a)) from detoast_*;

comp. beg. 836ms 22ms
comp. end 1510ms 1700ms
uncomp. beg. 185ms 12ms
uncomp. end 851ms 903ms

Here, the "beginning" case is ~15-35x faster, which is very impressive
and much faster than with your generated contents. The "end" case is
up to 13% slower. It would be interesting to see where the break-even
point is, where the results are the same.

Reading the thread where you're working on optimizing partial
decompression [1]/messages/by-id/CAL-OGkux7+Bm_J=t5VpH7fJGGSm+PxWJtgs1+WU2g6cmLru==A@mail.gmail.com, it seems you have two separate solutions for the
two problems. Maybe this is fine, but I'd like to bring up the
possibility of using the same approach for both kinds of callers.

I'm not an expert on TOAST, but maybe one way to solve both problems
is to work at the level of whole TOAST chunks. In that case, the
current patch would look like this:

1. The caller requests more of the attribute value from the de-TOAST iterator.
2. The iterator gets the next chunk and either copies or decompresses
the whole chunk into the buffer. (If inline, just decompress the whole
thing)

This seems simpler and also easy to adapt to callers that do know how
big a slice they want. I also suspect this way would be easier to
adapt to future TOAST formats not tied to heap or to a certain
compression algorithm. With less bookkeepping overhead, maybe there'll
be less worst-case performance degradation, while not giving up much
in the best case. (Note also that commit 9556aa01c6 already introduced
some performance degradation in near-end searches, when using
multibyte strings. This patch would add to that.) The regression
doesn't seem large, but I see more than your test showed, and it would
be nice to avoid it.

Thoughts, anyone?

[1]: /messages/by-id/CAL-OGkux7+Bm_J=t5VpH7fJGGSm+PxWJtgs1+WU2g6cmLru==A@mail.gmail.com

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#6)
2 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Hi, John

First, I'd like to advocate for caution when using synthetic

benchmarks involving compression. Consider this test:
insert into detoast_c (a)
select
'abc'||
repeat(
(SELECT string_agg(md5(chr(i)), '')
FROM generate_series(1,127) i)
, 10000)
||'xyz'
from generate_series(1,100);
The results for the uncompressed case were not much different then
your test. However, in the compressed case the iterator doesn't buy us
much with beginning searches since full decompression is already fast:
master patch
comp. beg. 869ms 837ms
comp. end 14100ms 16100ms
uncomp. beg. 6360ms 800ms
uncomp. end 21100ms 21400ms
and with compression it's 14% slower searching to the end. This is
pretty contrived, but I include it for demonstration.

I've reproduced the test case with test scripts in the attachment on my
laptop:

master patch
comp. beg. 2686.77 ms 1532.79 ms
comp. end 17971.8 ms 21206.3 ms
uncomp. beg. 8358.79 ms 1556.93 ms
uncomp. end 23559.7 ms 22547.1 ms

In the compressed beginning case, the test result is different from yours
since the patch is ~1.75x faster
rather than no improvement. The interesting thing is that the patch if 4%
faster than master in the uncompressed end case.
I can't figure out reason now.

Reading the thread where you're working on optimizing partial

decompression [1], it seems you have two separate solutions for the
two problems. Maybe this is fine, but I'd like to bring up the
possibility of using the same approach for both kinds of callers.

I'm not an expert on TOAST, but maybe one way to solve both problems
is to work at the level of whole TOAST chunks. In that case, the
current patch would look like this:
1. The caller requests more of the attribute value from the de-TOAST
iterator.
2. The iterator gets the next chunk and either copies or decompresses
the whole chunk into the buffer. (If inline, just decompress the whole
thing)

Thanks for your suggestion. It is indeed possible to implement
PG_DETOAST_DATUM_SLICE using the de-TOAST iterator.
IMO the iterator is more suitable for situations where the caller doesn't
know the slice size. If the caller knows the slice size,
it is reasonable to fetch enough chunks at once and then decompress it at
once.
--
Best regards,
Binguo Bao

Attachments:

init-test.shapplication/x-shellscript; name=init-test.shDownload
iterator-test.shapplication/x-shellscript; name=iterator-test.shDownload
#8John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#7)
Re: [proposal] de-TOAST'ing using a iterator

On Tue, Jul 16, 2019 at 9:14 PM Binguo Bao <djydewang@gmail.com> wrote:

In the compressed beginning case, the test result is different from yours since the patch is ~1.75x faster
rather than no improvement. The interesting thing is that the patch if 4% faster than master in the uncompressed end case.
I can't figure out reason now.

Probably some differences in our test environments. I wouldn't worry
about it too much, since we can show improvement in more realistic
tests.

I'm not an expert on TOAST, but maybe one way to solve both problems
is to work at the level of whole TOAST chunks. In that case, the
current patch would look like this:
1. The caller requests more of the attribute value from the de-TOAST iterator.
2. The iterator gets the next chunk and either copies or decompresses
the whole chunk into the buffer. (If inline, just decompress the whole
thing)

Thanks for your suggestion. It is indeed possible to implement PG_DETOAST_DATUM_SLICE using the de-TOAST iterator.
IMO the iterator is more suitable for situations where the caller doesn't know the slice size. If the caller knows the slice size,
it is reasonable to fetch enough chunks at once and then decompress it at once.

That sounds reasonable for the reason of less overhead.

In the case where we don't know the slice size, how about the other
aspect of my question above: Might it be simpler and less overhead to
decompress entire chunks at a time? If so, I think it would be
enlightening to compare performance.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#8)
3 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Hi John!
Sorry for the late reply. It took me some time to fix a random bug.

In the case where we don't know the slice size, how about the other

aspect of my question above: Might it be simpler and less overhead to
decompress entire chunks at a time? If so, I think it would be
enlightening to compare performance.

Good idea. I've tested your propopal with scripts and patch v5 in the
attachment:

master patch v4 patch v5
comp. beg. 4364ms 1505ms 1529ms
comp. end 28321ms 31202ms 26916ms
uncomp. beg. 3474ms 1513ms 1523ms
uncomp. end 27416ms 30260ms 25888ms

The proposal improves suffix query performance greatly
with less calls to the decompression function.

Besides, do you have any other suggestions for the structure of
DetoastIterator or ToastBuffer?
Maybe they can be designed to be more reasonable.

Thanks again for the proposal.
--
Best regards,
Binguo Bao

Attachments:

init-test.shapplication/x-shellscript; name=init-test.shDownload
iterator-test.shapplication/x-shellscript; name=iterator-test.shDownload
0001-de-TOASTing-using-a-iterator-5.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-5.patchDownload
From 19deb6d7609a156cb14571d83c27d72f3ecb5aa8 Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 498 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  49 ++--
 src/include/access/tuptoaster.h      |  92 +++++++
 3 files changed, 623 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..fb52bfb 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									 DetoastIterator iter, int32 length);
 
 
 /* ----------
@@ -347,6 +354,147 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	free_toast_buffer(iter->source);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->buf->limit - iter->buf->position >= length || iter->done)
+	{
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	if (iter->fetch_datum_iterator != NULL)
+	{
+		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
+		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+		while(iter->buf->limit - iter->buf->position < length && !fetch_iter->done) {
+			fetch_datum_iterate(fetch_iter);
+			if (iter->compressed)
+			{
+				pglz_decompress_iterate(buf, iter->buf, iter, -1);
+			}
+		}
+
+		return iter->buf->limit - iter->buf->position;
+	}
+
+	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2557,353 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator, and return slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+static int32
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	if (iter == NULL)
+	{
+		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
+	}
+
+	if (iter->done)
+	{
+		return -1;
+	}
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return -1;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+	return chunksize;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest. Returns the number of available bytes
+ *		decompressed in the destination buffer.
+ * ----------
+ */
+static int32
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter, int32 need_len)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+	bool decompress_all;
+
+	sp = (const unsigned char *) source->position;
+	srcend = (const unsigned char *) (source->limit == source->capacity ? source->limit : (source->limit - 4));
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	decompress_all = need_len < 0 ? true : false;
+
+	while (sp + 1 < srcend && dp < destend &&
+	      (decompress_all || ((char *)dp - dest->position) < need_len))
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc != 0) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		if (ctrlc < 8) {
+			iter->ctrlc = ctrlc;
+			iter->ctrl = ctrl;
+		}
+		else
+		{
+			iter->ctrlc = 0;
+			iter->ctrl = 0;
+		}
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+	return dest->limit - dest->position;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..72e10a3 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -122,10 +122,10 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
-static bool text_position_next(TextPositionState *state);
-static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
+static bool text_position_next(TextPositionState *state, DetoastIterator iter);
+static char *text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter);
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
@@ -1092,10 +1092,20 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1123,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,15 +1132,15 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
-	if (!text_position_next(&state))
+	if (!text_position_next(&state, iter))
 		result = 0;
 	else
 		result = text_position_get_match_pos(&state);
 	text_position_cleanup(&state);
+	free_detoast_iterator(iter);
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1274,7 +1284,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
  * is found.
  */
 static bool
-text_position_next(TextPositionState *state)
+text_position_next(TextPositionState *state, DetoastIterator iter)
 {
 	int			needle_len = state->len2;
 	char	   *start_ptr;
@@ -1290,7 +1300,7 @@ text_position_next(TextPositionState *state)
 		start_ptr = state->str1;
 
 retry:
-	matchptr = text_position_next_internal(start_ptr, state);
+	matchptr = text_position_next_internal(start_ptr, state, iter);
 
 	if (!matchptr)
 		return false;
@@ -1338,7 +1348,7 @@ retry:
  * match starting at 'start_ptr', or NULL if no match is found.
  */
 static char *
-text_position_next_internal(char *start_ptr, TextPositionState *state)
+text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIterator iter)
 {
 	int			haystack_len = state->len1;
 	int			needle_len = state->len2;
@@ -1358,6 +1368,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1388,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (iter != NULL) {
+				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -4231,7 +4248,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
 
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* When the from_sub_text is not found, there is nothing to do. */
 	if (!found)
@@ -4256,7 +4273,7 @@ replace_text(PG_FUNCTION_ARGS)
 
 		start_ptr = curr_ptr + from_sub_text_len;
 
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			curr_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4594,7 +4611,7 @@ split_text(PG_FUNCTION_ARGS)
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
-	found = text_position_next(&state);
+	found = text_position_next(&state, NULL);
 
 	/* special case if fldsep not found at all */
 	if (!found)
@@ -4612,7 +4629,7 @@ split_text(PG_FUNCTION_ARGS)
 	{
 		/* identify bounds of next field */
 		start_ptr = end_ptr + fldsep_len;
-		found = text_position_next(&state);
+		found = text_position_next(&state, NULL);
 		if (found)
 			end_ptr = text_position_get_match_ptr(&state);
 	}
@@ -4766,7 +4783,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 
 			CHECK_FOR_INTERRUPTS();
 
-			found = text_position_next(&state);
+			found = text_position_next(&state, NULL);
 			if (!found)
 			{
 				/* fetch last field */
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..e60815d 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,98 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	ToastBuffer			*source;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;			/* iterator exhausted? */
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator, and return available slice length.
+ * Return -1 when no more data.
+ * ----------
+ */
+extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
-- 
2.7.4

#10John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#9)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

On Thu, Jul 25, 2019 at 10:21 PM Binguo Bao <djydewang@gmail.com> wrote:

Hi John!
Sorry for the late reply. It took me some time to fix a random bug.

Don't worry, it's not late at all! :-)

In the case where we don't know the slice size, how about the other
aspect of my question above: Might it be simpler and less overhead to
decompress entire chunks at a time? If so, I think it would be
enlightening to compare performance.

Good idea. I've tested your propopal with scripts and patch v5 in the attachment:

master patch v4 patch v5
comp. beg. 4364ms 1505ms 1529ms
comp. end 28321ms 31202ms 26916ms
uncomp. beg. 3474ms 1513ms 1523ms
uncomp. end 27416ms 30260ms 25888ms

The proposal improves suffix query performance greatly
with less calls to the decompression function.

Looks good. I repeated my CIA fact book test and found no difference
with compression, but found that suffix search in the uncompressed
case had less regression (~5%) than v4 (>8%). Let's pursue this
further.

Besides, do you have any other suggestions for the structure of DetoastIterator or ToastBuffer?

My goal for this stage of review was to understand more fully what the
code is doing, and make it as simple and clear as possible, starting
at the top level. In doing so, it looks like I found some additional
performance gains. I haven't looked much yet at the TOAST fetching
logic.

1). For every needle comparison, text_position_next_internal()
calculates how much of the value is needed and passes that to
detoast_iterate(), which then calculates if it has to do something or
not. This is a bit hard to follow. There might also be a performance
penalty -- the following is just a theory, but it sounds plausible:
The CPU can probably correctly predict that detoast_iterate() will
usually return the same value it did last time, but it still has to
call the function and make sure, which I imagine is more expensive
than advancing the needle. Ideally, we want to call the iterator only
if we have to.

In the attached patch (applies on top of your v5),
text_position_next_internal() simply compares hptr to the detoast
buffer limit, and calls detoast_iterate() until it can proceed. I
think this is clearer. (I'm not sure of the error handling, see #2.)
In this scheme, the only reason to know length is to pass to
pglz_decompress_iterate() in the case of in-line compression. As I
alluded to in my first review, I don't think it's worth the complexity
to handle that iteratively since the value is only a few kB. I made it
so in-line datums are fully decompressed as in HEAD and removed struct
members to match. I also noticed that no one updates or looks at
"toast_iter.done" so I removed that as well.

Now pglz_decompress_iterate() doesn't need length at all. For testing
I just set decompress_all = true and let the compiler optimize away
the rest. I left finishing it for you if you agree with these changes.

With this additional patch, the penalty for suffix search in my CIA
fact book test is only ~2% in the compressed case, and might even be
slightly faster than HEAD in the uncompressed case.

2). detoast_iterate() and fetch_datum_iterate() return a value but we
don't check it or do anything with it. Should we do something with it?
It's also not yet clear if we should check the iterator state instead
of return values. I've added some XXX comments as a reminder. We
should also check the return value of pglz_decompress_iterate().

3). Speaking of pglz_decompress_iterate(), I diff'd it with
pglz_decompress(), and I have some questions on it:

a).
+ srcend = (const unsigned char *) (source->limit == source->capacity
? source->limit : (source->limit - 4));

What does the 4 here mean in this expression? Is it possible it's
compensating for this bit in init_toast_buffer()?

+ buf->limit = VARDATA(buf->buf);

It seems the initial limit should also depend on whether the datum is
compressed, right? Can we just do this:

+ buf->limit = buf->position;

b).
- while (sp < srcend && dp < destend)
...
+ while (sp + 1 < srcend && dp < destend &&
...

Why is it here "sp + 1"?

4. Note that varlena.c has a static state variable, and a cleanup
function that currently does:

static void
text_position_cleanup(TextPositionState *state)
{
/* no cleanup needed */
}

It seems to be the detoast iterator could be embedded in this state
variable, and then free-ing can happen here. That has a possible
advantage that the iterator struct would be on the same cache line as
the state data. That would also remove the need to pass "iter" as a
parameter, since these functions already pass "state". I'm not sure if
this would be good for other users of the iterator, so maybe we can
hold off on that for now.

5. Would it be a good idea to add tests (not always practical), or
more Assert()'s? You probably already know this, but as a reminder
it's good to develop with asserts enabled, but never build with them
for performance testing.

I think that's enough for now. If you have any questions or
counter-arguments, let me know. I've set the commitfest entry to
waiting on author.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0002-addendum-to-detoast-v5.patchapplication/x-patch; name=0002-addendum-to-detoast-v5.patchDownload
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index fb52bfba3e..f0f84699c5 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -375,18 +375,14 @@ DetoastIterator create_detoast_iterator(struct varlena *attr) {
 			/* If it's compressed, prepare buffer for raw data */
 			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
 			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
-			iterator->source = NULL;
 			iterator->ctrlc = 0;
 			iterator->compressed = true;
-			iterator->done = false;
 		}
 		else
 		{
 			iterator->buf = iterator->fetch_datum_iterator->buf;
-			iterator->source = NULL;
 			iterator->ctrlc = 0;
 			iterator->compressed = false;
-			iterator->done = false;
 		}
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
@@ -410,21 +406,9 @@ DetoastIterator create_detoast_iterator(struct varlena *attr) {
 	{
 		/*
 		 * This is a compressed value inside of the main tuple
+		 * Skip the iterator and just decompress the whole thing.
 		 */
-		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
-		iterator->fetch_datum_iterator = NULL;
-		iterator->source = palloc0(sizeof(ToastBuffer));
-		iterator->source->buf = (const char*) attr;
-		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
-		iterator->source->limit = (char *)attr + VARSIZE(attr);
-		iterator->source->capacity = iterator->source->limit;
-
-		iterator->buf = palloc0(sizeof(ToastBuffer));
-		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
-
-		iterator->ctrlc = 0;
-		iterator->compressed = true;
-		iterator->done = false;
+		return NULL;
 	}
 
 	return iterator;
@@ -447,7 +431,6 @@ bool free_detoast_iterator(DetoastIterator iter) {
 		free_toast_buffer(iter->buf);
 	}
 	free_fetch_datum_iterator(iter->fetch_datum_iterator);
-	free_toast_buffer(iter->source);
 	pfree(iter);
 	return true;
 }
@@ -463,34 +446,26 @@ bool free_detoast_iterator(DetoastIterator iter) {
  * Return -1 when no more data.
  * ----------
  */
-extern int32 detoast_iterate(int32 length, DetoastIterator iter)
+extern int32 detoast_iterate(DetoastIterator iter)
 {
 	if (iter == NULL)
 	{
 		elog(ERROR, "detoast_iterate shouln't be called for NULL iterator");
 	}
 
-	if (iter->buf->limit - iter->buf->position >= length || iter->done)
-	{
-		return iter->buf->limit - iter->buf->position;
-	}
-
-	if (iter->fetch_datum_iterator != NULL)
-	{
-		ToastBuffer *buf = iter->fetch_datum_iterator->buf;
-		FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
-		while(iter->buf->limit - iter->buf->position < length && !fetch_iter->done) {
-			fetch_datum_iterate(fetch_iter);
-			if (iter->compressed)
-			{
-				pglz_decompress_iterate(buf, iter->buf, iter, -1);
-			}
-		}
+	FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+	Assert(fetch_iter != NULL);
 
+	fetch_datum_iterate(fetch_iter);
+	if (fetch_iter->done)
+		/* XXX should this an error? */
 		return iter->buf->limit - iter->buf->position;
-	}
 
-	return pglz_decompress_iterate(iter->source, iter->buf, iter, length);
+	if (iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, iter->buf, iter, -1);
+
+	/* XXX Can the caller do any error checking with this? If not, just return void */
+	return iter->buf->limit - iter->buf->position;
 }
 
 
@@ -2665,6 +2640,7 @@ fetch_datum_iterate(FetchDatumIterator iter) {
 		elog(ERROR, "fetch_datum_iterate shouln't be called for NULL iterator");
 	}
 
+	/* XXX maybe should be an assert? */
 	if (iter->done)
 	{
 		return -1;
@@ -2763,6 +2739,8 @@ fetch_datum_iterate(FetchDatumIterator iter) {
 	iter->buf->limit += chunksize;
 
 	iter->nextidx++;
+
+	/* XXX Can the caller do error checking with this? */
 	return chunksize;
 }
 
@@ -2821,7 +2799,8 @@ pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator
 	dp = (unsigned char *) dest->limit;
 	destend = (unsigned char *) dest->capacity;
 
-	decompress_all = need_len < 0 ? true : false;
+	/* TODO: remove need_len parameter */
+	decompress_all = true;
 
 	while (sp + 1 < srcend && dp < destend &&
 	      (decompress_all || ((char *)dp - dest->position) < need_len))
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 72e10a3664..ff2a01b2b1 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1368,9 +1368,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIt
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
-			if (iter != NULL) {
-				detoast_iterate(hptr - iter->buf->position + 1, iter);
-			}
+			if (iter != NULL && hptr >= iter->buf->limit)
+				detoast_iterate(iter);
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1388,8 +1388,14 @@ text_position_next_internal(char *start_ptr, TextPositionState *state, DetoastIt
 			const char *nptr;
 			const char *p;
 
-			if (iter != NULL) {
-				detoast_iterate(hptr - iter->buf->position + 1, iter);
+			if (iter != NULL)
+			{
+				/*
+				 * The needle could be long enough that we need
+				 * to detoast more than one chunk each time.
+				 */
+				while (hptr >= iter->buf->limit)
+					detoast_iterate(iter);
 			}
 
 			nptr = needle_last;
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index e60815d2fd..3dbddd6c70 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -69,11 +69,9 @@ typedef struct DetoastIteratorData
 {
 	ToastBuffer 		*buf;
 	FetchDatumIterator	fetch_datum_iterator;
-	ToastBuffer			*source;
 	unsigned char		ctrl;
 	int					ctrlc;
 	bool				compressed;		/* toast value is compressed? */
-	bool				done;			/* iterator exhausted? */
 }			DetoastIteratorData;
 
 typedef struct DetoastIteratorData *DetoastIterator;
@@ -104,7 +102,7 @@ extern bool free_detoast_iterator(DetoastIterator iter);
  * Return -1 when no more data.
  * ----------
  */
-extern int32 detoast_iterate(int32 length, DetoastIterator iter);
+extern int32 detoast_iterate(DetoastIterator iter);
 
 #endif
 
#11Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#10)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

John Naylor <john.naylor@2ndquadrant.com> 于2019年7月29日周一 上午11:49写道:

On Thu, Jul 25, 2019 at 10:21 PM Binguo Bao <djydewang@gmail.com> wrote:
My goal for this stage of review was to understand more fully what the

code is doing, and make it as simple and clear as possible, starting

at the top level. In doing so, it looks like I found some additional
performance gains. I haven't looked much yet at the TOAST fetching
logic.

1). For every needle comparison, text_position_next_internal()
calculates how much of the value is needed and passes that to
detoast_iterate(), which then calculates if it has to do something or
not. This is a bit hard to follow. There might also be a performance
penalty -- the following is just a theory, but it sounds plausible:
The CPU can probably correctly predict that detoast_iterate() will
usually return the same value it did last time, but it still has to
call the function and make sure, which I imagine is more expensive
than advancing the needle. Ideally, we want to call the iterator only
if we have to.

In the attached patch (applies on top of your v5),
text_position_next_internal() simply compares hptr to the detoast
buffer limit, and calls detoast_iterate() until it can proceed. I
think this is clearer.

Yes, I think this is a general scenario where the caller continually
calls detoast_iterate until gets enough data, so I think such operations can
be extracted as a macro, as I did in patch v6. In the macro, the
detoast_iterate
function is called only when the data requested by the caller is greater
than the
buffer limit.

(I'm not sure of the error handling, see #2.)

In this scheme, the only reason to know length is to pass to
pglz_decompress_iterate() in the case of in-line compression. As I
alluded to in my first review, I don't think it's worth the complexity
to handle that iteratively since the value is only a few kB. I made it
so in-line datums are fully decompressed as in HEAD and removed struct
members to match.

Sounds good. This not only simplifies the structure and logic of Detoast
Iterator
but also has no major impact on efficiency.

I also noticed that no one updates or looks at
"toast_iter.done" so I removed that as well.

toast_iter.done is updated when the buffer limit reached the buffer
capacity now.
So, I added it back.

Now pglz_decompress_iterate() doesn't need length at all. For testing
I just set decompress_all = true and let the compiler optimize away
the rest. I left finishing it for you if you agree with these changes.

Done.

2). detoast_iterate() and fetch_datum_iterate() return a value but we
don't check it or do anything with it. Should we do something with it?
It's also not yet clear if we should check the iterator state instead
of return values. I've added some XXX comments as a reminder. We
should also check the return value of pglz_decompress_iterate().

IMO, we need to provide users with a simple iterative interface.
Using the required data pointer to compare with the buffer limit is an easy
way.
And the application scenarios of the iterator are mostly read operations.
So I think there is no need to return a value, and the iterator needs to
throw an
exception for some wrong calls, such as all the data have been iterated,
but the user still calls the iterator.

3). Speaking of pglz_decompress_iterate(), I diff'd it with
pglz_decompress(), and I have some questions on it:

a).
+ srcend = (const unsigned char *) (source->limit == source->capacity
? source->limit : (source->limit - 4));

What does the 4 here mean in this expression?

Since we fetch chunks one by one, if we make srcend equals to the source
buffer limit,
In the while loop "while (sp < srcend && dp < destend)", sp may exceed the
source buffer limit and
read unallocated bytes. Giving a four-byte buffer can prevent sp from
exceeding the source buffer limit.
If we have read all the chunks, we don't need to be careful to cross the
border,
just make srcend equal to source buffer limit. I've added comments to
explain it in patch v6.

Is it possible it's
compensating for this bit in init_toast_buffer()?

+ buf->limit = VARDATA(buf->buf);

It seems the initial limit should also depend on whether the datum is
compressed, right? Can we just do this:

+ buf->limit = buf->position;

I'm afraid not. buf->position points to the data portion of the buffer, but
the beginning of
the chunks we read may contain header information. For example, for
compressed data chunks,
the first four bytes record the size of raw data, this means that limit is
four bytes ahead of position.
This initialization doesn't cause errors, although the position is less
than the limit in other cases.
Because we always fetch chunks first, then decompress it.

b).
- while (sp < srcend && dp < destend)
...
+ while (sp + 1 < srcend && dp < destend &&
...

Why is it here "sp + 1"?

Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in patch v6
to
achieve the purpose of parsing ctrl correctly every time.

4. Note that varlena.c has a static state variable, and a cleanup
function that currently does:

static void
text_position_cleanup(TextPositionState *state)
{
/* no cleanup needed */
}

It seems to be the detoast iterator could be embedded in this state
variable, and then free-ing can happen here. That has a possible
advantage that the iterator struct would be on the same cache line as
the state data. That would also remove the need to pass "iter" as a
parameter, since these functions already pass "state". I'm not sure if
this would be good for other users of the iterator, so maybe we can
hold off on that for now.

Good idea. I've implemented it in patch v6.

5. Would it be a good idea to add tests (not always practical), or
more Assert()'s? You probably already know this, but as a reminder
it's good to develop with asserts enabled, but never build with them
for performance testing.

I've added more Assert()'s to check iterator state.

I think that's enough for now. If you have any questions or
counter-arguments, let me know. I've set the commitfest entry to
waiting on author.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

BTW, I found that iterators come in handy for json/jsonb's find field value
or get array elements operations.
I will continue to optimize the json/jsonb query based on the detoast
iterator patch.

--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-6.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-6.patchDownload
From f59bd661d49d9c84941616280c8e4c66d53b570f Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 447 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  48 +++-
 src/include/access/tuptoaster.h      |  90 +++++++
 src/include/fmgr.h                   |   7 +
 4 files changed, 580 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..2f77f11 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
 
 
 /* ----------
@@ -347,6 +354,115 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->ctrlc = 8;
+			iterator->compressed = true;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->ctrlc = 8;
+			iterator->compressed = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 * Skip the iterator and just decompress the whole thing.
+		 */
+		return NULL;
+	}
+
+	return iterator;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	pfree(iter);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter)
+{
+	FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+
+	Assert(iter != NULL && !iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, iter->buf, iter);
+
+	if (iter->buf->limit == iter->buf->capacity) {
+		iter->done = true;
+	}
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2525,334 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator.
+ * ----------
+ */
+static void
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest until the source is exhausted.
+ * ----------
+ */
+static void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	sp = (const unsigned char *) source->position;
+	/*
+	 * Provides a four-byte buffer to prevent sp from reading unallocated bytes.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+		if (iter->ctrlc < 8) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..2f3c7ac 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,8 +124,9 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
-static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
+static void text_position_setup(text *t1, text *t2, DetoastIterator iter,
+								Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
 static char *text_position_get_match_ptr(TextPositionState *state);
@@ -1092,10 +1095,20 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIterator iter = create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (iter != NULL)
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1126,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1121,7 +1134,7 @@ text_position(text *t1, text *t2, Oid collid)
 	if (VARSIZE_ANY_EXHDR(t1) < 1 || VARSIZE_ANY_EXHDR(t2) < 1)
 		return 0;
 
-	text_position_setup(t1, t2, collid, &state);
+	text_position_setup(t1, t2, iter, collid, &state);
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1130,7 +1143,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1148,7 +1160,7 @@ text_position(text *t1, text *t2, Oid collid)
  */
 
 static void
-text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
+text_position_setup(text *t1, text *t2, DetoastIterator iter, Oid collid, TextPositionState *state)
 {
 	int			len1 = VARSIZE_ANY_EXHDR(t1);
 	int			len2 = VARSIZE_ANY_EXHDR(t2);
@@ -1196,6 +1208,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = iter;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1358,6 +1371,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL) {
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1392,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1460,9 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	if (state->iter != NULL) {
+		free_detoast_iterator(state->iter);
+	}
 }
 
 static void
@@ -4229,7 +4253,7 @@ replace_text(PG_FUNCTION_ARGS)
 		PG_RETURN_TEXT_P(src_text);
 	}
 
-	text_position_setup(src_text, from_sub_text, PG_GET_COLLATION(), &state);
+	text_position_setup(src_text, from_sub_text, NULL, PG_GET_COLLATION(), &state);
 
 	found = text_position_next(&state);
 
@@ -4590,7 +4614,7 @@ split_text(PG_FUNCTION_ARGS)
 			PG_RETURN_TEXT_P(cstring_to_text(""));
 	}
 
-	text_position_setup(inputstring, fldsep, PG_GET_COLLATION(), &state);
+	text_position_setup(inputstring, fldsep, NULL, PG_GET_COLLATION(), &state);
 
 	/* identify bounds of first field */
 	start_ptr = VARDATA_ANY(inputstring);
@@ -4754,7 +4778,7 @@ text_to_array_internal(PG_FUNCTION_ARGS)
 													 TEXTOID, -1, false, 'i'));
 		}
 
-		text_position_setup(inputstring, fldsep, PG_GET_COLLATION(), &state);
+		text_position_setup(inputstring, fldsep, NULL, PG_GET_COLLATION(), &state);
 
 		start_ptr = VARDATA_ANY(inputstring);
 
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..049782f 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,96 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 3ff0999..446c880 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,13 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+#define PG_DETOAST_ITERATE(iter, need)									\
+	do {																\
+		Assert(need >= iter->buf->buf && need <= iter->buf->capacity);	\
+		while (!iter->done && need >= iter->buf->limit) { 				\
+			detoast_iterate(iter);										\
+		}																\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

#12John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#11)
Re: [proposal] de-TOAST'ing using a iterator

On Tue, Jul 30, 2019 at 8:20 PM Binguo Bao <djydewang@gmail.com> wrote:

John Naylor <john.naylor@2ndquadrant.com> 于2019年7月29日周一 上午11:49写道:

1). For every needle comparison, text_position_next_internal()
calculates how much of the value is needed and passes that to
detoast_iterate(), which then calculates if it has to do something or
not. This is a bit hard to follow. There might also be a performance
penalty -- the following is just a theory, but it sounds plausible:
The CPU can probably correctly predict that detoast_iterate() will
usually return the same value it did last time, but it still has to
call the function and make sure, which I imagine is more expensive
than advancing the needle. Ideally, we want to call the iterator only
if we have to.

In the attached patch (applies on top of your v5),
text_position_next_internal() simply compares hptr to the detoast
buffer limit, and calls detoast_iterate() until it can proceed. I
think this is clearer.

Yes, I think this is a general scenario where the caller continually
calls detoast_iterate until gets enough data, so I think such operations can
be extracted as a macro, as I did in patch v6. In the macro, the detoast_iterate
function is called only when the data requested by the caller is greater than the
buffer limit.

I like the use of a macro here. However, I think we can find a better
location for the definition. See the header comment of fmgr.h:
"Definitions for the Postgres function manager and function-call
interface." Maybe tuptoaster.h is as good a place as any?

I also noticed that no one updates or looks at
"toast_iter.done" so I removed that as well.

toast_iter.done is updated when the buffer limit reached the buffer capacity now.
So, I added it back.

Okay.

2). detoast_iterate() and fetch_datum_iterate() return a value but we
don't check it or do anything with it. Should we do something with it?
It's also not yet clear if we should check the iterator state instead
of return values. I've added some XXX comments as a reminder. We
should also check the return value of pglz_decompress_iterate().

IMO, we need to provide users with a simple iterative interface.
Using the required data pointer to compare with the buffer limit is an easy way.
And the application scenarios of the iterator are mostly read operations.
So I think there is no need to return a value, and the iterator needs to throw an
exception for some wrong calls, such as all the data have been iterated,
but the user still calls the iterator.

Okay, and see these functions now return void. The orignal
pglz_decompress() returned a value that was check against corruption.
Is there a similar check we can do for the iterative version?

3). Speaking of pglz_decompress_iterate(), I diff'd it with
pglz_decompress(), and I have some questions on it:

a).
+ srcend = (const unsigned char *) (source->limit == source->capacity
? source->limit : (source->limit - 4));

What does the 4 here mean in this expression?

Since we fetch chunks one by one, if we make srcend equals to the source buffer limit,
In the while loop "while (sp < srcend && dp < destend)", sp may exceed the source buffer limit and read unallocated bytes.

Why is this? That tells me the limit is incorrect. Can the setter not
determine the right value?

Giving a four-byte buffer can prevent sp from exceeding the source buffer limit.

Why 4? That's a magic number. Why not 2, or 27?

If we have read all the chunks, we don't need to be careful to cross the border,
just make srcend equal to source buffer limit. I've added comments to explain it in patch v6.

That's a good thing to comment on, but it doesn't explain why. This
logic seems like a band-aid and I think a committer would want this to
be handled in a more principled way.

Is it possible it's
compensating for this bit in init_toast_buffer()?

+ buf->limit = VARDATA(buf->buf);

It seems the initial limit should also depend on whether the datum is
compressed, right? Can we just do this:

+ buf->limit = buf->position;

I'm afraid not. buf->position points to the data portion of the buffer, but the beginning of
the chunks we read may contain header information. For example, for compressed data chunks,
the first four bytes record the size of raw data, this means that limit is four bytes ahead of position.
This initialization doesn't cause errors, although the position is less than the limit in other cases.
Because we always fetch chunks first, then decompress it.

I see what you mean now. This could use a comment or two to explain
the stated constraints may not actually be satisfied at
initialization.

b).
- while (sp < srcend && dp < destend)
...
+ while (sp + 1 < srcend && dp < destend &&
...

Why is it here "sp + 1"?

Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in patch v6 to
achieve the purpose of parsing ctrl correctly every time.

Please explain further. Was the "sp + 1" correct behavior (and why),
or only for debugging setting ctrl/c correctly? Also, I don't think
the new logic for the ctrl/c variables is an improvement:

1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
which is confusing). Any time you initialize with something not 0 or
1, it's a magic number, and here it's far from where the loop variable
is used. This is harder to read.

2. First time though the loop, iter->ctrlc = 8, which immediately gets
set back to 0.

3. At the end of the loop, iter->ctrl/c are unconditionally set. In
v5, there was a condition which would usually avoid this copying of
values through pointers.

4. Note that varlena.c has a static state variable, and a cleanup
function that currently does:

static void
text_position_cleanup(TextPositionState *state)
{
/* no cleanup needed */
}

It seems to be the detoast iterator could be embedded in this state
variable, and then free-ing can happen here. That has a possible
advantage that the iterator struct would be on the same cache line as
the state data. That would also remove the need to pass "iter" as a
parameter, since these functions already pass "state". I'm not sure if
this would be good for other users of the iterator, so maybe we can
hold off on that for now.

Good idea. I've implemented it in patch v6.

That's better, and I think we can take it a little bit farther.

1. Notice that TextPositionState is allocated on the stack in
text_position(), which passes both the "state" pointer and the "iter"
pointer to text_position_setup(), and only then sets state->iter =
iter. We can easily set this inside text_position(). That would get
rid of the need for other callers to pass NULL iter to
text_position_setup().

2. DetoastIteratorData is fixed size, so I see no reason to allocate
it on the heap. We could allocate it on the stack in text_pos(), and
pass the pointer to create_detoast_iterator() (in this case maybe a
better name is init_detoast_iterator), which would return a bool to
tell text_pos() whether to pass down the pointer or a NULL. The
allocation of other structs (toast buffer and fetch iterator) probably
can't be changed without more work.

5. Would it be a good idea to add tests (not always practical), or
more Assert()'s? You probably already know this, but as a reminder
it's good to develop with asserts enabled, but never build with them
for performance testing.

I've added more Assert()'s to check iterator state.

Okay.

BTW, I found that iterators come in handy for json/jsonb's find field value or get array elements operations.
I will continue to optimize the json/jsonb query based on the detoast iterator patch.

That will be an interesting use case.

There are other aspects of the patch I should investigate, but I'll
put that off for another time. Commitfest is over, but note that
review can happen at any time. I'll continue to do so as time permits.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#12)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

John Naylor <john.naylor@2ndquadrant.com> 于2019年8月2日周五 下午3:12写道:

On Tue, Jul 30, 2019 at 8:20 PM Binguo Bao <djydewang@gmail.com> wrote:

John Naylor <john.naylor@2ndquadrant.com> 于2019年7月29日周一 上午11:49写道:

1). For every needle comparison, text_position_next_internal()
calculates how much of the value is needed and passes that to
detoast_iterate(), which then calculates if it has to do something or
not. This is a bit hard to follow. There might also be a performance
penalty -- the following is just a theory, but it sounds plausible:
The CPU can probably correctly predict that detoast_iterate() will
usually return the same value it did last time, but it still has to
call the function and make sure, which I imagine is more expensive
than advancing the needle. Ideally, we want to call the iterator only
if we have to.

In the attached patch (applies on top of your v5),
text_position_next_internal() simply compares hptr to the detoast
buffer limit, and calls detoast_iterate() until it can proceed. I
think this is clearer.

Yes, I think this is a general scenario where the caller continually
calls detoast_iterate until gets enough data, so I think such operations

can

be extracted as a macro, as I did in patch v6. In the macro, the

detoast_iterate

function is called only when the data requested by the caller is greater

than the

buffer limit.

I like the use of a macro here. However, I think we can find a better
location for the definition. See the header comment of fmgr.h:
"Definitions for the Postgres function manager and function-call
interface." Maybe tuptoaster.h is as good a place as any?

PG_DETOAST_ITERATE isn't a sample function-call interface,
But I notice that PG_FREE_IF_COPY is also defined in fmgr.h, whose logic is
similar to PG_DETOAST_ITERATE, make condition check first and then
decide whether to call the function. Besides, PG_DETOAST_DATUM,
PG_DETOAST_DATUM_COPY, PG_DETOAST_DATUM_SLICE,
PG_DETOAST_DATUM_PACKED are all defined in fmgr.h, it is reasonable
to put all the de-TOAST interface together.

2). detoast_iterate() and fetch_datum_iterate() return a value but we

don't check it or do anything with it. Should we do something with it?
It's also not yet clear if we should check the iterator state instead
of return values. I've added some XXX comments as a reminder. We
should also check the return value of pglz_decompress_iterate().

IMO, we need to provide users with a simple iterative interface.
Using the required data pointer to compare with the buffer limit is an

easy way.

And the application scenarios of the iterator are mostly read operations.
So I think there is no need to return a value, and the iterator needs to

throw an

exception for some wrong calls, such as all the data have been iterated,
but the user still calls the iterator.

Okay, and see these functions now return void. The orignal
pglz_decompress() returned a value that was check against corruption.
Is there a similar check we can do for the iterative version?

As far as I know, we can just do such check after all compressed data is
decompressed.
If we are slicing, we can't do the check.

3). Speaking of pglz_decompress_iterate(), I diff'd it with
pglz_decompress(), and I have some questions on it:

a).
+ srcend = (const unsigned char *) (source->limit == source->capacity
? source->limit : (source->limit - 4));

What does the 4 here mean in this expression?

Since we fetch chunks one by one, if we make srcend equals to the source

buffer limit,

In the while loop "while (sp < srcend && dp < destend)", sp may exceed

the source buffer limit and read unallocated bytes.

Why is this? That tells me the limit is incorrect. Can the setter not
determine the right value?

There are three statments change `sp` value in the while loop `while (sp <
srcend && dp < destend)`:
`ctrl = *sp++;`
`off = ((sp[0]) & 0xf0) << 4) | sp[1]; sp += 2;`
`len += *sp++`
Although we make sure `sp` is less than `srcend` when enter while loop,
`sp` is likely to
go beyond the `srcend` in the loop, and we should ensure that `sp` is
always smaller than `buf->limit` to avoid
reading unallocated data. So, `srcend` can't be initialized to
`buf->limit`. Only one case is exceptional,
we've fetched all data chunks and 'buf->limit' reaches 'buf->capacity',
it's imposisble to read unallocated
data via `sp`.

Giving a four-byte buffer can prevent sp from exceeding the source buffer
limit.

Why 4? That's a magic number. Why not 2, or 27?

As I explained above, `sp` may go beyond the `srcend`in the loop, up to the
`srcend + 2`.
In theory, it's ok to set the buffer size to be greater than or equal 2.

If we have read all the chunks, we don't need to be careful to cross the

border,

just make srcend equal to source buffer limit. I've added comments to

explain it in patch v6.

That's a good thing to comment on, but it doesn't explain why.

Yes, the current comment is puzzling. I'll improve it.

This
logic seems like a band-aid and I think a committer would want this to
be handled in a more principled way.

I don't want to change pglz_decompress logic too much, the iterator should
pay more attention to saving and restoring the original pglz_decompress
state.

Is it possible it's
compensating for this bit in init_toast_buffer()?

+ buf->limit = VARDATA(buf->buf);

It seems the initial limit should also depend on whether the datum is
compressed, right? Can we just do this:

+ buf->limit = buf->position;

I'm afraid not. buf->position points to the data portion of the buffer,

but the beginning of

the chunks we read may contain header information. For example, for

compressed data chunks,

the first four bytes record the size of raw data, this means that limit

is four bytes ahead of position.

This initialization doesn't cause errors, although the position is less

than the limit in other cases.

Because we always fetch chunks first, then decompress it.

I see what you mean now. This could use a comment or two to explain
the stated constraints may not actually be satisfied at
initialization.

Done.

b).
- while (sp < srcend && dp < destend)
...
+ while (sp + 1 < srcend && dp < destend &&
...

Why is it here "sp + 1"?

Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in patch

v6 to

achieve the purpose of parsing ctrl correctly every time.

Please explain further. Was the "sp + 1" correct behavior (and why),
or only for debugging setting ctrl/c correctly?

In patch v5, If the condition is `sp < srcend`, suppose `sp = srcend - 1`
before
entering the loop `while (sp < srcend && dp < destend)`, when entering the
loop
and read a control byte(sp equals to `srcend` now), the program can't enter
the
loop `for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)`, then set
`iter->ctrlc` to 0,
exit the first loop and then this iteration is over. At the next iteration,
the control byte will be reread since `iter->ctrlc` equals to 0, but the
previous control byte
is not used. Changing the condition to `sp + 1 < srcend` avoid only one
control byte is read
then the iterator is over.

Also, I don't think
the new logic for the ctrl/c variables is an improvement:

1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
which is confusing). Any time you initialize with something not 0 or
1, it's a magic number, and here it's far from where the loop variable
is used. This is harder to read.

`iter->ctrlc` is used to record the value of `ctrl` in pglz_decompress at
the end of
the last iteration(or loop). In the pglz_decompress, `ctrlc`’s valid value
is 0~7,
When `ctrlc` reaches 8, a control byte is read from the source
buffer to `ctrl` then set `ctrlc` to 0. And a control bytes should be read
from the
source buffer to `ctrlc` on the first iteration. So `iter->ctrlc` should be
intialized with '8'.

2. First time though the loop, iter->ctrlc = 8, which immediately gets
set back to 0.

As I explained above, `iter->ctrlc = 8` make a control byte be read
from the source buffer to `ctrl` on the first iteration. Besides,
`iter->ctrlc = 8`
indicates that the valid value of `ctrlc` at the end of the last iteration
was not
recorded, Obviously, there are no other iterations before the first
iteration.

3. At the end of the loop, iter->ctrl/c are unconditionally set. In
v5, there was a condition which would usually avoid this copying of
values through pointers.

Patch v6 just records the value of `ctrlc` at the end of each iteration(or
loop)
whether it is valid (0~7) or 8, and initializes `ctrlc` on the next
iteration(or loop) correctly.
I think it is more concise in patch v6.

4. Note that varlena.c has a static state variable, and a cleanup
function that currently does:

static void
text_position_cleanup(TextPositionState *state)
{
/* no cleanup needed */
}

It seems to be the detoast iterator could be embedded in this state
variable, and then free-ing can happen here. That has a possible
advantage that the iterator struct would be on the same cache line as
the state data. That would also remove the need to pass "iter" as a
parameter, since these functions already pass "state". I'm not sure if
this would be good for other users of the iterator, so maybe we can
hold off on that for now.

Good idea. I've implemented it in patch v6.

That's better, and I think we can take it a little bit farther.

1. Notice that TextPositionState is allocated on the stack in
text_position(), which passes both the "state" pointer and the "iter"
pointer to text_position_setup(), and only then sets state->iter =
iter. We can easily set this inside text_position(). That would get
rid of the need for other callers to pass NULL iter to
text_position_setup().

Done.

2. DetoastIteratorData is fixed size, so I see no reason to allocate
it on the heap. We could allocate it on the stack in text_pos(), and
pass the pointer to create_detoast_iterator() (in this case maybe a
better name is init_detoast_iterator), which would return a bool to
tell text_pos() whether to pass down the pointer or a NULL. The
allocation of other structs (toast buffer and fetch iterator) probably
can't be changed without more work.

Done

If there is anything else that is not explained clearly, please point it
out.

--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-7.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-7.patchDownload
From 8971fbdc8c072f72918a46feec65e385995660b1 Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 458 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  37 ++-
 src/include/access/tuptoaster.h      |  90 +++++++
 src/include/fmgr.h                   |   7 +
 4 files changed, 586 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..7123f62 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
 
 
 /* ----------
@@ -347,6 +354,117 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+bool create_detoast_iterator(struct varlena *attr, DetoastIterator iterator) {
+	struct varatt_external toast_pointer;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->ctrl = 0;
+			iterator->ctrlc = 8;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->ctrl = 0;
+			iterator->ctrlc = 8;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+		return true;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return create_detoast_iterator(attr, iterator);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 * Skip the iterator and just decompress the whole thing.
+		 */
+		return false;
+	}
+
+	return false;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter) {
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter)
+{
+	FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+
+	Assert(iter != NULL && !iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, iter->buf, iter);
+
+	if (iter->buf->limit == iter->buf->capacity) {
+		iter->done = true;
+	}
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2409,3 +2527,343 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr) {
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator.
+ * ----------
+ */
+static void
+fetch_datum_iterate(FetchDatumIterator iter) {
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed) {
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constrain buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constrain is satisfied
+		 * when consume chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest until the source is exhausted.
+ * ----------
+ */
+static void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may go beyond the srcend, provides a four-byte
+	 * buffer to prevent sp from reading unallocated bytes from source buffer.
+	 * When source->limit reaches source->capacity, don't worry about reading
+	 * unallocated bytes.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc < 8) {
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838..0312dba 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,22 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIteratorData iteratorData;
+	DetoastIterator iter = &iteratorData;
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (create_detoast_iterator((struct varlena *)(DatumGetPointer(PG_GETARG_DATUM(0))), iter))
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+		iter = NULL;
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1127,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1121,6 +1135,7 @@ text_position(text *t1, text *t2, Oid collid)
 	if (VARSIZE_ANY_EXHDR(t1) < 1 || VARSIZE_ANY_EXHDR(t2) < 1)
 		return 0;
 
+	state.iter = iter;
 	text_position_setup(t1, t2, collid, &state);
 	if (!text_position_next(&state))
 		result = 0;
@@ -1130,7 +1145,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1358,6 +1372,10 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL) {
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1393,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1461,9 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	if (state->iter != NULL) {
+		free_detoast_iterator(state->iter);
+	}
 }
 
 static void
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..589bc27 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,96 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consume.
+ * limit: point to the next char to be produce.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * ----------
+ */
+extern bool create_detoast_iterator(struct varlena *attr, DetoastIterator iterator);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * detoast it into toast buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 3ff0999..446c880 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,13 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+#define PG_DETOAST_ITERATE(iter, need)									\
+	do {																\
+		Assert(need >= iter->buf->buf && need <= iter->buf->capacity);	\
+		while (!iter->done && need >= iter->buf->limit) { 				\
+			detoast_iterate(iter);										\
+		}																\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

#14John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#13)
Re: [proposal] de-TOAST'ing using a iterator

On Sat, Aug 3, 2019 at 11:11 PM Binguo Bao <djydewang@gmail.com> wrote:

John Naylor <john.naylor@2ndquadrant.com> 于2019年8月2日周五 下午3:12写道:

I like the use of a macro here. However, I think we can find a better
location for the definition. See the header comment of fmgr.h:
"Definitions for the Postgres function manager and function-call
interface." Maybe tuptoaster.h is as good a place as any?

PG_DETOAST_ITERATE isn't a sample function-call interface,
But I notice that PG_FREE_IF_COPY is also defined in fmgr.h, whose logic is
similar to PG_DETOAST_ITERATE, make condition check first and then
decide whether to call the function. Besides, PG_DETOAST_DATUM,
PG_DETOAST_DATUM_COPY, PG_DETOAST_DATUM_SLICE,
PG_DETOAST_DATUM_PACKED are all defined in fmgr.h, it is reasonable
to put all the de-TOAST interface together.

Hmm, it's strange that those macros ended up there, but now I see why
it makes sense to add new ones there also.

Okay, and see these functions now return void. The orignal
pglz_decompress() returned a value that was check against corruption.
Is there a similar check we can do for the iterative version?

As far as I know, we can just do such check after all compressed data is decompressed.
If we are slicing, we can't do the check.

Okay.

3). Speaking of pglz_decompress_iterate(), I diff'd it with
pglz_decompress(), and I have some questions on it:

a).
+ srcend = (const unsigned char *) (source->limit == source->capacity
? source->limit : (source->limit - 4));

What does the 4 here mean in this expression?

Since we fetch chunks one by one, if we make srcend equals to the source buffer limit,
In the while loop "while (sp < srcend && dp < destend)", sp may exceed the source buffer limit and read unallocated bytes.

Why is this? That tells me the limit is incorrect. Can the setter not
determine the right value?

There are three statments change `sp` value in the while loop `while (sp < srcend && dp < destend)`:
`ctrl = *sp++;`
`off = ((sp[0]) & 0xf0) << 4) | sp[1]; sp += 2;`
`len += *sp++`
Although we make sure `sp` is less than `srcend` when enter while loop, `sp` is likely to
go beyond the `srcend` in the loop, and we should ensure that `sp` is always smaller than `buf->limit` to avoid
reading unallocated data. So, `srcend` can't be initialized to `buf->limit`. Only one case is exceptional,
we've fetched all data chunks and 'buf->limit' reaches 'buf->capacity', it's imposisble to read unallocated
data via `sp`.

Thank you for the detailed explanation and the comment.

Please explain further. Was the "sp + 1" correct behavior (and why),
or only for debugging setting ctrl/c correctly?

In patch v5, If the condition is `sp < srcend`, suppose `sp = srcend - 1` before
entering the loop `while (sp < srcend && dp < destend)`, when entering the loop
and read a control byte(sp equals to `srcend` now), the program can't enter the
loop `for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)`, then set `iter->ctrlc` to 0,
exit the first loop and then this iteration is over. At the next iteration,
the control byte will be reread since `iter->ctrlc` equals to 0, but the previous control byte
is not used. Changing the condition to `sp + 1 < srcend` avoid only one control byte is read
then the iterator is over.

Okay, that's quite subtle. I agree the v6/7 way is more clear in this regard.

Also, I don't think
the new logic for the ctrl/c variables is an improvement:

1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
which is confusing). Any time you initialize with something not 0 or
1, it's a magic number, and here it's far from where the loop variable
is used. This is harder to read.

`iter->ctrlc` is used to record the value of `ctrl` in pglz_decompress at the end of
the last iteration(or loop). In the pglz_decompress, `ctrlc`’s valid value is 0~7,
When `ctrlc` reaches 8, a control byte is read from the source
buffer to `ctrl` then set `ctrlc` to 0. And a control bytes should be read from the
source buffer to `ctrlc` on the first iteration. So `iter->ctrlc` should be intialized with '8'.

My point here is it looks strange out of context, but "0" looked
normal. Maybe a comment in init_detoast_buffer(), something like "8
means read a control byte from the source buffer on the first
iteration, see pg_lzdecompress_iterate()".

Or, possibly, we could have a macro like INVALID_CTRLC. That might
even improve the readability of the original function. This is just an
idea, and maybe others would disagree, so you don't need to change it
for now.

3. At the end of the loop, iter->ctrl/c are unconditionally set. In
v5, there was a condition which would usually avoid this copying of
values through pointers.

Patch v6 just records the value of `ctrlc` at the end of each iteration(or loop)
whether it is valid (0~7) or 8, and initializes `ctrlc` on the next iteration(or loop) correctly.
I think it is more concise in patch v6.

And, in the case mentioned above where we enter the while loop with sp
= src_end - 1 , we can read the control byte and still store the
correct value for ctrlc.

[varlena.c api]

That's better, and I think we can take it a little bit farther.

1. Notice that TextPositionState is allocated on the stack in
text_position(), which passes both the "state" pointer and the "iter"
pointer to text_position_setup(), and only then sets state->iter =
iter. We can easily set this inside text_position(). That would get
rid of the need for other callers to pass NULL iter to
text_position_setup().

Done.

That looks much cleaner, thanks.

I've repeated my performance test to make sure there's no additional
regression in my suffix tests:

master patch v7
comp. end 1560s 1600ms
uncomp. end 896ms 890ms

The regression from master in the compressed case is about 2.5%, which
is no different from the last patch I tested, so that's good.

At this point, there are no functional things that I think we need to
change. It's close to ready-for-committer. For the next version, I'd
like you go through the comments and edit for grammar, spelling, and
clarity as you see fit. I know you're not a native speaker of English,
so I can help you with anything that remains. Also note we use braces
on their own lines
{
like this
}

We do have a source formatting tool (pgindent), but it helps
readability for committers to have it mostly standard beforehand.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#14)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Hi John,

Also, I don't think
the new logic for the ctrl/c variables is an improvement:

1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
which is confusing). Any time you initialize with something not 0 or
1, it's a magic number, and here it's far from where the loop variable
is used. This is harder to read.

`iter->ctrlc` is used to record the value of `ctrl` in pglz_decompress

at the end of

the last iteration(or loop). In the pglz_decompress, `ctrlc`’s valid

value is 0~7,

When `ctrlc` reaches 8, a control byte is read from the source
buffer to `ctrl` then set `ctrlc` to 0. And a control bytes should be

read from the

source buffer to `ctrlc` on the first iteration. So `iter->ctrlc` should

be intialized with '8'.

My point here is it looks strange out of context, but "0" looked
normal. Maybe a comment in init_detoast_buffer(), something like "8
means read a control byte from the source buffer on the first
iteration, see pg_lzdecompress_iterate()".

Or, possibly, we could have a macro like INVALID_CTRLC. That might
even improve the readability of the original function. This is just an
idea, and maybe others would disagree, so you don't need to change it
for now.

All in all, the idea is much better than a magic number 8. So, I've
implemented it.

At this point, there are no functional things that I think we need to
change. It's close to ready-for-committer. For the next version, I'd
like you go through the comments and edit for grammar, spelling, and
clarity as you see fit. I know you're not a native speaker of English,
so I can help you with anything that remains.

I've tried my best to improve the comments, but there should be room for
further improvement
I hope you can help me perfect it.

Also note we use braces
on their own lines
{
like this
}

Done.

--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-8.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-8.patchDownload
From 13648adb56b96a75910bb5d0a8e21d358b266d51 Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 470 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  40 ++-
 src/include/access/tuptoaster.h      |  97 ++++++++
 src/include/fmgr.h                   |   7 +
 4 files changed, 608 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 74233bb..a8924f1 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
 
 
 /* ----------
@@ -347,6 +354,125 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * init_detoast_iterator -
+ *
+ * The "iterator" variable is normally just a local variable in the caller.
+ * It only make sense to initialize de-TOAST iterator for external on-disk value.
+ *
+ * ----------
+ */
+bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator)
+{
+	struct varatt_external toast_pointer;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- initialize fetch datum iterator
+		 */
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->ctrl = 0;
+			iterator->ctrlc = INVALID_CTRLC;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->ctrl = 0;
+			iterator->ctrlc = INVALID_CTRLC;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+		return true;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return init_detoast_iterator(attr, iterator);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 * skip the iterator and just decompress the whole thing.
+		 */
+		return false;
+	}
+
+	return false;
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-TOAST iterator include buffers and
+ * fetch datum iterator.
+ *
+ * Note: "iter" variable is normally just a local variable in the caller, so
+ * shouldn't free de-TOAST iterator itself.
+ * ----------
+ */
+bool free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+	if (iter->buf != iter->fetch_datum_iterator->buf)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	return true;
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another data chunk in compression or external storage,
+ * de-TOAST it into toast buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter)
+{
+	FetchDatumIterator fetch_iter = iter->fetch_datum_iterator;
+
+	Assert(iter != NULL && !iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, iter->buf, iter);
+
+	if (iter->buf->limit == iter->buf->capacity) {
+		iter->done = true;
+	}
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2419,3 +2545,347 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr)
+{
+	int			validIndex;
+	FetchDatumIterator iterator;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iterator = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iterator->toast_pointer, attr);
+
+	iterator->ressize = iterator->toast_pointer.va_extsize;
+	iterator->numchunks = ((iterator->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iterator->toastrel = table_open(iterator->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iterator->toastrel,
+									AccessShareLock,
+									&iterator->toastidxs,
+									&iterator->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iterator->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iterator->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iterator->SnapshotToast);
+	iterator->toastscan = systable_beginscan_ordered(iterator->toastrel, iterator->toastidxs[validIndex],
+										   &iterator->SnapshotToast, 1, &iterator->toastkey);
+
+	iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iterator->buf, iterator->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iterator->toast_pointer));
+
+	iterator->nextidx = 0;
+	iterator->done = false;
+
+	return iterator;
+}
+
+static bool
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return false;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+	return true;
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in compression or external storage,
+ * fetch it into buffer in iterator.
+ * ----------
+ */
+static void
+fetch_datum_iterate(FetchDatumIterator iter)
+{
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed)
+{
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constrain buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constrain is satisfied
+		 * when consume chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static bool
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return false;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+
+	return true;
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ *		Decompresses source into dest until the source is exhausted.
+ * ----------
+ */
+static void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may go beyond the srcend, provides a four-byte
+	 * buffer to prevent sp from reading unallocated bytes from source buffer.
+	 * When source->limit reaches source->capacity, don't worry about reading
+	 * unallocated bytes.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc != INVALID_CTRLC)
+		{
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fa08b55..cb63891 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,22 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIteratorData iteratorData;
+	DetoastIterator iter = &iteratorData;
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (init_detoast_iterator((struct varlena *) (DatumGetPointer(PG_GETARG_DATUM(0))), iter))
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+		iter = NULL;
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1127,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,6 +1136,7 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
+	state.iter = iter;
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1130,7 +1145,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1196,6 +1210,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = NULL;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1358,6 +1373,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1395,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1463,10 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	if (state->iter != NULL)
+	{
+		free_detoast_iterator(state->iter);
+	}
 }
 
 static void
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..79fd4c4 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,103 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consumed.
+ * limit: point to the next char to be produced.
+ * capacity: point to the end of buffer.
+ *
+ * Constrains that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * init_detoast_iterator -
+ *
+ * Initialize de-TOAST iterator.
+ * ----------
+ */
+extern bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-TOAST iterator.
+ * ----------
+ */
+extern bool free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * de-TOAST it into buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 3ff0999..446c880 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,13 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+#define PG_DETOAST_ITERATE(iter, need)									\
+	do {																\
+		Assert(need >= iter->buf->buf && need <= iter->buf->capacity);	\
+		while (!iter->done && need >= iter->buf->limit) { 				\
+			detoast_iterate(iter);										\
+		}																\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

#16John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#15)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

On Fri, Aug 16, 2019 at 10:48 PM Binguo Bao <djydewang@gmail.com> wrote:

[v8 patch with cosmetic changes]

Okay, looks good. I'll make a few style suggestions and corrections.
In the course of looking at this again, I have a few other questions
below as well.

It looks like you already do this for the most part, but I'll mention
that we try to keep lines, including comments, less than 80 characters
long. pgindent can try to fix that, but the results don't always look
nice.

About variable names: The iterator pointers are variously called
"iter", "iterator", and "fetch_iter". I found this confusing the first
time I read this code. I think we should use "iter" if we have only
one kind in the function, and "detoast_iter" and "fetch_iter" if we
have both kinds.
--

init_detoast_iterator():

+ * The "iterator" variable is normally just a local variable in the caller.

I don't think this comment is helpful to understand this function or its use.

+ * It only make sense to initialize de-TOAST iterator for external
on-disk value.

s/make/makes/
"a" de-TOAST iterator
s/value/values/

The comments in this function that start with "This is a ..." could be
shortened like this:

/* indirect pointer -- dereference it */

While looking at this again, I noticed we no longer need to test for
the in-line compressed case at all. I also tried some other cosmetic
rearrangements. Let me know what you think about the attached patch.
Also, I wonder if the VARATT_IS_EXTERNAL_INDIRECT case should come
first. Then the two normal cases are next to eachother.

free_detoast_iterator(), free_fetch_datum_iterator(), and free_toast_buffer():

These functions should return void.

+ * Free the memory space occupied by the de-TOAST iterator include buffers and
+ * fetch datum iterator.

Perhaps "Free memory used by the de-TOAST iterator, including buffers
and fetch datum iterator."

The check

if (iter->buf != iter->fetch_datum_iterator->buf)

is what we need to do for the compressed case. Could we use this
directly instead of having a separate state variable iter->compressed,
with a macro like this?

#define TOAST_ITER_COMPRESSED(iter) \
(iter->buf != iter->fetch_datum_iterator->buf)

Or maybe that's too clever?

detoast_iterate():

+ * As long as there is another data chunk in compression or external storage,

We no longer use the iterator with in-line compressed values.

+ * de-TOAST it into toast buffer in iterator.

Maybe "into the iterator's toast buffer"

fetch_datum_iterate():

My remarks for detoast_iterate() also apply here.

init_toast_buffer():

+ * Note the constrain buf->position <= buf->limit may be broken
+ * at initialization. Make sure that the constrain is satisfied
+ * when consume chars.

s/constrain/constraint/ (2 times)
s/consume/consuming/

Also, this comment might be better at the top the whole function?

pglz_decompress_iterate():

+ * Decompresses source into dest until the source is exhausted.

This comment is from the original function, but I think it would be
better to highlight the differences from the original, something like:

"This function is based on pglz_decompress(), with these additional
requirements:

1. We need to save the current control byte and byte position for the
caller's next iteration.

2. In pglz_decompress(), we can assume we have all the source bytes
available. This is not the case when we decompress one chunk at a
time, so we have to make sure that we only read bytes available in the
current chunk."

(I'm not sure about the term 'byte position', maybe there's a better one.)

+ * In the while loop, sp may go beyond the srcend, provides a four-byte
+ * buffer to prevent sp from reading unallocated bytes from source buffer.
+ * When source->limit reaches source->capacity, don't worry about reading
+ * unallocated bytes.

Here's my suggestion:

"In the while loop, sp may be incremented such that it points beyond
srcend. To guard against reading beyond the end of the current chunk,
we set srcend such that we exit the loop when we are within four bytes
of the end of the current chunk. When source->limit reaches
source->capacity, we are decompressing the last chunk, so we can (and
need to) read every byte."

+ for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)

Note you can also replace 8 with INVALID_CTRLC here.

tuptoaster.h:
+ * Constrains that need to be satisfied:

s/constrains/constraints/

+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8

I think the macro might be better placed in pg_lzcompress.h, and for
consistency used in pglz_decompress(). Then the comment can be shorter
and more general. With my additional comment in
init_detoast_iterator(), hopefully it will be clear to readers.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

rearrange-init-iter.patchapplication/x-patch; name=rearrange-init-iter.patchDownload
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 878f8aff01..d0eeb83ab7 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -366,6 +366,8 @@ bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator)
 	struct varatt_external toast_pointer;
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
+		iterator->done = false;
+
 		/*
 		 * This is an externally stored datum --- initialize fetch datum iterator
 		 */
@@ -373,21 +375,22 @@ bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator)
 		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
 		{
-			/* If it's compressed, prepare buffer for raw data */
+			iterator->compressed = true;
+
+			/* prepare buffer to received decompressed data */
 			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
 			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
 			iterator->ctrl = 0;
 			iterator->ctrlc = INVALID_CTRLC;
-			iterator->compressed = true;
-			iterator->done = false;
 		}
 		else
 		{
-			iterator->buf = iterator->fetch_datum_iterator->buf;
-			iterator->ctrl = 0;
-			iterator->ctrlc = INVALID_CTRLC;
 			iterator->compressed = false;
-			iterator->done = false;
+
+			/* point the buffer directly at the raw data */
+			iterator->buf = iterator->fetch_datum_iterator->buf;
 		}
 		return true;
 	}
@@ -408,16 +411,11 @@ bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator)
 		return init_detoast_iterator(attr, iterator);
 
 	}
-	else if (VARATT_IS_COMPRESSED(attr))
+	else
 	{
-		/*
-		 * This is a compressed value inside of the main tuple
-		 * skip the iterator and just decompress the whole thing.
-		 */
+		/* in-line value -- no iteration used, even if it's compressed */
 		return false;
 	}
-
-	return false;
 }
 
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4688313076..1752f4f9ec 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1097,9 +1097,11 @@ textpos(PG_FUNCTION_ARGS)
 	text		*str;
 	DetoastIteratorData iteratorData;
 	DetoastIterator iter = &iteratorData;
+	struct varlena *attr = (struct varlena *)
+								DatumGetPointer(PG_GETARG_DATUM(0));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	if (init_detoast_iterator((struct varlena *) (DatumGetPointer(PG_GETARG_DATUM(0))), iter))
+	if (init_detoast_iterator(attr, iter))
 	{
 		str = (text *) iter->buf->buf;
 	}
#17Binguo Bao
djydewang@gmail.com
In reply to: John Naylor (#16)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

John Naylor <john.naylor@2ndquadrant.com> 于2019年8月19日周一 下午12:55写道:

init_toast_buffer():

+ * Note the constrain buf->position <= buf->limit may be broken
+ * at initialization. Make sure that the constrain is satisfied
+ * when consume chars.

s/constrain/constraint/ (2 times)
s/consume/consuming/

Also, this comment might be better at the top the whole function?

The constraint is broken in the if branch, so I think put this comment in
the branch
is more precise.

The check

if (iter->buf != iter->fetch_datum_iterator->buf)
is what we need to do for the compressed case. Could we use this
directly instead of having a separate state variable iter->compressed,
with a macro like this?
#define TOAST_ITER_COMPRESSED(iter) \
(iter->buf != iter->fetch_datum_iterator->buf)

The logic of the macro may be hard to understand, so I think it's ok to
just check the compressed state variable.

+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that

+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8

I think the macro might be better placed in pg_lzcompress.h, and for
consistency used in pglz_decompress(). Then the comment can be shorter
and more general. With my additional comment in
init_detoast_iterator(), hopefully it will be clear to readers.

The main role of this macro is to explain the iterator's "ctrlc" state, IMO
it's reasonable to put
the macro and definition of de-TOAST iterator together.

Thanks for your suggestion, I have updated the patch.
--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-9.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-9.patchDownload
From d8d14600e2be0c11f233bf3302a823a8c4f19c2a Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 472 +++++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c      |  42 +++-
 src/include/access/tuptoaster.h      |  97 +++++++
 src/include/fmgr.h                   |   7 +
 4 files changed, 612 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 74233bb..ba05fd1 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static void free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static void free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
 
 
 /* ----------
@@ -347,6 +354,119 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* ----------
+ * init_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+bool init_detoast_iterator(struct varlena *attr, DetoastIterator iter)
+{
+	struct varatt_external toast_pointer;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		iter->done = false;
+
+		/*
+		 * This is an externally stored datum --- initialize fetch datum iterator
+		 */
+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			iter->compressed = true;
+
+			/* prepare buffer to received decompressed data */
+			iter->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iter->buf, toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
+			iter->ctrl = 0;
+			iter->ctrlc = INVALID_CTRLC;
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}
+		return true;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		 /* indirect pointer --- dereference it */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return init_detoast_iterator(attr, iter);
+
+	}
+	else
+	{
+		/* in-line value -- no iteration used, even if it's compressed */
+		return false;
+	}
+}
+
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ *
+ * Note: "iter" variable is normally just a local variable in the caller, so
+ * shouldn't free de-TOAST iterator itself.
+ * ----------
+ */
+void free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		return;
+	}
+	if (iter->compressed)
+	{
+		free_toast_buffer(iter->buf);
+	}
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+}
+
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another data chunk in external storage,
+ * de-TOAST it into iterator's toast buffer.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator detoast_iter)
+{
+	FetchDatumIterator fetch_iter = detoast_iter->fetch_datum_iterator;
+
+	Assert(detoast_iter != NULL && !detoast_iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (detoast_iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, detoast_iter->buf, detoast_iter);
+
+	if (detoast_iter->buf->limit == detoast_iter->buf->capacity) {
+		detoast_iter->done = true;
+	}
+}
+
+
+/* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
@@ -2419,3 +2539,355 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+static FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr)
+{
+	int			validIndex;
+	FetchDatumIterator iter;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");
+
+	iter = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iter->toast_pointer, attr);
+
+	iter->ressize = iter->toast_pointer.va_extsize;
+	iter->numchunks = ((iter->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iter->toastrel = table_open(iter->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iter->toastrel,
+									AccessShareLock,
+									&iter->toastidxs,
+									&iter->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iter->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iter->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iter->SnapshotToast);
+	iter->toastscan = systable_beginscan_ordered(iter->toastrel, iter->toastidxs[validIndex],
+										   &iter->SnapshotToast, 1, &iter->toastkey);
+
+	iter->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	init_toast_buffer(iter->buf, iter->ressize + VARHDRSZ, VARATT_EXTERNAL_IS_COMPRESSED(iter->toast_pointer));
+
+	iter->nextidx = 0;
+	iter->done = false;
+
+	return iter;
+}
+
+static void
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+	{
+		return;
+	}
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in external storage,
+ * fetch it into iterator's toast buffer.
+ * ----------
+ */
+static void
+fetch_datum_iterate(FetchDatumIterator iter)
+{
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed)
+{
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constraint buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constraint is satisfied
+		 * when consuming chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+	buf->buf_size = size;
+}
+
+
+static void
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+	{
+		return;
+	}
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+}
+
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ * This function is based on pglz_decompress(), with these additional
+ * requirements:
+ *
+ * 1. We need to save the current control byte and byte position for the
+ * caller's next iteration.
+ *
+ * 2. In pglz_decompress(), we can assume we have all the source bytes
+ * available. This is not the case when we decompress one chunk at a
+ * time, so we have to make sure that we only read bytes available in the
+ * current chunk.
+ * ----------
+ */
+static void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may be incremented such that it points beyond
+	 * srcend. To guard against reading beyond the end of the current chunk,
+	 * we set srcend such that we exit the loop when we are within four bytes
+	 * of the end of the current chunk. When source->limit reaches
+	 * source->capacity, we are decompressing the last chunk, so we can (and
+	 * need to) read every byte.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc != INVALID_CTRLC)
+		{
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < INVALID_CTRLC && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fa08b55..73e5805 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,24 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	DetoastIteratorData iteratorData;
+	DetoastIterator iter = &iteratorData;
+	struct varlena *attr = (struct varlena *)
+								DatumGetPointer(PG_GETARG_DATUM(0));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	if (init_detoast_iterator(attr, iter))
+	{
+		str = (text *) iter->buf->buf;
+	}
+	else
+	{
+		str = PG_GETARG_TEXT_PP(0);
+		iter = NULL;
+	}
+
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1129,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,6 +1138,7 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
+	state.iter = iter;
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1130,7 +1147,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1196,6 +1212,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = NULL;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1358,6 +1375,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1397,11 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+			{
+				PG_DETOAST_ITERATE(state->iter, hptr);
+			}
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1465,10 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	if (state->iter != NULL)
+	{
+		free_detoast_iterator(state->iter);
+	}
 }
 
 static void
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index f0aea24..505ac81 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -17,6 +17,103 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consumed.
+ * limit: point to the next char to be produced.
+ * capacity: point to the end of buffer.
+ *
+ * Constraints that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+	int32		buf_size;
+} ToastBuffer;
+
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			SnapshotToast;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * init_detoast_iterator -
+ *
+ * Initialize de-TOAST iterator.
+ * ----------
+ */
+extern bool init_detoast_iterator(struct varlena *attr, DetoastIterator iter);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-TOAST iterator.
+ * ----------
+ */
+extern void free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another slice in compression or external storage,
+ * de-TOAST it into buffer in iterator.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator detoast_iter);
+
+#endif
+
+
 /*
  * This enables de-toasting of index entries.  Needed until VACUUM is
  * smart enough to rebuild indexes from scratch.
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 3ff0999..446c880 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,13 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+#define PG_DETOAST_ITERATE(iter, need)									\
+	do {																\
+		Assert(need >= iter->buf->buf && need <= iter->buf->capacity);	\
+		while (!iter->done && need >= iter->buf->limit) { 				\
+			detoast_iterate(iter);										\
+		}																\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

#18John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#17)
Re: [proposal] de-TOAST'ing using a iterator

On Thu, Aug 22, 2019 at 12:10 AM Binguo Bao <djydewang@gmail.com> wrote:

[v9 patch]

Thanks, looks good. I'm setting it to ready for committer.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Binguo Bao (#17)
Re: [proposal] de-TOAST'ing using a iterator
+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed)
+{
+	buf->buf = (const char *) palloc0(size);

This API is weird -- you always palloc the ToastBuffer first, then call
init_toast_bufer on it. Why not palloc the ToastBuffer struct in
init_toast_buffer and return it from there instead? This is
particularly strange since the ToastBuffer itself is freed by the "free"
routine ... so it's not like we're thinking that caller can take
ownership of the struct by embedding it in a larger struct.

Also, this function needs a comment on top explaining what it does and
what the params are.

Why do we need ToastBuffer->buf_size? Seems unused.

+	if (iter == NULL)
+	{
+		return;
+	}

Please, no braces around single-statement blocks. (Many places).

+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8

What does CTRLC stand for? Also: this comment should explain why the
value 8 is what it is.

+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}

So why not use memmove?

+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;

Starting this para with "Otherwise" makes no sense, since there's no
previous opposite case. Please reword. However, I don't recognize this
code from anywhere, and it seems to have a lot of magical numbers. Is
this code completely new?

Didn't much like FetchDatumIteratorData SnapshotToast struct member
name. How about just "snapshot"?

+#define PG_DETOAST_ITERATE(iter, need)									\
+	do {																\
+		Assert(need >= iter->buf->buf && need <= iter->buf->capacity);	\
+		while (!iter->done && need >= iter->buf->limit) { 				\
+			detoast_iterate(iter);										\
+		}																\
+	} while (0)

This needs parens around each "iter" and "need" in the macro definition.
Also, please add a comment documenting what the arguments are, since
it's not immediately obvious.

+void free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+	{
+		return;
+	}

If this function is going to do this, why do callers need to check for
NULL also? Seems pointless. I'd rather make callers simpler and keep
only the NULL-check inside the function, since this is not perf-critical
anyway.

+extern void detoast_iterate(DetoastIterator detoast_iter)
+{

Please, no "extern" in function definitions, only in prototypes in the
.h files. Also, we indent the function name at the start of line, with
the return type appearing on its own in the previous line.

+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_itearator shouldn't be called for non-ondisk datums");

Typo for "iterator".

+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
[...]
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}

This arrangement where there are two ToastBuffers and they sometimes are
the same is cute, but I think we need a better way to know when each
needs to be freed afterwards; the proposed coding is confusing. And it
certainly it needs more than zero comments about what's going on there.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Alvaro Herrera from 2ndQuadrant
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#19)
Re: [proposal] de-TOAST'ing using a iterator

Also: this patch no longer applies. Please rebase.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Binguo Bao
djydewang@gmail.com
In reply to: Alvaro Herrera (#19)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Alvaro Herrera <alvherre@2ndquadrant.com> 于2019年9月4日周三 上午4:12写道:

+static void
+init_toast_buffer(ToastBuffer *buf, int32 size, bool compressed)
+{
+     buf->buf = (const char *) palloc0(size);

This API is weird -- you always palloc the ToastBuffer first, then call
init_toast_bufer on it. Why not palloc the ToastBuffer struct in
init_toast_buffer and return it from there instead? This is
particularly strange since the ToastBuffer itself is freed by the "free"
routine ... so it's not like we're thinking that caller can take
ownership of the struct by embedding it in a larger struct.

I agree with you. I also change "init_detoast_iterator" to
"create_detoast_iterator"
so the caller doesn't need to manage the memory allocation of the iterator

Also, this function needs a comment on top explaining what it does and
what the params are.

Done.

Why do we need ToastBuffer->buf_size? Seems unused.

+     if (iter == NULL)
+     {
+             return;
+     }

Removed.

Please, no braces around single-statement blocks. (Many places).

Done.

+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8

What does CTRLC stand for? Also: this comment should explain why the
value 8 is what it is.

I've improved the comment.

+                             /*
+                              * Now we copy the bytes specified by the

tag from OUTPUT to

+ * OUTPUT. It is dangerous and platform

dependent to use

+ * memcpy() here, because the copied areas

could overlap

+                              * extremely!
+                              */
+                             len = Min(len, destend - dp);
+                             while (len--)
+                             {
+                                     *dp = dp[-off];
+                                     dp++;
+                             }

So why not use memmove?

+                             /*
+                              * Otherwise it contains the match length

minus 3 and the

+ * upper 4 bits of the offset. The next

following byte

+ * contains the lower 8 bits of the

offset. If the length is

+ * coded as 18, another extension tag byte

tells how much

+                              * longer the match really was (0-255).
+                              */
+                             int32           len;
+                             int32           off;
+
+                             len = (sp[0] & 0x0f) + 3;
+                             off = ((sp[0] & 0xf0) << 4) | sp[1];
+                             sp += 2;
+                             if (len == 18)
+                                     len += *sp++;

Starting this para with "Otherwise" makes no sense, since there's no
previous opposite case. Please reword. However, I don't recognize this
code from anywhere, and it seems to have a lot of magical numbers. Is
this code completely new?

This function is based on pglz_decompress() in src/common/pg_lzcompress.c
and I've
mentioned that in the function's comment at the beginning.

Didn't much like FetchDatumIteratorData SnapshotToast struct member
name. How about just "snapshot"?

Done.

+#define PG_DETOAST_ITERATE(iter, need)
\

+ do {

\

+ Assert(need >= iter->buf->buf && need <=

iter->buf->capacity); \

+ while (!iter->done && need >= iter->buf->limit) {

\

+ detoast_iterate(iter);

\

+ }

\

+ } while (0)

This needs parens around each "iter" and "need" in the macro definition.
Also, please add a comment documenting what the arguments are, since
it's not immediately obvious.

Parens makes the macro more reliable. Done.

+void free_detoast_iterator(DetoastIterator iter)

+{
+     if (iter == NULL)
+     {
+             return;
+     }

If this function is going to do this, why do callers need to check for
NULL also? Seems pointless. I'd rather make callers simpler and keep
only the NULL-check inside the function, since this is not perf-critical
anyway.

Good catch. Done.

+ iter->fetch_datum_iterator =

create_fetch_datum_iterator(attr);

+             VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+             if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+             {
[...]
+             }
+             else
+             {
+                     iter->compressed = false;
+
+                     /* point the buffer directly at the raw data */
+                     iter->buf = iter->fetch_datum_iterator->buf;
+             }

This arrangement where there are two ToastBuffers and they sometimes are
the same is cute, but I think we need a better way to know when each
needs to be freed afterwards;

We only need to check the "compressed" field in the iterator to figure out
which buffer should be freed.

--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-10.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-10.patchDownload
From 90b8ee9af981a3d7a36edb275f7c8c7d3fdebfb6 Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/common/detoast.c         | 103 ++++++++
 src/backend/access/common/toast_internals.c | 355 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c             |  29 ++-
 src/include/access/detoast.h                |  97 ++++++++
 src/include/access/toast_internals.h        |  10 +
 src/include/fmgr.h                          |  13 +
 6 files changed, 601 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index c8b49d6..905deac 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -290,6 +290,109 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 }
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+DetoastIterator
+create_detoast_iterator(struct varlena *attr)
+{
+	struct varatt_external toast_pointer;
+	DetoastIterator iter;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		iter = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iter->done = false;
+
+		/* This is an externally stored datum --- initialize fetch datum iterator */
+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			iter->compressed = true;
+
+			/* prepare buffer to received decompressed data */
+			iter->buf = create_toast_buffer(toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
+			iter->ctrl = 0;
+			iter->ctrlc = INVALID_CTRLC;
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}
+		return iter;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/* indirect pointer --- dereference it */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return create_detoast_iterator(attr);
+
+	}
+	else
+		/* in-line value -- no iteration used, even if it's compressed */
+		return NULL;
+}
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+void
+free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+		return;
+	if (iter->compressed)
+		free_toast_buffer(iter->buf);
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	pfree(iter);
+}
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another data chunk in external storage,
+ * de-TOAST it into iterator's toast buffer.
+ * ----------
+ */
+void
+detoast_iterate(DetoastIterator detoast_iter)
+{
+	FetchDatumIterator fetch_iter = detoast_iter->fetch_datum_iterator;
+
+	Assert(detoast_iter != NULL && !detoast_iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (detoast_iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, detoast_iter->buf, detoast_iter);
+
+	if (detoast_iter->buf->limit == detoast_iter->buf->capacity)
+		detoast_iter->done = true;
+}
+
+/* ----------
  * toast_fetch_datum -
  *
  *	Reconstruct an in memory Datum from the chunks saved
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index a971242..ac8ae77 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -630,3 +630,358 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr)
+{
+	int			validIndex;
+	FetchDatumIterator iter;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_iterator shouldn't be called for non-ondisk datums");
+
+	iter = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iter->toast_pointer, attr);
+
+	iter->ressize = iter->toast_pointer.va_extsize;
+	iter->numchunks = ((iter->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iter->toastrel = table_open(iter->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iter->toastrel,
+									AccessShareLock,
+									&iter->toastidxs,
+									&iter->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iter->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iter->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iter->snapshot);
+	iter->toastscan = systable_beginscan_ordered(iter->toastrel, iter->toastidxs[validIndex],
+												 &iter->snapshot, 1, &iter->toastkey);
+
+	iter->buf = create_toast_buffer(iter->ressize + VARHDRSZ,
+		VARATT_EXTERNAL_IS_COMPRESSED(iter->toast_pointer));
+
+	iter->nextidx = 0;
+	iter->done = false;
+
+	return iter;
+}
+
+void
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+		return;
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in external storage,
+ * fetch it into iterator's toast buffer.
+ * ----------
+ */
+void
+fetch_datum_iterate(FetchDatumIterator iter)
+{
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+/* ----------
+ * create_toast_buffer -
+ *
+ * Create and initialize a TOAST buffer.
+ *
+ * size: buffer size include header
+ * compressed: whether TOAST value is compressed
+ * ----------
+ */
+ToastBuffer *
+create_toast_buffer(int32 size, bool compressed)
+{
+	ToastBuffer *buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constraint buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constraint is satisfied
+		 * when consuming chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+
+	return buf;
+}
+
+void
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+		return;
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+}
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ * This function is based on pglz_decompress(), with these additional
+ * requirements:
+ *
+ * 1. We need to save the current control byte and byte position for the
+ * caller's next iteration.
+ *
+ * 2. In pglz_decompress(), we can assume we have all the source bytes
+ * available. This is not the case when we decompress one chunk at a
+ * time, so we have to make sure that we only read bytes available in the
+ * current chunk.
+ * ----------
+ */
+void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may be incremented such that it points beyond
+	 * srcend. To guard against reading beyond the end of the current chunk,
+	 * we set srcend such that we exit the loop when we are within four bytes
+	 * of the end of the current chunk. When source->limit reaches
+	 * source->capacity, we are decompressing the last chunk, so we can (and
+	 * need to) read every byte.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc != INVALID_CTRLC)
+		{
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < INVALID_CTRLC && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d36156f..e93870b 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,18 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	struct varlena *attr = (struct varlena *)
+								DatumGetPointer(PG_GETARG_DATUM(0));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
+	DetoastIterator iter = create_detoast_iterator(attr);
+
+	if (iter != NULL)
+		str = (text *) iter->buf->buf;
+	else
+		str = PG_GETARG_TEXT_PP(0);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1123,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,6 +1132,7 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
+	state.iter = iter;
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1130,7 +1141,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1196,6 +1206,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = NULL;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1358,6 +1369,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1389,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1455,7 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	free_detoast_iterator(state->iter);
 }
 
 static void
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 02029a9..0aa38eb 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -73,12 +73,109 @@ extern struct varlena *heap_tuple_untoast_attr_slice(struct varlena *attr,
 							  int32 sliceoffset,
 							  int32 slicelength);
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consumed.
+ * limit: point to the next char to be produced.
+ * capacity: point to the end of buffer.
+ *
+ * Constraints that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+} ToastBuffer;
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			snapshot;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+extern void free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another data chunk in external storage,
+ * de-TOAST it into iterator's toast buffer.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator detoast_iter);
+
+#endif
+
 /* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
  * ----------
  */
+
 extern Size toast_raw_datum_size(Datum value);
 
 /* ----------
diff --git a/src/include/access/toast_internals.h b/src/include/access/toast_internals.h
index 494b07a..18c7000 100644
--- a/src/include/access/toast_internals.h
+++ b/src/include/access/toast_internals.h
@@ -15,6 +15,7 @@
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
+#include "detoast.h"
 
 /*
  *	The information at the start of the compressed toast data.
@@ -51,4 +52,13 @@ extern void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 extern void init_toast_snapshot(Snapshot toast_snapshot);
 
+
+extern FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+extern void free_fetch_datum_iterator(FetchDatumIterator iter);
+extern void fetch_datum_iterate(FetchDatumIterator iter);
+extern ToastBuffer *create_toast_buffer(int32 size, bool compressed);
+extern void free_toast_buffer(ToastBuffer *buf);
+extern void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
+
 #endif							/* TOAST_INTERNALS_H */
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 29ae467..6ceb6bd 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,19 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+
+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a pointer
+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need)											\
+	do {																		\
+		Assert((need) >= (iter)->buf->buf && (need) <= (iter)->buf->capacity);	\
+		while (!(iter)->done && (need) >= (iter)->buf->limit) { 				\
+			detoast_iterate(iter);												\
+		}																		\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Binguo Bao (#21)
Re: [proposal] de-TOAST'ing using a iterator

On 2019-Sep-10, Binguo Bao wrote:

+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a pointer
+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need)											\
+	do {																		\
+		Assert((need) >= (iter)->buf->buf && (need) <= (iter)->buf->capacity);	\
+		while (!(iter)->done && (need) >= (iter)->buf->limit) { 				\
+			detoast_iterate(iter);												\
+		}																		\
+	} while (0)
/* WARNING -- unaligned pointer */
#define PG_DETOAST_DATUM_PACKED(datum) \
pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))

In broad terms this patch looks pretty good to me. I only have a small
quibble with this API definition in fmgr.h -- namely that it forces us
to export the definition of all the structs (that could otherwise be
private to toast_internals.h) in order to satisfy callers of this macro.
I am wondering if it would be possible to change detoast_iterate and
PG_DETOAST_ITERATE in a way that those details remain hidden -- I am
thinking something like "if this returns NULL, then iteration has
finished"; and relieve the macro from doing the "->buf->buf" and
"->buf->limit" checks. I think changing that would require a change in
how the rest of the code is structured around this (the textpos internal
function), but seems like it would be better overall.

(AFAICS that would enable us to expose much less about the
iterator-related structs to detoast.h -- you should be able to move the
struct defs to toast_internals.h)

Then again, it might be just wishful thinking, but it seems worth
considering at least.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#22)
Re: [proposal] de-TOAST'ing using a iterator

On Mon, Sep 16, 2019 at 06:22:51PM -0300, Alvaro Herrera wrote:

On 2019-Sep-10, Binguo Bao wrote:

+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a pointer
+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need)											\
+	do {																		\
+		Assert((need) >= (iter)->buf->buf && (need) <= (iter)->buf->capacity);	\
+		while (!(iter)->done && (need) >= (iter)->buf->limit) { 				\
+			detoast_iterate(iter);												\
+		}																		\
+	} while (0)
/* WARNING -- unaligned pointer */
#define PG_DETOAST_DATUM_PACKED(datum) \
pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))

In broad terms this patch looks pretty good to me. I only have a small
quibble with this API definition in fmgr.h -- namely that it forces us
to export the definition of all the structs (that could otherwise be
private to toast_internals.h) in order to satisfy callers of this macro.
I am wondering if it would be possible to change detoast_iterate and
PG_DETOAST_ITERATE in a way that those details remain hidden -- I am
thinking something like "if this returns NULL, then iteration has
finished"; and relieve the macro from doing the "->buf->buf" and
"->buf->limit" checks. I think changing that would require a change in
how the rest of the code is structured around this (the textpos internal
function), but seems like it would be better overall.

(AFAICS that would enable us to expose much less about the
iterator-related structs to detoast.h -- you should be able to move the
struct defs to toast_internals.h)

Then again, it might be just wishful thinking, but it seems worth
considering at least.

I do agree hiding the exact struct definition would be nice. IMHO if the
only reason for exposing it is the PG_DETOAST_ITERATE() macro (or rather
the references to buf fields in it) then we can simply provide functions
to return those fields.

Granted, that may have impact on performance, but I'm not sure it'll be
even measurable. Also, the other detoast macros right before this new
one are also ultimately just a function calls.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Binguo Bao
djydewang@gmail.com
In reply to: Alvaro Herrera (#22)
3 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

Alvaro Herrera <alvherre@2ndquadrant.com> 于2019年9月17日周二 上午5:51写道:

On 2019-Sep-10, Binguo Bao wrote:

+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a

pointer

+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need)

\

+ do {

\

+ Assert((need) >= (iter)->buf->buf && (need) <=

(iter)->buf->capacity); \

+ while (!(iter)->done && (need) >= (iter)->buf->limit) {

\

+ detoast_iterate(iter);

\

+ }

\

+ } while (0)
/* WARNING -- unaligned pointer */
#define PG_DETOAST_DATUM_PACKED(datum) \
pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))

In broad terms this patch looks pretty good to me. I only have a small
quibble with this API definition in fmgr.h -- namely that it forces us
to export the definition of all the structs (that could otherwise be
private to toast_internals.h) in order to satisfy callers of this macro.
I am wondering if it would be possible to change detoast_iterate and
PG_DETOAST_ITERATE in a way that those details remain hidden -- I am
thinking something like "if this returns NULL, then iteration has
finished"; and relieve the macro from doing the "->buf->buf" and
"->buf->limit" checks. I think changing that would require a change in
how the rest of the code is structured around this (the textpos internal
function), but seems like it would be better overall.

(AFAICS that would enable us to expose much less about the
iterator-related structs to detoast.h -- you should be able to move the
struct defs to toast_internals.h)

Then again, it might be just wishful thinking, but it seems worth
considering at least.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I've tied to hide the details of the struct in patch v11 with checking
"need" pointer
inside detoast_iterate function. I also compared the performance of the two
versions.

patch v10 patch v11
comp. beg. 1413ms 1489ms
comp. end 24327ms 28011ms
uncomp. beg. 1439ms 1432ms
uncomp. end 25019ms 29007ms

We can see that v11 is about 15% slower than v10 on suffix queries since
this involves
the complete de-TOASTing and detoast_iterate() function is called
frequently in v11.

Personally, I prefer patch v10. Its performance is superior, although it
exposes some struct details.

--
Best regards,
Binguo Bao

Attachments:

0001-de-TOASTing-using-a-iterator-11.patchtext/x-patch; charset=US-ASCII; name=0001-de-TOASTing-using-a-iterator-11.patchDownload
From 8f6381d272816175d3e681c9d1cd8c6b5f27f44f Mon Sep 17 00:00:00 2001
From: BBG <djydewang@gmail.com>
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/common/detoast.c         | 110 +++++++++
 src/backend/access/common/toast_internals.c | 355 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c             |  29 ++-
 src/include/access/detoast.h                |  32 +++
 src/include/access/toast_internals.h        |  75 ++++++
 src/include/fmgr.h                          |   8 +
 6 files changed, 603 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index c8b49d6..f437d27 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -290,6 +290,116 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 }
 
 /* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+DetoastIterator
+create_detoast_iterator(struct varlena *attr)
+{
+	struct varatt_external toast_pointer;
+	DetoastIterator iter;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		iter = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iter->done = false;
+
+		/* This is an externally stored datum --- initialize fetch datum iterator */
+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			iter->compressed = true;
+
+			/* prepare buffer to received decompressed data */
+			iter->buf = create_toast_buffer(toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
+			iter->ctrl = 0;
+			iter->ctrlc = INVALID_CTRLC;
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}
+		return iter;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/* indirect pointer --- dereference it */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return create_detoast_iterator(attr);
+
+	}
+	else
+		/* in-line value -- no iteration used, even if it's compressed */
+		return NULL;
+}
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+void
+free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+		return;
+	if (iter->compressed)
+		free_toast_buffer(iter->buf);
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	pfree(iter);
+}
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * "need" is a pointer between the beginning and end of iterator's
+ * ToastBuffer, de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ * ----------
+ */
+void
+detoast_iterate(DetoastIterator detoast_iter, const char *need)
+{
+	FetchDatumIterator fetch_iter = detoast_iter->fetch_datum_iterator;
+
+	Assert(detoast_iter != NULL);
+	if (detoast_iter->done)
+		return;
+
+	Assert((need) >= (detoast_iter)->buf->buf && (need) <= (detoast_iter)->buf->capacity);
+
+	while (!(detoast_iter)->done && (need) >= (detoast_iter)->buf->limit)
+	{
+		fetch_datum_iterate(fetch_iter);
+
+		if (detoast_iter->compressed)
+			pglz_decompress_iterate(fetch_iter->buf, detoast_iter->buf, detoast_iter);
+
+		if (detoast_iter->buf->limit == detoast_iter->buf->capacity)
+			detoast_iter->done = true;
+	}
+}
+
+/* ----------
  * toast_fetch_datum -
  *
  *	Reconstruct an in memory Datum from the chunks saved
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index a971242..ac8ae77 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -630,3 +630,358 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr)
+{
+	int			validIndex;
+	FetchDatumIterator iter;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_iterator shouldn't be called for non-ondisk datums");
+
+	iter = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iter->toast_pointer, attr);
+
+	iter->ressize = iter->toast_pointer.va_extsize;
+	iter->numchunks = ((iter->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iter->toastrel = table_open(iter->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iter->toastrel,
+									AccessShareLock,
+									&iter->toastidxs,
+									&iter->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iter->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iter->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iter->snapshot);
+	iter->toastscan = systable_beginscan_ordered(iter->toastrel, iter->toastidxs[validIndex],
+												 &iter->snapshot, 1, &iter->toastkey);
+
+	iter->buf = create_toast_buffer(iter->ressize + VARHDRSZ,
+		VARATT_EXTERNAL_IS_COMPRESSED(iter->toast_pointer));
+
+	iter->nextidx = 0;
+	iter->done = false;
+
+	return iter;
+}
+
+void
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+		return;
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in external storage,
+ * fetch it into iterator's toast buffer.
+ * ----------
+ */
+void
+fetch_datum_iterate(FetchDatumIterator iter)
+{
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+/* ----------
+ * create_toast_buffer -
+ *
+ * Create and initialize a TOAST buffer.
+ *
+ * size: buffer size include header
+ * compressed: whether TOAST value is compressed
+ * ----------
+ */
+ToastBuffer *
+create_toast_buffer(int32 size, bool compressed)
+{
+	ToastBuffer *buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constraint buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constraint is satisfied
+		 * when consuming chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+
+	return buf;
+}
+
+void
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+		return;
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+}
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ * This function is based on pglz_decompress(), with these additional
+ * requirements:
+ *
+ * 1. We need to save the current control byte and byte position for the
+ * caller's next iteration.
+ *
+ * 2. In pglz_decompress(), we can assume we have all the source bytes
+ * available. This is not the case when we decompress one chunk at a
+ * time, so we have to make sure that we only read bytes available in the
+ * current chunk.
+ * ----------
+ */
+void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may be incremented such that it points beyond
+	 * srcend. To guard against reading beyond the end of the current chunk,
+	 * we set srcend such that we exit the loop when we are within four bytes
+	 * of the end of the current chunk. When source->limit reaches
+	 * source->capacity, we are decompressing the last chunk, so we can (and
+	 * need to) read every byte.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc != INVALID_CTRLC)
+		{
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < INVALID_CTRLC && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT. It is dangerous and platform dependent to use
+				 * memcpy() here, because the copied areas could overlap
+				 * extremely!
+				 */
+				len = Min(len, destend - dp);
+				while (len--)
+				{
+					*dp = dp[-off];
+					dp++;
+				}
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 722b2c7..b03aa94 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,18 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	struct varlena *attr = (struct varlena *)
+								DatumGetPointer(PG_GETARG_DATUM(0));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
+	DetoastIterator iter = create_detoast_iterator(attr);
+
+	if (iter != NULL)
+		str = (text *) iter->buf->buf;
+	else
+		str = PG_GETARG_TEXT_PP(0);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1123,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1122,6 +1132,7 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
+	state.iter = iter;
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1130,7 +1141,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1196,6 +1206,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = NULL;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1358,6 +1369,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1375,6 +1389,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1438,7 +1455,7 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	free_detoast_iterator(state->iter);
 }
 
 static void
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 02029a9..f77d102 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -11,6 +11,7 @@
  */
 #ifndef DETOAST_H
 #define DETOAST_H
+#include "toast_internals.h"
 
 /*
  * Testing whether an externally-stored value is compressed now requires
@@ -73,12 +74,43 @@ extern struct varlena *heap_tuple_untoast_attr_slice(struct varlena *attr,
 							  int32 sliceoffset,
 							  int32 slicelength);
 
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+extern void free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * "need" is a pointer between the beginning and end of iterator's
+ * ToastBuffer, de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ * ----------
+ */
+extern void detoast_iterate(DetoastIterator detoast_iter, const char *need);
+
 /* ----------
  * toast_raw_datum_size -
  *
  *	Return the raw (detoasted) size of a varlena datum
  * ----------
  */
+
 extern Size toast_raw_datum_size(Datum value);
 
 /* ----------
diff --git a/src/include/access/toast_internals.h b/src/include/access/toast_internals.h
index 494b07a..5034e2a 100644
--- a/src/include/access/toast_internals.h
+++ b/src/include/access/toast_internals.h
@@ -51,4 +51,79 @@ extern void toast_close_indexes(Relation *toastidxs, int num_indexes,
 								LOCKMODE lock);
 extern void init_toast_snapshot(Snapshot toast_snapshot);
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consumed.
+ * limit: point to the next char to be produced.
+ * capacity: point to the end of buffer.
+ *
+ * Constraints that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+} ToastBuffer;
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			snapshot;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+#endif
+
+extern FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+extern void free_fetch_datum_iterator(FetchDatumIterator iter);
+extern void fetch_datum_iterate(FetchDatumIterator iter);
+extern ToastBuffer *create_toast_buffer(int32 size, bool compressed);
+extern void free_toast_buffer(ToastBuffer *buf);
+extern void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
+
 #endif							/* TOAST_INTERNALS_H */
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 29ae467..b1dec0e 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,14 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+
+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a pointer
+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need) \
+	detoast_iterate(iter, (const char *)need);
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
-- 
2.7.4

init-test.shapplication/x-shellscript; name=init-test.shDownload
iterator-test.shapplication/x-shellscript; name=iterator-test.shDownload
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Binguo Bao (#24)
Re: [proposal] de-TOAST'ing using a iterator

Paul Ramsey, do you have opinions to share about this patch? I think
PostGIS might benefit from it. Thread starts here:

/messages/by-id/CAL-OGks_onzpc9M9bXPCztMofWULcFkyeCeKiAgXzwRL8kXiag@mail.gmail.com

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#26Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Alvaro Herrera (#25)
Re: [proposal] de-TOAST'ing using a iterator

On Sep 23, 2019, at 9:45 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Paul Ramsey, do you have opinions to share about this patch? I think
PostGIS might benefit from it. Thread starts here:

I like the idea a great deal, but actually PostGIS is probably neutral on it: we generally want to retrieve things off the front of our serializations (the metadata header) rather than search through them for things in the middle. So the improvements to Pg12 cover all of our use cases. Haven’t had time to do any performance checks on it yet.

ATB,

P.

#27Michael Paquier
michael@paquier.xyz
In reply to: Binguo Bao (#24)
Re: [proposal] de-TOAST'ing using a iterator

On Mon, Sep 23, 2019 at 09:55:24PM +0800, Binguo Bao wrote:

Personally, I prefer patch v10. Its performance is superior, although it
exposes some struct details.

Please be careful. The patch was waiting for author input, but its
latest status does not match what the CF app was saying. I have moved
this patch to next CF, with "Needs review" as status.
--
Michael

#28John Naylor
john.naylor@2ndquadrant.com
In reply to: Binguo Bao (#24)
Re: [proposal] de-TOAST'ing using a iterator

On Mon, Sep 23, 2019 at 9:55 PM Binguo Bao <djydewang@gmail.com> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> 于2019年9月17日周二 上午5:51写道:

In broad terms this patch looks pretty good to me. I only have a small
quibble with this API definition in fmgr.h -- namely that it forces us
to export the definition of all the structs (that could otherwise be
private to toast_internals.h) in order to satisfy callers of this macro.
I am wondering if it would be possible to change detoast_iterate and
PG_DETOAST_ITERATE in a way that those details remain hidden -- I am
thinking something like "if this returns NULL, then iteration has
finished"; and relieve the macro from doing the "->buf->buf" and
"->buf->limit" checks. I think changing that would require a change in
how the rest of the code is structured around this (the textpos internal
function), but seems like it would be better overall.

(AFAICS that would enable us to expose much less about the
iterator-related structs to detoast.h -- you should be able to move the
struct defs to toast_internals.h)

Then again, it might be just wishful thinking, but it seems worth
considering at least.

I've tied to hide the details of the struct in patch v11 with checking "need" pointer
inside detoast_iterate function.

I took a brief look at v11 to see if there's anything I can do to help
it move forward. I'm not yet sure how it would look code-wise to
implement Alvaro and Tomas's comments upthread, but I'm pretty sure
this part means the iterator-related structs are just as exposed as
before, but in a roundabout way that completely defeats the purpose of
hiding internals:

--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -11,6 +11,7 @@
  */
 #ifndef DETOAST_H
 #define DETOAST_H
+#include "toast_internals.h"

That said, the idea behind the PG_DETOAST_ITERATE macro was my
suggestion so that text_position_next_internal() didn't have to call
the iterator function every time the needle advances, which caused a
noticeable performance penalty. The toast code has moved around quite
a bit since then, and I'm not sure of the best way forward.

Also, c60e520f6e0 changed the standard pglz decompression algorithm.
It might be worth it to see if those changes are applicable to the
iterator case. At least one of the improved comments could be brought
over.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: John Naylor (#28)
Re: [proposal] de-TOAST'ing using a iterator

On 2020-Jan-12, John Naylor wrote:

I took a brief look at v11 to see if there's anything I can do to help
it move forward. I'm not yet sure how it would look code-wise to
implement Alvaro and Tomas's comments upthread, but I'm pretty sure
this part means the iterator-related structs are just as exposed as
before, but in a roundabout way that completely defeats the purpose of
hiding internals:

Agreed -- I think this patch still needs more work before being
committable; I agree with John that the changes after v10 made it worse,
not better. Rather than cross-including header files, it seems better
to expose some struct definitions after all and let the main iterator
interface (detoast_iterate) be a "static inline" function in detoast.h.

So let's move forward with v10 (submitted on Sept 10th).

Looking at that version, I don't think the function protos that were put
in toast_internals.h should be there at all; I think they should be in
detoast.h so that they can be used. But I don't like the fact that
detoast.h now has to include genam.h; that seems pretty bogus. I think
this can be fixed by moving the FetchDatumIteratorData struct definition
(but not its typedef) to toast_internals.h.

OTOH we've recently seen the TOAST interface (and header files) heavily
reworked because of table-AM considerations, so probably this needs even
more changes to avoid parts of it becoming heapam-dependant again.

create_toast_buffer() doing two pallocs seems a waste.  It could be a
single one,
+   buf = (ToastBuffer *) palloc0(MAXALIGN(sizeof(ToastBuffer)) + size);
+   buf->buf = buf + MAXALIGN(sizeof(ToastBuffer));
(I'm not sure that the MAXALIGNs are strictly necessary there; I think
we access the buf as four-byte aligned stuff somewhere in the toast
innards, but maybe I'm wrong about that.)

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30John Naylor
john.naylor@2ndquadrant.com
In reply to: Alvaro Herrera (#29)
1 attachment(s)
Re: [proposal] de-TOAST'ing using a iterator

On Fri, Mar 13, 2020 at 10:19 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

So let's move forward with v10 (submitted on Sept 10th).

In the attached v12, based on v10, I've made some progress to address
some of the remaining issues. There's still some work to be done, in
particular to think about how to hide the struct details better, as
mentioned by you and Tomas back in September, but wanted to put this
much out there to keep things moving.

Rather than cross-including header files, it seems better
to expose some struct definitions after all and let the main iterator
interface (detoast_iterate) be a "static inline" function in detoast.h.

The cross-include is gone, and detoast_iterate is now static inline.

Looking at that version, I don't think the function protos that were put
in toast_internals.h should be there at all; I think they should be in
detoast.h so that they can be used.

Done.

But I don't like the fact that
detoast.h now has to include genam.h; that seems pretty bogus. I think
this can be fixed by moving the FetchDatumIteratorData struct definition
(but not its typedef) to toast_internals.h.

I took a stab at this, but I ended up playing whack-a-mole with
compiler warnings. I'll have to step back and try again later.

OTOH we've recently seen the TOAST interface (and header files) heavily
reworked because of table-AM considerations, so probably this needs even
more changes to avoid parts of it becoming heapam-dependant again.

Haven't thought about this.

create_toast_buffer() doing two pallocs seems a waste.  It could be a
single one,
+   buf = (ToastBuffer *) palloc0(MAXALIGN(sizeof(ToastBuffer)) + size);
+   buf->buf = buf + MAXALIGN(sizeof(ToastBuffer));
(I'm not sure that the MAXALIGNs are strictly necessary there; I think
we access the buf as four-byte aligned stuff somewhere in the toast
innards, but maybe I'm wrong about that.)

I tried this briefly and got backend crashes, and didn't try to analyze further.

In addition, I brought in the memcpy() and comment changes in
c60e520f6e from common/pg_lzcompress.c to pglz_decompress_iterate(). I
also made a typo correction in the former, which could be extracted
into a separate patch if this one is not ready in time.

For this comment back in [1]/messages/by-id/20190903201226.GA16197@alvherre.pgsql:

This arrangement where there are two ToastBuffers and they sometimes are
the same is cute, but I think we need a better way to know when each
needs to be freed afterwards; the proposed coding is confusing. And it
certainly it needs more than zero comments about what's going on there.

One idea is to test if the pointers are equal via a macro, rather than
setting and testing a member bool var. And I agree commentary could be
improved in this area.

[1]: /messages/by-id/20190903201226.GA16197@alvherre.pgsql

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v12-detoasting-using-an-iterator.patchapplication/octet-stream; name=v12-detoasting-using-an-iterator.patchDownload
 src/backend/access/common/detoast.c         |  78 ++++++
 src/backend/access/common/toast_internals.c | 362 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c             |  29 ++-
 src/common/pg_lzcompress.c                  |   2 +-
 src/include/access/detoast.h                | 119 +++++++++
 src/include/fmgr.h                          |  13 +
 6 files changed, 596 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 496240c755..bed34bfc1e 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -312,6 +312,84 @@ detoast_attr_slice(struct varlena *attr,
 	return result;
 }
 
+/* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+DetoastIterator
+create_detoast_iterator(struct varlena *attr)
+{
+	struct varatt_external toast_pointer;
+	DetoastIterator iter;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		iter = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iter->done = false;
+
+		/* This is an externally stored datum --- initialize fetch datum iterator */
+		iter->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			iter->compressed = true;
+
+			/* prepare buffer to received decompressed data */
+			iter->buf = create_toast_buffer(toast_pointer.va_rawsize, false);
+
+			/* initialize state for pglz_decompress_iterate() */
+			iter->ctrl = 0;
+			iter->ctrlc = INVALID_CTRLC;
+		}
+		else
+		{
+			iter->compressed = false;
+
+			/* point the buffer directly at the raw data */
+			iter->buf = iter->fetch_datum_iterator->buf;
+		}
+		return iter;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/* indirect pointer --- dereference it */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return create_detoast_iterator(attr);
+
+	}
+	else
+		/* in-line value -- no iteration used, even if it's compressed */
+		return NULL;
+}
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+void
+free_detoast_iterator(DetoastIterator iter)
+{
+	if (iter == NULL)
+		return;
+	if (iter->compressed)
+		free_toast_buffer(iter->buf);
+	free_fetch_datum_iterator(iter->fetch_datum_iterator);
+	pfree(iter);
+}
+
 /* ----------
  * toast_fetch_datum -
  *
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 25a81e5ec6..c17e5cc4fe 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -630,3 +630,365 @@ init_toast_snapshot(Snapshot toast_snapshot)
 
 	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
+
+/* ----------
+ * create_fetch_datum_iterator -
+ *
+ * Initialize fetch datum iterator.
+ * ----------
+ */
+FetchDatumIterator
+create_fetch_datum_iterator(struct varlena *attr)
+{
+	int			validIndex;
+	FetchDatumIterator iter;
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		elog(ERROR, "create_fetch_datum_iterator shouldn't be called for non-ondisk datums");
+
+	iter = (FetchDatumIterator) palloc0(sizeof(FetchDatumIteratorData));
+
+	/* Must copy to access aligned fields */
+	VARATT_EXTERNAL_GET_POINTER(iter->toast_pointer, attr);
+
+	iter->ressize = iter->toast_pointer.va_extsize;
+	iter->numchunks = ((iter->ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+	/*
+	 * Open the toast relation and its indexes
+	 */
+	iter->toastrel = table_open(iter->toast_pointer.va_toastrelid, AccessShareLock);
+
+	/* Look for the valid index of the toast relation */
+	validIndex = toast_open_indexes(iter->toastrel,
+									AccessShareLock,
+									&iter->toastidxs,
+									&iter->num_indexes);
+
+	/*
+	 * Setup a scan key to fetch from the index by va_valueid
+	 */
+	ScanKeyInit(&iter->toastkey,
+				(AttrNumber) 1,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(iter->toast_pointer.va_valueid));
+
+	/*
+	 * Read the chunks by index
+	 *
+	 * Note that because the index is actually on (valueid, chunkidx) we will
+	 * see the chunks in chunkidx order, even though we didn't explicitly ask
+	 * for it.
+	 */
+
+	init_toast_snapshot(&iter->snapshot);
+	iter->toastscan = systable_beginscan_ordered(iter->toastrel, iter->toastidxs[validIndex],
+												 &iter->snapshot, 1, &iter->toastkey);
+
+	iter->buf = create_toast_buffer(iter->ressize + VARHDRSZ,
+		VARATT_EXTERNAL_IS_COMPRESSED(iter->toast_pointer));
+
+	iter->nextidx = 0;
+	iter->done = false;
+
+	return iter;
+}
+
+void
+free_fetch_datum_iterator(FetchDatumIterator iter)
+{
+	if (iter == NULL)
+		return;
+
+	if (!iter->done)
+	{
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+	}
+	free_toast_buffer(iter->buf);
+	pfree(iter);
+}
+
+/* ----------
+ * fetch_datum_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another chunk data in external storage,
+ * fetch it into iterator's toast buffer.
+ * ----------
+ */
+void
+fetch_datum_iterate(FetchDatumIterator iter)
+{
+	HeapTuple	ttup;
+	TupleDesc	toasttupDesc;
+	int32		residx;
+	Pointer		chunk;
+	bool		isnull;
+	char		*chunkdata;
+	int32		chunksize;
+
+	Assert(iter != NULL && !iter->done);
+
+	ttup = systable_getnext_ordered(iter->toastscan, ForwardScanDirection);
+	if (ttup == NULL)
+	{
+		/*
+		 * Final checks that we successfully fetched the datum
+		 */
+		if (iter->nextidx != iter->numchunks)
+			elog(ERROR, "missing chunk number %d for toast value %u in %s",
+				 iter->nextidx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+
+		/*
+		 * End scan and close relations
+		 */
+		systable_endscan_ordered(iter->toastscan);
+		toast_close_indexes(iter->toastidxs, iter->num_indexes, AccessShareLock);
+		table_close(iter->toastrel, AccessShareLock);
+
+		iter->done = true;
+		return;
+	}
+
+	/*
+	 * Have a chunk, extract the sequence number and the data
+	 */
+	toasttupDesc = iter->toastrel->rd_att;
+	residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+	Assert(!isnull);
+	chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+	Assert(!isnull);
+	if (!VARATT_IS_EXTENDED(chunk))
+	{
+		chunksize = VARSIZE(chunk) - VARHDRSZ;
+		chunkdata = VARDATA(chunk);
+	}
+	else if (VARATT_IS_SHORT(chunk))
+	{
+		/* could happen due to heap_form_tuple doing its thing */
+		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		chunkdata = VARDATA_SHORT(chunk);
+	}
+	else
+	{
+		/* should never happen */
+		elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+		chunksize = 0;		/* keep compiler quiet */
+		chunkdata = NULL;
+	}
+
+	/*
+	 * Some checks on the data we've found
+	 */
+	if (residx != iter->nextidx)
+		elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
+			 residx, iter->nextidx,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+	if (residx < iter->numchunks - 1)
+	{
+		if (chunksize != TOAST_MAX_CHUNK_SIZE)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+				 chunksize, (int) TOAST_MAX_CHUNK_SIZE,
+				 residx, iter->numchunks,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else if (residx == iter->numchunks - 1)
+	{
+		if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != iter->ressize)
+			elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+				 chunksize,
+				 (int) (iter->ressize - residx * TOAST_MAX_CHUNK_SIZE),
+				 residx,
+				 iter->toast_pointer.va_valueid,
+				 RelationGetRelationName(iter->toastrel));
+	}
+	else
+		elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
+			 residx,
+			 0, iter->numchunks - 1,
+			 iter->toast_pointer.va_valueid,
+			 RelationGetRelationName(iter->toastrel));
+
+	/*
+	 * Copy the data into proper place in our iterator buffer
+	 */
+	memcpy(iter->buf->limit, chunkdata, chunksize);
+	iter->buf->limit += chunksize;
+
+	iter->nextidx++;
+}
+
+/* ----------
+ * create_toast_buffer -
+ *
+ * Create and initialize a TOAST buffer.
+ *
+ * size: buffer size include header
+ * compressed: whether TOAST value is compressed
+ * ----------
+ */
+ToastBuffer *
+create_toast_buffer(int32 size, bool compressed)
+{
+	ToastBuffer *buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+	buf->buf = (const char *) palloc0(size);
+	if (compressed) {
+		SET_VARSIZE_COMPRESSED(buf->buf, size);
+		/*
+		 * Note the constraint buf->position <= buf->limit may be broken
+		 * at initialization. Make sure that the constraint is satisfied
+		 * when consuming chars.
+		 */
+		buf->position = VARDATA_4B_C(buf->buf);
+	}
+	else
+	{
+		SET_VARSIZE(buf->buf, size);
+		buf->position = VARDATA_4B(buf->buf);
+	}
+	buf->limit = VARDATA(buf->buf);
+	buf->capacity = buf->buf + size;
+
+	return buf;
+}
+
+void
+free_toast_buffer(ToastBuffer *buf)
+{
+	if (buf == NULL)
+		return;
+
+	pfree((void *)buf->buf);
+	pfree(buf);
+}
+
+/* ----------
+ * pglz_decompress_iterate -
+ *
+ * This function is based on pglz_decompress(), with these additional
+ * requirements:
+ *
+ * 1. We need to save the current control byte and byte position for the
+ * caller's next iteration.
+ *
+ * 2. In pglz_decompress(), we can assume we have all the source bytes
+ * available. This is not the case when we decompress one chunk at a
+ * time, so we have to make sure that we only read bytes available in the
+ * current chunk.
+ * ----------
+ */
+void
+pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest, DetoastIterator iter)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+
+	/*
+	 * In the while loop, sp may be incremented such that it points beyond
+	 * srcend. To guard against reading beyond the end of the current chunk,
+	 * we set srcend such that we exit the loop when we are within four bytes
+	 * of the end of the current chunk. When source->limit reaches
+	 * source->capacity, we are decompressing the last chunk, so we can (and
+	 * need to) read every byte.
+	 */
+	srcend = (const unsigned char *)
+		(source->limit == source->capacity ? source->limit : (source->limit - 4));
+	sp = (const unsigned char *) source->position;
+	dp = (unsigned char *) dest->limit;
+	destend = (unsigned char *) dest->capacity;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl;
+		int			ctrlc;
+
+		if (iter->ctrlc != INVALID_CTRLC)
+		{
+			ctrl = iter->ctrl;
+			ctrlc = iter->ctrlc;
+		}
+		else
+		{
+			ctrl = *sp++;
+			ctrlc = 0;
+		}
+
+
+		for (; ctrlc < INVALID_CTRLC && sp < srcend && dp < destend; ctrlc++)
+		{
+
+			if (ctrl & 1)
+			{
+				/*
+				 * Set control bit means we must read a match tag. The match
+				 * is coded with two bytes. First byte uses lower nibble to
+				 * code length - 3. Higher nibble contains upper 4 bits of the
+				 * offset. The next following byte contains the lower 8 bits
+				 * of the offset. If the length is coded as 18, another
+				 * extension tag byte tells how much longer the match really
+				 * was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * OUTPUT (copy len bytes from dp - off to dp). The copied
+				 * areas could overlap; to prevent possible uncertainty, we
+				 * copy only non-overlapping regions.
+				 */
+				len = Min(len, destend - dp);
+				while (off < len)
+				{
+					/* see comments in common/pg_lzcompress.c */
+					memcpy(dp, dp - off, off);
+					len -= off;
+					dp += off;
+					off += off;
+				}
+				memcpy(dp, dp - off, len);
+				dp += len;
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+
+		iter->ctrlc = ctrlc;
+		iter->ctrl = ctrl;
+	}
+
+	source->position = (char *) sp;
+	dest->limit = (char *) dp;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 907b5ab7b0..7831d84ee7 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -56,6 +56,8 @@ typedef struct
 	int			len1;			/* string lengths in bytes */
 	int			len2;
 
+	DetoastIterator iter;
+
 	/* Skip table for Boyer-Moore-Horspool search algorithm: */
 	int			skiptablemask;	/* mask for ANDing with skiptable subscripts */
 	int			skiptable[256]; /* skip distance for given mismatched char */
@@ -122,7 +124,7 @@ static text *text_substring(Datum str,
 							int32 length,
 							bool length_not_specified);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
-static int	text_position(text *t1, text *t2, Oid collid);
+static int	text_position(text *t1, text *t2, Oid collid, DetoastIterator iter);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
 static bool text_position_next(TextPositionState *state);
 static char *text_position_next_internal(char *start_ptr, TextPositionState *state);
@@ -1092,10 +1094,18 @@ text_overlay(text *t1, text *t2, int sp, int sl)
 Datum
 textpos(PG_FUNCTION_ARGS)
 {
-	text	   *str = PG_GETARG_TEXT_PP(0);
+	text		*str;
+	struct varlena *attr = (struct varlena *)
+								DatumGetPointer(PG_GETARG_DATUM(0));
 	text	   *search_str = PG_GETARG_TEXT_PP(1);
+	DetoastIterator iter = create_detoast_iterator(attr);
+
+	if (iter != NULL)
+		str = (text *) iter->buf->buf;
+	else
+		str = PG_GETARG_TEXT_PP(0);
 
-	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION()));
+	PG_RETURN_INT32((int32) text_position(str, search_str, PG_GET_COLLATION(), iter));
 }
 
 /*
@@ -1113,7 +1123,7 @@ textpos(PG_FUNCTION_ARGS)
  *	functions.
  */
 static int
-text_position(text *t1, text *t2, Oid collid)
+text_position(text *t1, text *t2, Oid collid, DetoastIterator iter)
 {
 	TextPositionState state;
 	int			result;
@@ -1127,6 +1137,7 @@ text_position(text *t1, text *t2, Oid collid)
 		return 0;
 
 	text_position_setup(t1, t2, collid, &state);
+	state.iter = iter;
 	if (!text_position_next(&state))
 		result = 0;
 	else
@@ -1135,7 +1146,6 @@ text_position(text *t1, text *t2, Oid collid)
 	return result;
 }
 
-
 /*
  * text_position_setup, text_position_next, text_position_cleanup -
  *	Component steps of text_position()
@@ -1201,6 +1211,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	state->str2 = VARDATA_ANY(t2);
 	state->len1 = len1;
 	state->len2 = len2;
+	state->iter = NULL;
 	state->last_match = NULL;
 	state->refpoint = state->str1;
 	state->refpos = 0;
@@ -1366,6 +1377,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 		hptr = start_ptr;
 		while (hptr < haystack_end)
 		{
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			if (*hptr == nchar)
 				return (char *) hptr;
 			hptr++;
@@ -1383,6 +1397,9 @@ text_position_next_internal(char *start_ptr, TextPositionState *state)
 			const char *nptr;
 			const char *p;
 
+			if (state->iter != NULL)
+				PG_DETOAST_ITERATE(state->iter, hptr);
+
 			nptr = needle_last;
 			p = hptr;
 			while (*nptr == *p)
@@ -1446,7 +1463,7 @@ text_position_get_match_pos(TextPositionState *state)
 static void
 text_position_cleanup(TextPositionState *state)
 {
-	/* no cleanup needed */
+	free_detoast_iterator(state->iter);
 }
 
 static void
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 80c86d7924..cae49db459 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -734,7 +734,7 @@ pglz_decompress(const char *source, int32 slen, char *dest,
 				/*
 				 * Now we copy the bytes specified by the tag from OUTPUT to
 				 * OUTPUT (copy len bytes from dp - off to dp). The copied
-				 * areas could overlap, to preven possible uncertainty, we
+				 * areas could overlap; to prevent possible uncertainty, we
 				 * copy only non-overlapping regions.
 				 */
 				len = Min(len, destend - dp);
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 86bad7e78c..732ec97d80 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -73,6 +73,125 @@ extern struct varlena *detoast_attr_slice(struct varlena *attr,
 										  int32 sliceoffset,
 										  int32 slicelength);
 
+#ifndef FRONTEND
+#include "access/genam.h"
+
+/*
+ * TOAST buffer is a producer consumer buffer.
+ *
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    |  |  |  |  |  |  |  |  |  |  |  |  |  |
+ *    +--+--+--+--+--+--+--+--+--+--+--+--+--+
+ *    ^           ^           ^              ^
+ *   buf      position      limit         capacity
+ *
+ * buf: point to the start of buffer.
+ * position: point to the next char to be consumed.
+ * limit: point to the next char to be produced.
+ * capacity: point to the end of buffer.
+ *
+ * Constraints that need to be satisfied:
+ * buf <= position <= limit <= capacity
+ */
+typedef struct ToastBuffer
+{
+	const char	*buf;
+	const char	*position;
+	char		*limit;
+	const char	*capacity;
+} ToastBuffer;
+
+typedef struct FetchDatumIteratorData
+{
+	ToastBuffer	*buf;
+	Relation	toastrel;
+	Relation	*toastidxs;
+	SysScanDesc	toastscan;
+	ScanKeyData	toastkey;
+	SnapshotData			snapshot;
+	struct varatt_external	toast_pointer;
+	int32		ressize;
+	int32		nextidx;
+	int32		numchunks;
+	int			num_indexes;
+	bool		done;
+}				FetchDatumIteratorData;
+
+typedef struct FetchDatumIteratorData *FetchDatumIterator;
+
+/*
+ * If "ctrlc" field in iterator is equal to INVALID_CTRLC, it means that
+ * the field is invalid and need to read the control byte from the
+ * source buffer in the next iteration, see pglz_decompress_iterate().
+ */
+#define INVALID_CTRLC 8
+
+typedef struct DetoastIteratorData
+{
+	ToastBuffer 		*buf;
+	FetchDatumIterator	fetch_datum_iterator;
+	unsigned char		ctrl;
+	int					ctrlc;
+	bool				compressed;		/* toast value is compressed? */
+	bool				done;
+}			DetoastIteratorData;
+
+typedef struct DetoastIteratorData *DetoastIterator;
+
+extern FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+extern void free_fetch_datum_iterator(FetchDatumIterator iter);
+extern void fetch_datum_iterate(FetchDatumIterator iter);
+extern ToastBuffer *create_toast_buffer(int32 size, bool compressed);
+extern void free_toast_buffer(ToastBuffer *buf);
+extern void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+									DetoastIterator iter);
+
+/* ----------
+ * create_detoast_iterator -
+ *
+ * It only makes sense to initialize a de-TOAST iterator for external on-disk values.
+ *
+ * ----------
+ */
+extern DetoastIterator create_detoast_iterator(struct varlena *attr);
+
+/* ----------
+ * free_detoast_iterator -
+ *
+ * Free memory used by the de-TOAST iterator, including buffers and
+ * fetch datum iterator.
+ * ----------
+ */
+extern void free_detoast_iterator(DetoastIterator iter);
+
+/* ----------
+ * detoast_iterate -
+ *
+ * Iterate through the toasted value referenced by iterator.
+ *
+ * As long as there is another data chunk in external storage,
+ * de-TOAST it into iterator's toast buffer.
+ * ----------
+ */
+static inline void
+detoast_iterate(DetoastIterator detoast_iter)
+{
+	FetchDatumIterator fetch_iter = detoast_iter->fetch_datum_iterator;
+
+	Assert(detoast_iter != NULL && !detoast_iter->done);
+
+	fetch_datum_iterate(fetch_iter);
+
+	if (detoast_iter->compressed)
+		pglz_decompress_iterate(fetch_iter->buf, detoast_iter->buf, detoast_iter);
+
+	if (detoast_iter->buf->limit == detoast_iter->buf->capacity)
+		detoast_iter->done = true;
+}
+
+
+#endif
+
 /* ----------
  * toast_raw_datum_size -
  *
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 453df2220f..33d8ce786a 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -239,6 +239,19 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_DETOAST_DATUM_SLICE(datum,f,c) \
 		pg_detoast_datum_slice((struct varlena *) DatumGetPointer(datum), \
 		(int32) (f), (int32) (c))
+
+/*
+ * Support for de-TOASTing toasted value iteratively. "need" is a pointer
+ * between the beginning and end of iterator's ToastBuffer. The marco
+ * de-TOAST all bytes before "need" into iterator's ToastBuffer.
+ */
+#define PG_DETOAST_ITERATE(iter, need)											\
+	do {																		\
+		Assert((need) >= (iter)->buf->buf && (need) <= (iter)->buf->capacity);	\
+		while (!(iter)->done && (need) >= (iter)->buf->limit) { 				\
+			detoast_iterate(iter);												\
+		}																		\
+	} while (0)
 /* WARNING -- unaligned pointer */
 #define PG_DETOAST_DATUM_PACKED(datum) \
 	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))
#31Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: John Naylor (#30)
Re: [proposal] de-TOAST'ing using a iterator

Status update for a commitfest entry.

This entry was inactive for a very long time.
John, are you going to continue working on this?

The last message mentions some open issues, namely backend crashes, so I move it to "Waiting on author".

The new status of this patch is: Waiting on Author

#32Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Anastasia Lubennikova (#31)
Re: [proposal] de-TOAST'ing using a iterator

On 2020-Nov-02, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

This entry was inactive for a very long time.
John, are you going to continue working on this?

The last message mentions some open issues, namely backend crashes, so I move it to "Waiting on author".

As I understand, the patch he posted is fine -- it only crashes when he
tried a change I suggested. But (as is apparently common) I might be
suggesting the wrong thing. Since the cfbot says the patch still
applies and works, I suggest to keep it as needs-review.

#33John Naylor
john.naylor@enterprisedb.com
In reply to: Alvaro Herrera (#32)
Re: [proposal] de-TOAST'ing using a iterator

On Mon, Nov 2, 2020 at 1:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2020-Nov-02, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

This entry was inactive for a very long time.
John, are you going to continue working on this?

Not in the near future. For background, this was a 2019 GSoC project where
I was reviewer of record, and the patch is mostly good, but there is some
architectural awkwardness. I have tried to address that, but have not had
success.

The last message mentions some open issues, namely backend crashes, so I

move it to "Waiting on author".

As I understand, the patch he posted is fine -- it only crashes when he
tried a change I suggested.

That's my recollection as well.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: John Naylor (#33)
Re: [proposal] de-TOAST'ing using a iterator

On 02.11.2020 22:08, John Naylor wrote:

On Mon, Nov 2, 2020 at 1:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org
<mailto:alvherre@alvh.no-ip.org>> wrote:

On 2020-Nov-02, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

This entry was inactive for a very long time.
John, are you going to continue working on this?

Not in the near future. For background, this was a 2019 GSoC project
where I was reviewer of record, and the patch is mostly good, but
there is some architectural awkwardness. I have tried to address that,
but have not had success.

The commitfest is nearing the end and as this thread has stalled, I've
marked it Returned with Feedback. Feel free to open a new entry if you
return to this patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company