Sparse bit set data structure

Started by Heikki Linnakangasalmost 7 years ago15 messages

hlinnaka@iki.fi

almost 7 years ago

2 attachment(s)

Hi,

I was reviewing Andrey Borodin's patch for GiST VACUUM [1]/messages/by-id/A51F64E3-850D-4249-814E-54967103A859@yandex-team.ru, which
includes a new "block set" data structure, to track internal and empty
pages while vacuuming a GiST. The blockset data structure was a pretty
simple radix tree, that can hold a set of BlockNumbers.

The memory usage of the radix tree would probably be good enough in real
life, as we also discussed on the thread. Nevertheless, I was somewhat
bothered by it, so I did some measurements. I added some
MemoryContextStats() calls to Andrey's test_blockset module, to print
out memory usage.

For storing 5000000 random 32-bit integers, or a density of about 1% of
bits set, the blockset consumed about 380 MB of memory. I think that's a
pretty realistic ratio of internal pages : leaf pages on a GiST index,
so I would hope the blockset to be efficient in that ballpark. However,
380 MB / 5000000 is about 76 bytes, so it's consuming about 76 bytes per
stored block number. That's a lot of overhead! For comparison, a plain
BlockNumber is just 4 bytes. With more sparse sets, it is even more
wasteful, on a per-item basis, although the total memory usage will of
course be smaller. (To be clear, no one is pushing around GiST indexes
with anywhere near 2^32 blocks, or 32 TB, but the per-BlockNumber stats
are nevertheless interesting.)

I started to consider rewriting the data structure into something more
like B-tree. Then I remembered that I wrote a data structure pretty much
like that last year already! We discussed that on the "Vacuum: allow
usage of more than 1GB of work mem" thread [2]/messages/by-id/8e5cbf08-5dd8-466d-9271-562fc65f133f@iki.fi, to replace the current
huge array that holds the dead TIDs during vacuum.

So I dusted off that patch, and made it more general, so that it can be
used to store arbitrary 64-bit integers, rather than ItemPointers or
BlockNumbers. I then added a rudimentary form of compression to the leaf
pages, so that clusters of nearby values can be stored as an array of
32-bit integers, or as a bitmap. That would perhaps be overkill, if it
was just to conserve some memory in GiST vacuum, but I think this will
turn out to be a useful general-purpose facility.

I plugged this new "sparse bitset" implementation into the same
test_blockset test. The memory usage for 5000000 values is now just over
20 MB, or about 4.3 bytes per value. That's much more reasonable than
the 76 bytes.

I'll do some more performance testing on this, to make sure it performs
well enough on random lookups, to also replace VACUUM's dead item
pointer array. Assuming that works out, I plan to polish up and commit
this, and use it in the GiST vacuum. I'm also tempted to change VACUUM
to use this, since that should be pretty straightforward once we have
the data structure.

[1]: /messages/by-id/A51F64E3-850D-4249-814E-54967103A859@yandex-team.ru
/messages/by-id/A51F64E3-850D-4249-814E-54967103A859@yandex-team.ru

[2]: /messages/by-id/8e5cbf08-5dd8-466d-9271-562fc65f133f@iki.fi
/messages/by-id/8e5cbf08-5dd8-466d-9271-562fc65f133f@iki.fi

- Heikki

Attachments:

0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patchtext/x-patch; name=0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patchDownload

From 678f56caf1c6eceb5195c531190b573c180e5ef2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 13 Mar 2019 20:05:04 +0200
Subject: [PATCH 1/2] Add SparseBitset, to hold a large set of 64-bit ints
 efficiently.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.
---
 src/backend/lib/Makefile       |   2 +-
 src/backend/lib/README         |   2 +
 src/backend/lib/sparsebitset.c | 986 +++++++++++++++++++++++++++++++++
 src/include/lib/sparsebitset.h |  26 +
 4 files changed, 1015 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/sparsebitset.c
 create mode 100644 src/include/lib/sparsebitset.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..32ba388c12d 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+       ilist.o knapsack.o pairingheap.o rbtree.o sparsebitset.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f938795b54a 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -19,6 +19,8 @@ pairingheap.c - a pairing heap
 
 rbtree.c - a red-black tree
 
+sparsebitset.c - a sparse bitset of integers
+
 stringinfo.c - an extensible string type
 
 
diff --git a/src/backend/lib/sparsebitset.c b/src/backend/lib/sparsebitset.c
new file mode 100644
index 00000000000..6df900b9ebc
--- /dev/null
+++ b/src/backend/lib/sparsebitset.c
@@ -0,0 +1,986 @@
+/*-------------------------------------------------------------------------
+ *
+ * sparsebitset.c
+ *	  Data structure to hold a large set of 64-bit integers efficiently
+ *
+ * This is somewhat similar to a Bitmapset, except that this works on
+ * 64-bit integers, and is more memory-efficient when the bitmap is sparse.
+ * The supported operations are much more limited, though.
+ *
+ * The data structure is a B-tree, with special packed representation
+ * at the leaf level, for clusters of nearby values.
+ *
+ * Usage:
+ *
+ * 1. Create the sparse bitset with sbs_create()
+ *
+ * 2. Add as many values as you want, with sbs_add_member(). Values must be
+ *    added in order!
+ *
+ * 3. You can test for presence of a value with sbs_is_member()
+ *
+ * 4. You can iterate through all values with sbs_begin_iterate() /
+ *    sbs_iterate_next().
+ *
+ *
+ * Limitations:
+ *
+ * - Values must be added in order.  (Random insertions would require
+ *   splitting nodes, which could be done, but it would be a fair amount of
+ *   extra code.)
+ *
+ * - Values cannot be added while iteration is in progress.
+ *
+ * - No support for removing values.
+ *
+ * None of these limitations are fundamental to the data structure, and
+ * could be lifted if needed, by writing some new code.  But the current
+ * users of this facility don't need them.
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/sparsebitset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "lib/sparsebitset.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+/*
+ * Definitions for bit arrays used on bitmap-type nodes.
+ */
+#define SBS_BITS_PER_BITMAPWORD 32
+typedef uint32 sbsword;
+#define sbsword_rightmost_one_pos(w)	pg_rightmost_one_pos32(w)
+
+#define WORDNUM(x)	((x) / SBS_BITS_PER_BITMAPWORD)
+#define BITNUM(x)	((x) % SBS_BITS_PER_BITMAPWORD)
+
+/* Forward declarations, to break circular references between these structs */
+typedef struct sbs_node sbs_node;
+typedef struct sbs_leaf_node sbs_leaf_node;
+typedef struct sbs_internal_node sbs_internal_node;
+
+/*
+ * Parameters for the size and shape of the B-tree.
+ */
+
+/*
+ * Size of the array/bitmap on leaf nodes.  In bytes; the number of
+ * entries we can fit in that size depends on the format of the node.
+ *
+ * This is the size of the bulky array/bitmap, only. It does not include
+ * any other fields on each node.
+ */
+#define LEAF_NODE_SIZE		1024
+
+/*
+ * Number of children stored on internal nodes.
+ *
+ * An internal node has two arrays, with 64 elements each.  With
+ * sizeof(void *) == 8, they add up to 1024 bytes.
+ *
+ * Note that because nodes are just an in-memory data structure, there is
+ * no need internal and leaf nodes to be of the same size.
+ *
+ * If you change this, you must recalculate MAX_INTERVAL_LEVELS, too!
+ */
+#define MAX_INTERNAL_ITEMS	64
+
+/*
+ * Maximum height of the tree.
+ *
+ * MAX_INTERNAL_ITEMS determines the "fan-out" of the B-tree.  The theoretical
+ * maximum number of items that we can store in a bitset is 2^64.
+ * MAX_TREE_LEVELS should be set so that:
+ *
+ *   MAX_INTERNAL_ITEMS ^ MAX_INTERNAL_LEVELS >= 2^64.
+ *
+ * In practice, we'll need fewer levels, because leaf nodes will be packed
+ * very densely, as the bitset becomes more full.  But there's no harm in
+ * overshooting this.
+ */
+#define MAX_TREE_LEVELS		11
+
+
+/*
+ * SparseBitset is the main object representing the bitset.
+ *
+ * All the values are stored in an in-memory B-tree structure.  In addition
+ * to the B-tree, SparseBitset tracks information about memory usage, and the
+ * current position, when iterating the tree with sbm_begin_iterate /
+ * sbm_iterate_next.
+ */
+struct SparseBitset
+{
+	/*
+	 * 'context' is a dedicated memory context, used to hold the SparseBitset
+	 * struct itself, as well as all the tree nodes.
+	 *
+	 * 'mem_used' tracks the amount of memory used.  We don't do anything with
+	 * it in the sparse bitset code itself, but the callers can ask for it
+	 * with sbs_memory_usage().
+	 */
+	MemoryContext context;		/* memory context holding everything */
+	uint64		mem_used;		/* amount of memory used */
+
+	/*
+	 * 'root' points to the root node of the tree (or the only leaf node, if
+	 * num_levels == 1).  'leftmost_leaf' points to the leftmost leaf node.
+	 */
+	sbs_node   *root;
+	sbs_leaf_node *leftmost_leaf;
+	int			num_levels;		/* height of the tree */
+
+	uint64		num_entries;	/* total # of values in the tree */
+
+	/*
+	 * Pointer to the rightmost leaf node, and its parent, grandparent etc.
+	 * all the way up to the root.  These are needed when adding new values.
+	 * (Currently, we require that new values are added at the end.)
+	 */
+	sbs_leaf_node *rightmost_leaf;
+	sbs_internal_node *rightmost_parents[MAX_TREE_LEVELS - 1];
+	uint64		last_item;		/* highest value stored in this bitset */
+
+	/*
+	 * Iterator support.
+	 */
+	sbs_leaf_node *iter_node;
+	int			iter_itemno;
+	int			iter_wordno;
+	int			iter_bitno;
+};
+
+
+/*
+ * Node structures, for the in-memory B-tree.
+ *
+ * An internal node holds a number of downlink pointers, to leaf nodes, or
+ * nodes to internal nodes on lower level.  For each downlink, the key value
+ * corresponding lower level node is stored.  The stored key values are
+ * low keys.  If the downlink has value X, then all items stored on that child
+ * are >= X.
+ *
+ * A leaf node stores a number of values. There are three different formats,
+ * with varying levels of "compression".  Which one is used depending on the
+ * values stored.
+ *
+ * 1. A sorted array of uint64 values.
+ *
+ * 2. A sorted array of uint32 values, with a constant offset or 'prefix'
+ *    that's added to each value.  This representation takes half the space as
+ *    the uint64 array - or to put another way, it can store twice as many
+ *    items on a node.  But it has the limitation that the difference between
+ *    the lowest and highest entries on the node must be less than 2^32.  The
+ *    lowest entry is used as the 'prefix', and the highest entry must be
+ *    representable with prefix + uint32.
+ *
+ * 3. A bitmap.  On a bitmap node, values are not stored as integers, but as
+ *    a prefix, plus a bitmap indicating which values are present. This is much
+ *    more compact, when a lot of close-by values are set.
+ *
+ * Each leaf node also has a pointer to the next leaf node, so that the leaf
+ * nodes can be easily walked from beginning to end, when iterating.
+ */
+
+/*
+ * Common structure of both leaf and internal nodes.
+ */
+typedef enum
+{
+	SBS_INTERNAL,
+	SBS_LEAF_U64,
+	SBS_LEAF_U32,
+	SBS_LEAF_BITMAP
+} sbs_node_type;
+
+struct sbs_node
+{
+	sbs_node_type type;
+	uint16		num_items;
+};
+
+/*
+ * Internal node
+ */
+struct sbs_internal_node
+{
+	/* common header, must match sbs_node */
+	sbs_node_type type;
+	uint16		num_items;
+
+	uint16		level;			/* always >= 1 */
+
+	/*
+	 * 'items' is an array of key values, and 'downlinks' are pointers
+	 * to lower-level nodes, corresponding to the key values.
+	 */
+	uint64		items[MAX_INTERNAL_ITEMS];
+	sbs_node   *downlinks[MAX_INTERNAL_ITEMS];
+};
+
+/*
+ * Leaf node
+ */
+struct sbs_leaf_node
+{
+	/* common header, must match sbs_node */
+	sbs_node_type type;
+	uint16		num_items;
+
+	/* constant to be added to values, in U32 or BITMAP node */
+	uint64		prefix;
+
+	/*
+	 * Space used to hold the items. The format depends on the node type.
+	 *
+	 * All of these arrays should have the same size, in bytes. Otherwise,
+	 * we would waste memory.
+	 */
+#define MAX_U64_ITEMS		(LEAF_NODE_SIZE / sizeof(uint64))
+#define MAX_U32_ITEMS		(LEAF_NODE_SIZE / sizeof(uint32))
+#define MAX_BITMAP_WORDS	(LEAF_NODE_SIZE / sizeof(sbsword))
+	union
+	{
+		uint64		u64[MAX_U64_ITEMS];
+		uint32		u32[MAX_U32_ITEMS];
+		sbsword		words[MAX_BITMAP_WORDS];
+	} items;
+
+	sbs_leaf_node *next;		/* right sibling, if any */
+};
+
+/*
+ * On a U32-type leaf node, the difference between the smallest and largest
+ * value on the node can be at most 2^32 - 1.  Otherwise, the largest value
+ * could not be represented as a prefix + 32-bit offset.
+ *
+ * Likewise, on a bitmap node, the size of the bitmap is fixed, which dictates
+ * the highest value that can be stored in the bitmap, given a particular
+ * prefix.
+ */
+#define MAX_U32_DISTANCE	PG_UINT32_MAX
+#define MAX_BITMAP_DISTANCE	(MAX_BITMAP_WORDS * SBS_BITS_PER_BITMAPWORD - 1)
+
+
+/*
+ * prototypes for internal functions.
+ */
+static sbs_leaf_node *sbs_new_leaf(SparseBitset *sbs, uint64 newitem);
+static sbs_internal_node *sbs_new_internal(SparseBitset *sbs, int level,
+				 uint64 newitem, sbs_node *downlink);
+static bool sbs_repack_node(sbs_leaf_node *leaf);
+static void sbs_update_upper(SparseBitset *sbs, int level,
+				 void *new_node, uint64 new_node_item);
+
+static inline int sbs_binsrch_u64(uint64 value, uint64 *arr, int arr_elems,
+				bool nextkey);
+static inline int sbs_binsrch_u32(uint32 value, uint32 *arr, int arr_elems,
+				bool nextkey);
+static uint64 sbs_node_get_first_item(sbs_node *node);
+
+
+/*
+ * Create a new, initially empty, bit set.
+ */
+SparseBitset *
+sbs_create(void)
+{
+	MemoryContext context;
+	SparseBitset *sbs;
+
+	/*
+	 * Create a new memory context to hold everything.
+	 *
+	 * We never free any nodes, so the generational allocator works
+	 * well for us.
+	 *
+	 * Use a large block size, in the hopes that if we use a lot of memory,
+	 * the libc allocator will give it back to the OS when we free it, rather
+	 * than add it to a free-list. (On glibc, see M_MMAP_THRESHOLD.  As of this
+	 * writing, the effective threshold is somewhere between 128 kB and 4 MB.)
+	 */
+	context = GenerationContextCreate(CurrentMemoryContext,
+									  "Sparse bitmap",
+									  SLAB_LARGE_BLOCK_SIZE);
+
+	/* Allocate the SparseBitset object itself, in the context. */
+	sbs = (SparseBitset *) MemoryContextAlloc(context, sizeof(SparseBitset));
+	sbs->context = context;
+	sbs->mem_used = GetMemoryChunkSpace(sbs);
+
+	sbs->root = NULL;
+	sbs->leftmost_leaf = NULL;
+	sbs->num_levels = 0;
+	sbs->num_entries = 0;
+
+	sbs->rightmost_leaf = NULL;
+	memset(sbs->rightmost_parents, 0, sizeof(sbs->rightmost_parents));
+	sbs->last_item = 0;
+
+	sbs->iter_node = NULL;
+	sbs->iter_itemno = 0;
+	sbs->iter_wordno = 0;
+	sbs->iter_bitno = 0;
+
+	return sbs;
+}
+
+/*
+ * Allocate a new leaf node.
+ *
+ * The new node is installed as the right sibling of the current rightmost
+ * leaf.  (Or as 'first_child', if this is the first node in the tree.)
+ */
+static sbs_leaf_node *
+sbs_new_leaf(SparseBitset *sbs, uint64 newitem)
+{
+	sbs_leaf_node *n;
+
+	n = (sbs_leaf_node *)
+		MemoryContextAlloc(sbs->context, sizeof(sbs_leaf_node));
+	sbs->mem_used += GetMemoryChunkSpace(n);
+
+	n->type = SBS_LEAF_U64;
+	n->prefix = 0;
+	n->next = NULL;
+
+	/* Store the item on it. */
+	n->items.u64[0] = newitem;
+	n->num_items = 1;
+
+	/* Link it to the previous rightmost node. */
+	if (sbs->rightmost_leaf)
+		sbs->rightmost_leaf->next = n;
+	else
+	{
+		/* creating the very first node in the tree. */
+		Assert(sbs->leftmost_leaf == NULL);
+		Assert(sbs->num_entries == 0);
+		sbs->leftmost_leaf = n;
+	}
+	sbs->rightmost_leaf = n;
+
+	return n;
+}
+
+/*
+ * Allocate a new internal node.
+ *
+ * The 'rightmost_parents' is updated with the new node, but inserting the
+ * downlink to the parent is left to the caller.
+ */
+static sbs_internal_node *
+sbs_new_internal(SparseBitset *sbs, int level, uint64 newitem,
+				 sbs_node *downlink)
+{
+	sbs_internal_node *n;
+
+	n = (sbs_internal_node *)
+		MemoryContextAlloc(sbs->context, sizeof(sbs_internal_node));
+	sbs->mem_used += GetMemoryChunkSpace(n);
+
+	n->type = SBS_INTERNAL;
+	n->level = level;
+
+	/* Store the item on it. */
+	n->items[0] = newitem;
+	n->downlinks[0] = downlink;
+	n->num_items = 1;
+
+	sbs->rightmost_parents[level] = n;
+
+	return n;
+}
+
+/*
+ * Free a sparse bitmap.
+ */
+void
+sbs_free(SparseBitset *sbs)
+{
+	/* everything is allocated in the memory context */
+	MemoryContextDelete(sbs->context);
+}
+
+/*
+ * Return the number of entries in the bitset.
+ */
+uint64
+sbs_num_entries(SparseBitset *sbs)
+{
+	return sbs->num_entries;
+}
+
+/*
+ * Return the amount of memory used by the bitset.
+ */
+uint64
+sbs_memory_usage(SparseBitset *sbs)
+{
+	return sbs->mem_used;
+}
+
+/*
+ * Add a value in the bitmap.
+ *
+ * Values must be added in order.
+ */
+void
+sbs_add_member(SparseBitset *sbs, uint64 newitem)
+{
+	sbs_leaf_node *leaf;
+
+	if (sbs->iter_node)
+		elog(ERROR, "cannot add new values to sparse bitset when iteration is in progress");
+
+	if (sbs->num_entries == 0)
+	{
+		/*
+		 * This is the very first item in the bitmap.
+		 *
+		 * Allocate root node. It's also a leaf.
+		 */
+		leaf = sbs_new_leaf(sbs, newitem);
+		sbs->root = (sbs_node *) leaf;
+		sbs->num_levels = 1;
+	}
+	else
+	{
+		/* The new value must be larger than the last one added */
+		if (newitem <= sbs->last_item)
+			elog(ERROR, "cannot insert to sparse bitset out of order");
+
+		/*
+		 * The new value will go to the rightmost leaf node.  Or to a new leaf
+		 * node, if the currently rightmost leaf is full.
+		 */
+		leaf = sbs->rightmost_leaf;
+
+retry:
+		switch (leaf->type)
+		{
+			case SBS_LEAF_U64:
+				/*
+				 * On a U64 node, add the new entry to the end of the array,
+				 * if there's room.  If the array is full, try to repack the
+				 * node to a more dense format, to make room.
+				 */
+				if (leaf->num_items < MAX_U64_ITEMS)
+				{
+					/* There is room, add the item to the node. */
+					leaf->items.u64[leaf->num_items] = newitem;
+					leaf->num_items++;
+				}
+				else
+				{
+					if (sbs_repack_node(leaf))
+						goto retry;
+
+					leaf = sbs_new_leaf(sbs, newitem);
+					sbs_update_upper(sbs, 1, leaf, newitem);
+				}
+				break;
+
+			case SBS_LEAF_U32:
+				/*
+				 * The logic for a U32 node is the same as for U64, but we have
+				 * to also check that the new item is not too far from the
+				 * lowest value on the node, the 'prefix'.  Otherwise, it
+				 * cannot be represented as uint32 difference from the prefix.
+				 */
+				if (leaf->num_items < MAX_U32_ITEMS &&
+					newitem - leaf->prefix <= MAX_U32_DISTANCE)
+				{
+					leaf->items.u32[leaf->num_items] = newitem - leaf->prefix;
+					leaf->num_items++;
+				}
+				else
+				{
+					if (leaf->num_items >= MAX_U32_ITEMS)
+					{
+						if (sbs_repack_node(leaf))
+							goto retry;
+					}
+
+					leaf = sbs_new_leaf(sbs, newitem);
+					sbs_update_upper(sbs, 1, leaf, newitem);
+				}
+				break;
+
+			case SBS_LEAF_BITMAP:
+				/*
+				 * If the new item falls within the bitmap, set the
+				 * corresponding bit.  Otherwise, we have to allocate a new
+				 * node.
+				 */
+				if (newitem - leaf->prefix <= MAX_BITMAP_DISTANCE)
+				{
+					uint32		val = newitem - leaf->prefix;
+
+					leaf->items.words[WORDNUM(val)] |= (1 << BITNUM(val));
+					leaf->num_items++;
+				}
+				else
+				{
+					leaf = sbs_new_leaf(sbs, newitem);
+					sbs_update_upper(sbs, 1, leaf, newitem);
+				}
+				break;
+
+			default:
+				elog(ERROR, "unknown sparse bitset node type %u", leaf->type);
+		}
+	}
+
+	sbs->last_item = newitem;
+	sbs->num_entries++;
+}
+
+/*
+ * Rewrite a leaf node to a more densely packed fromat, if possible.
+ */
+static bool
+sbs_repack_node(sbs_leaf_node *leaf)
+{
+	if (leaf->num_items < 1)
+		return false;
+
+	if (leaf->type == SBS_LEAF_U64)
+	{
+		/*
+		 * Try to rewrite U64 node into a U32 node.
+		 */
+		uint64		lowest = leaf->items.u64[0];
+		uint64		highest = leaf->items.u64[leaf->num_items - 1];
+
+		if (highest - lowest > MAX_U32_DISTANCE)
+			return false;		/* cannot make this more dense */
+
+		for (int i = 0; i < leaf->num_items; i++)
+		{
+			uint64 x = leaf->items.u64[i];
+
+			leaf->items.u32[i] = x - lowest;
+		}
+
+		leaf->prefix = lowest;
+		leaf->type = SBS_LEAF_U32;
+		return true;
+	}
+	else if (leaf->type == SBS_LEAF_U32)
+	{
+		/*
+		 * Try to rewrite U32 node into a BITMAP node.
+		 */
+		uint32		tmpitems[LEAF_NODE_SIZE / sizeof(uint32)];
+
+		/* prefix is already set. */
+		if (leaf->items.u32[leaf->num_items - 1] > MAX_BITMAP_DISTANCE)
+			return false;		/* cannot make this more dense */
+
+		/* copy all the old items out of the way. */
+		memcpy(tmpitems, leaf->items.u32, LEAF_NODE_SIZE);
+
+		/* clear the bitmap */
+		memset(leaf->items.words, 0, LEAF_NODE_SIZE);
+
+		/* set all bits, representing the items. */
+		for (int i = 0; i < leaf->num_items; i++)
+		{
+			uint32		x = tmpitems[i];
+
+			leaf->items.words[WORDNUM(x)] |= (1 << BITNUM(x));
+		}
+
+		leaf->type = SBS_LEAF_BITMAP;
+		return true;
+	}
+	else if (leaf->type == SBS_LEAF_BITMAP)
+	{
+		/* already in the densest format */
+		return false;
+	}
+	else
+	{
+		elog(ERROR, "unknown sparse bitset leaf type %d", leaf->type);
+		return false;
+	}
+}
+
+/*
+ * Insert the downlink into parent node, after creating a new node.
+ *
+ * Recurses if the parent node is full, too.
+ */
+static void
+sbs_update_upper(SparseBitset *sbs, int level, void *new_node,
+				 uint64 new_node_item)
+{
+	sbs_internal_node *parent;
+
+	/*
+	 * Create a new root node, if necessary.
+	 */
+	if (level >= sbs->num_levels)
+	{
+		uint64		old_root_item;
+
+		/* MAX_TREE_LEVELS should be more than enough, this shouldn't happen */
+		if (sbs->num_levels == MAX_TREE_LEVELS)
+			elog(ERROR, "could not expand sparse bitset, maximum number of levels reached");
+		sbs->num_levels++;
+
+		/* Get the first item on the old root page, to be used as the downlink. */
+		old_root_item = sbs_node_get_first_item(sbs->root);
+
+		parent = sbs_new_internal(sbs, level, old_root_item, sbs->root);
+
+		sbs->root = (sbs_node *) parent;
+	}
+
+	parent = sbs->rightmost_parents[level];
+
+	/* Place the downlink on the parent page. */
+	if (parent->num_items < MAX_INTERNAL_ITEMS)
+	{
+		parent->items[parent->num_items] = new_node_item;
+		parent->downlinks[parent->num_items] = new_node;
+		parent->num_items++;
+	}
+	else
+	{
+		/*
+		 * Doesn't fit.  Allocate new parent, with the downlink, and recursively
+		 * insert the downlink to the new parent to the grandparent.
+		 */
+		parent = sbs_new_internal(sbs, level, new_node_item, new_node);
+		sbs_update_upper(sbs, level + 1, parent, new_node_item);
+	}
+}
+
+/*
+ * Does the bitmap contain the given value?
+ */
+bool
+sbs_is_member(SparseBitset *sbs, uint64 item)
+{
+	sbs_leaf_node *leaf;
+	sbs_internal_node *node;
+	int			level = sbs->num_levels - 1;
+	int			itemno;
+
+	if (sbs->num_entries == 0)
+		return false;
+
+	/*
+	 * Start from the root, and walk down the B-tree to find the right leaf
+	 * node.
+	 */
+	node = (sbs_internal_node *) sbs->root;
+	while (level > 0)
+	{
+		itemno = sbs_binsrch_u64(item, node->items, node->num_items, true);
+
+		if (itemno == 0)
+			return false;
+		itemno--;
+
+		node = (sbs_internal_node *) node->downlinks[itemno];
+		level--;
+	}
+
+	leaf = (sbs_leaf_node *) node;
+
+	if (leaf->type == SBS_LEAF_U64)
+	{
+		/* Binary search in the leaf node */
+		itemno = sbs_binsrch_u64(item, leaf->items.u64, leaf->num_items, false);
+
+		if (itemno >= leaf->num_items)
+			return false;
+		else
+			return leaf->items.u64[itemno] == item;
+	}
+	else if (leaf->type == SBS_LEAF_U32)
+	{
+		uint32		val = item - leaf->prefix;
+
+		/* Binary search in the leaf node */
+		itemno = sbs_binsrch_u32(val, leaf->items.u32, leaf->num_items, false);
+
+		if (itemno >= leaf->num_items)
+			return false;
+		else
+			return leaf->items.u32[itemno] == val;
+	}
+	else if (leaf->type == SBS_LEAF_BITMAP)
+	{
+		uint32	val = item - leaf->prefix;
+
+		if (val > MAX_BITMAP_DISTANCE)
+			return false;
+		else
+			return (leaf->items.words[WORDNUM(val)] & (1 << BITNUM(val))) != 0;
+	}
+	else
+		elog(ERROR, "invalid sparse bitset leaf node type %d", leaf->type);
+}
+
+
+/*
+ * Begin in-order scan through all the items.
+ *
+ * While the iteration is in-progress, you cannot add new items to the bitset.
+ */
+void
+sbs_begin_iterate(SparseBitset *sbs)
+{
+	sbs->iter_node = sbs->leftmost_leaf;
+	sbs->iter_itemno = 0;
+	sbs->iter_wordno = 0;
+	sbs->iter_bitno = 0;
+}
+
+/*
+ * Returns the next item, when iterating.
+ *
+ * sbs_begin_iterate() must be called first.  sbs_iterate_next() returns
+ * the next value in the bit set.  If there are no more values, *found is set
+ * to false.
+ */
+uint64
+sbs_iterate_next(SparseBitset *sbs, bool *found)
+{
+	uint64		result;
+
+	if (!sbs->iter_node)
+	{
+		/* We are already at the end. */
+		*found = false;
+		return 0;
+	}
+
+	while (sbs->iter_node)
+	{
+		if (sbs->iter_node->type == SBS_LEAF_U64 ||
+			sbs->iter_node->type == SBS_LEAF_U32)
+		{
+			if (sbs->iter_itemno >= sbs->iter_node->num_items)
+			{
+				/* We have reached end of this node.  Step to the next one. */
+				sbs->iter_node = sbs->iter_node->next;
+				sbs->iter_itemno = 0;
+				sbs->iter_wordno = 0;
+				sbs->iter_bitno = 0;
+				continue;
+			}
+
+			/* Return the next item on the page. */
+			if (sbs->iter_node->type == SBS_LEAF_U64)
+				result = sbs->iter_node->items.u64[sbs->iter_itemno];
+			else
+				result = sbs->iter_node->prefix + sbs->iter_node->items.u32[sbs->iter_itemno];
+
+			sbs->iter_itemno++;
+			*found = true;
+			return result;
+		}
+		else if (sbs->iter_node->type == SBS_LEAF_BITMAP)
+		{
+			sbsword		mask;
+
+			/* At the beginning, ignore bits that we had returned already */
+			mask = (~(sbsword) 0) << sbs->iter_bitno;
+
+			/*
+			 * Scan the bitmap words for the next set bit.
+			 */
+			while (sbs->iter_wordno < LEAF_NODE_SIZE / sizeof(sbsword))
+			{
+				sbsword		w = sbs->iter_node->items.words[sbs->iter_wordno];
+
+				/* ignore bits before bitno */
+				w &= mask;
+
+				if (w != 0)
+				{
+					/*
+					 * Found a set bit!  Reconstruct the original value that
+					 * the bit represents.
+					 */
+					int			match_bitno = sbsword_rightmost_one_pos(w);
+
+					result = sbs->iter_node->prefix;
+					result += sbs->iter_wordno * SBS_BITS_PER_BITMAPWORD;
+					result += match_bitno;
+
+					/* advance iterator to next bit, for next call */
+					if (match_bitno == SBS_BITS_PER_BITMAPWORD - 1)
+					{
+						sbs->iter_wordno++;
+						sbs->iter_bitno = 0;
+					}
+					else
+						sbs->iter_bitno = match_bitno + 1;
+
+					*found = true;
+					return result;
+				}
+
+				/* in subsequent words, consider all bits */
+				mask = (~(sbsword) 0);
+
+				sbs->iter_wordno++;
+				sbs->iter_bitno = 0;
+			}
+
+			/* No more matches on this bitmap. Step to the next node. */
+			sbs->iter_node = sbs->iter_node->next;
+			sbs->iter_itemno = 0;
+			sbs->iter_wordno = 0;
+			sbs->iter_bitno = 0;
+			continue;
+		}
+		else
+			elog(ERROR, "invalid sparse bitset leaf node type %d", sbs->iter_node->type);
+	}
+
+	/* No more results. */
+	*found = false;
+	return 0;
+}
+
+
+/*
+ * sbs_binsrch_u64() -- search a sorted array of uint64s
+ *
+ * Returns the first position with key equal or less than the given key.
+ * The returned position would be the "insert" location for the given key,
+ * that is, the position where the new key should be inserted to.
+ *
+ * 'nextkey' affects the behavior on equal keys.  If true, and there is an
+ * equal key in the array, this returns the position immediately after the
+ * equal key. If false, this returns the position of the equal key itself.
+ */
+static inline int
+sbs_binsrch_u64(uint64 item, uint64 *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/* same, but for an array of uint32 */
+static inline int
+sbs_binsrch_u32(uint32 item, uint32 *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/*
+ * Get the first item on a node.
+ *
+ * This knows how to deal with all node types.
+ */
+static uint64
+sbs_node_get_first_item(sbs_node *node)
+{
+	sbs_internal_node *internal;
+	sbs_leaf_node *leaf;
+
+	Assert(node->num_items > 0);
+
+	switch(node->type)
+	{
+		case SBS_INTERNAL:
+			internal = (sbs_internal_node *) node;
+
+			return internal->items[0];
+
+		case SBS_LEAF_U64:
+			leaf = (sbs_leaf_node *) node;
+
+			return leaf->items.u64[0];
+
+		case SBS_LEAF_U32:
+			leaf = (sbs_leaf_node *) node;
+
+			return leaf->prefix + leaf->items.u32[0];
+
+		case SBS_LEAF_BITMAP:
+			leaf = (sbs_leaf_node *) node;
+
+			/*
+			 * Currently, sbs_add_member() always chooses the prefix based on
+			 * the first item on the node. Therefore, the first bit should
+			 * always be set, and we can just return 'prefix'.
+			 */
+			Assert(leaf->items.words[0] & 1);
+
+			return leaf->prefix;
+
+		default:
+			elog(ERROR, "invalid sparse bitset node type %d", node->type);
+	}
+	return 0;
+}
diff --git a/src/include/lib/sparsebitset.h b/src/include/lib/sparsebitset.h
new file mode 100644
index 00000000000..7449f189f60
--- /dev/null
+++ b/src/include/lib/sparsebitset.h
@@ -0,0 +1,26 @@
+/*
+ * sparsebitset.h
+ *	  Data structure to hold a large set of integers efficiently
+ *
+ * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group
+ *
+ * src/include/lib/sparsebitset.h
+ */
+#ifndef SPARSEBITSET_H
+#define SPARSEBITSET_H
+
+typedef struct SparseBitset SparseBitset;
+
+extern SparseBitset *sbs_create(void);
+extern void sbs_free(SparseBitset *sbs);
+
+extern void sbs_add_member(SparseBitset *sbs, uint64 newitem);
+extern bool sbs_is_member(SparseBitset *sbs, uint64 item);
+
+extern uint64 sbs_num_entries(SparseBitset *sbs);
+extern uint64 sbs_memory_usage(SparseBitset *sbs);
+
+extern void sbs_begin_iterate(SparseBitset *sbs);
+extern uint64 sbs_iterate_next(SparseBitset *sbs, bool *found);
+
+#endif							/* SPARSEBITSET_H */
-- 
2.20.1

0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patchtext/x-patch; name=0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patchDownload

From 705980a0b7d38ee0f33a76de8cdc8cd0e7579436 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 13 Mar 2019 21:08:35 +0200
Subject: [PATCH 2/2] Andrey Borodin's test_blockset tool, adapted for
 SparseBitset

WIP: I don't know if we want to commit something like this in the end, but
it's pretty useful for manual testing.

from https://www.postgresql.org/message-id/FD68CBB1-716F-4344-9E3F-C27C14B8678B%40yandex-team.ru
---
 src/test/modules/test_blockset/.gitignore     |   4 +
 src/test/modules/test_blockset/Makefile       |  21 ++
 src/test/modules/test_blockset/README         |   8 +
 .../test_blockset/expected/test_blockset.out  |   7 +
 .../test_blockset/sql/test_blockset.sql       |   3 +
 .../test_blockset/test_blockset--1.0.sql      |   8 +
 .../modules/test_blockset/test_blockset.c     | 236 ++++++++++++++++++
 .../test_blockset/test_blockset.control       |   4 +
 src/test/regress/expected/gist.out            |   4 +-
 src/test/regress/sql/gist.sql                 |   4 +-
 10 files changed, 293 insertions(+), 6 deletions(-)
 create mode 100644 src/test/modules/test_blockset/.gitignore
 create mode 100644 src/test/modules/test_blockset/Makefile
 create mode 100644 src/test/modules/test_blockset/README
 create mode 100644 src/test/modules/test_blockset/expected/test_blockset.out
 create mode 100644 src/test/modules/test_blockset/sql/test_blockset.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset--1.0.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset.c
 create mode 100644 src/test/modules/test_blockset/test_blockset.control

diff --git a/src/test/modules/test_blockset/.gitignore b/src/test/modules/test_blockset/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_blockset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_blockset/Makefile b/src/test/modules/test_blockset/Makefile
new file mode 100644
index 00000000000..091cf8c0958
--- /dev/null
+++ b/src/test/modules/test_blockset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_blockset/Makefile
+
+MODULE_big = test_blockset
+OBJS = test_blockset.o $(WIN32RES)
+PGFILEDESC = "test_blockset - test code for block set library"
+
+EXTENSION = test_blockset
+DATA = test_blockset--1.0.sql
+
+REGRESS = test_blockset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_blockset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_blockset/README b/src/test/modules/test_blockset/README
new file mode 100644
index 00000000000..730951ff03c
--- /dev/null
+++ b/src/test/modules/test_blockset/README
@@ -0,0 +1,8 @@
+test_blockset overview
+=========================
+
+test_blockset is a test harness module for testing block set data structure.
+There are two main tests:
+1. Test of comliance with Bitmapset on numbers avaiable to Bitmapset
+2. Test of numbers that can exceed INT32_MAX but are just shifted on one bit
+from numbers kept in Bitmapset
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/expected/test_blockset.out b/src/test/modules/test_blockset/expected/test_blockset.out
new file mode 100644
index 00000000000..02c29d5fc0c
--- /dev/null
+++ b/src/test/modules/test_blockset/expected/test_blockset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_blockset;
+SELECT test_blockset();
+ test_blockset 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_blockset/sql/test_blockset.sql b/src/test/modules/test_blockset/sql/test_blockset.sql
new file mode 100644
index 00000000000..85d2886676e
--- /dev/null
+++ b/src/test/modules/test_blockset/sql/test_blockset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_blockset;
+
+SELECT test_blockset();
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/test_blockset--1.0.sql b/src/test/modules/test_blockset/test_blockset--1.0.sql
new file mode 100644
index 00000000000..04eea8a6146
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_blockset/test_blockset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_blockset" to load this file. \quit
+
+CREATE FUNCTION test_blockset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_blockset/test_blockset.c b/src/test/modules/test_blockset/test_blockset.c
new file mode 100644
index 00000000000..218942a587b
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.c
@@ -0,0 +1,236 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_blockset.c
+ *		Test block set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_blockset/test_blockset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/sparsebitset.h"
+#include "nodes/bitmapset.h"
+#include "utils/memutils.h"
+#include "storage/block.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_blockset);
+
+static void test_blockset_bms_compliance(int32_t);
+static void test_blockset_big_block_numbers(int32_t);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_blockset(PG_FUNCTION_ARGS)
+{
+	test_blockset_bms_compliance(0);
+	test_blockset_bms_compliance(1);
+	test_blockset_bms_compliance(2);
+	test_blockset_bms_compliance(1337);
+	test_blockset_bms_compliance(100000);
+	test_blockset_big_block_numbers(1337);
+	test_blockset_big_block_numbers(31337);
+	test_blockset_big_block_numbers(5000000);
+	PG_RETURN_VOID();
+}
+
+/*
+ * This test creates random bitmap with test_limit members
+ * and checks that block set behavior is similar to Bitmapset
+ */
+static void test_blockset_bms_compliance(int32_t test_limit)
+{
+	SparseBitset *sbs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	sbs = sbs_create();
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	/*
+	 * SparseBitset requires that the items are added in-order, so we populate
+	 * the SparseBitset after the Bitmapset, by iterating through all the values
+	 * from the Bitmapset in the right order.
+	 */
+	{
+		int x = -1;
+
+		while ((x = bms_next_member(bms, x)) >= 0)
+			sbs_add_member(sbs, (BlockNumber) (x));
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	sbs_begin_iterate(sbs);
+	while (true)
+	{
+		bool found;
+		BlockNumber next_bn = sbs_iterate_next(sbs, &found);
+		int next_index = bms_next_member(bms, index);
+
+		if (!found)
+			next_bn = InvalidBlockNumber;
+
+		point_index++;
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			return; /* We have found everything */
+
+		if (((BlockNumber)next_index) != next_bn)
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!sbs_is_member(sbs, next_bn))
+		{
+			elog(ERROR,
+				 "Block set did not found present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (sbs_is_member(sbs, blockno) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did agree with bitmapset item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	sbs_free(sbs);
+	bms_free(bms);
+}
+
+/* 
+ * This test is similar to test_blockset_bms_compliance()
+ * except that it shifts BlockNumbers by one bit to ensure that blockset
+ * operates correctly on values higher that INT32_MAX
+ * This function is copy-pasted from previous with the exception of barrel
+ * shifts for BlockNumbers. I've tried various refactorings, but they all
+ * looked ugly.
+ */
+static void test_blockset_big_block_numbers(int32_t test_limit)
+{
+	SparseBitset *sbs;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+	MemoryContext old_cxt;
+
+	MemoryContext test_cxt;
+	
+	fprintf(stderr, "TEST %d:\n", test_limit);
+
+	test_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									 "blockset test",
+									 ALLOCSET_SMALL_SIZES);
+	old_cxt = MemoryContextSwitchTo(test_cxt);
+	sbs = sbs_create();
+	MemoryContextSwitchTo(old_cxt);
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bms = bms_add_member(bms, (int)blockno);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	/*
+	 * SparseBitset requires that the items are added in-order, so we populate
+	 * the SparseBitset after the Bitmapset, by iterating through all the values
+	 * from the Bitmapset in the right order.
+	 */
+	{
+		int x = -1;
+
+		while ((x = bms_next_member(bms, x)) >= 0)
+			sbs_add_member(sbs, ((BlockNumber) (x)) << 1);
+	}
+
+	MemoryContextStats(test_cxt);
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	sbs_begin_iterate(sbs);
+	while (true)
+	{
+		bool		found;
+		BlockNumber next_bn = sbs_iterate_next(sbs, &found);
+		int next_index = bms_next_member(bms, index);
+
+		if (!found)
+			next_bn = InvalidBlockNumber;
+		
+		point_index++;
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* We have found everything */
+
+		if (((BlockNumber)next_index) != (next_bn >> 1))
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!sbs_is_member(sbs, next_bn))
+		{
+			elog(ERROR,
+				 "Block set did not found present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (sbs_is_member(sbs, blockno << 1) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did agree with bitmapset item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	sbs_free(sbs);
+	bms_free(bms);
+}
diff --git a/src/test/modules/test_blockset/test_blockset.control b/src/test/modules/test_blockset/test_blockset.control
new file mode 100644
index 00000000000..fdb7598c5a7
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for block set library'
+default_version = '1.0'
+module_pathname = '$libdir/test_blockset'
+relocatable = true
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..5b92f08c747 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..e66396e851b 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.20.1

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Heikki Linnakangas (#1)

Re: Sparse bit set data structure

On Wed, Mar 13, 2019 at 3:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I started to consider rewriting the data structure into something more
like B-tree. Then I remembered that I wrote a data structure pretty much
like that last year already! We discussed that on the "Vacuum: allow
usage of more than 1GB of work mem" thread [2], to replace the current
huge array that holds the dead TIDs during vacuum.

So I dusted off that patch, and made it more general, so that it can be
used to store arbitrary 64-bit integers, rather than ItemPointers or
BlockNumbers. I then added a rudimentary form of compression to the leaf
pages, so that clusters of nearby values can be stored as an array of
32-bit integers, or as a bitmap. That would perhaps be overkill, if it
was just to conserve some memory in GiST vacuum, but I think this will
turn out to be a useful general-purpose facility.

Yeah, that sounds pretty cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: Heikki Linnakangas (#1)

Re: Sparse bit set data structure

Hi!

14 марта 2019 г., в 0:18, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
<0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patch><0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patch>

That is very interesting idea. Basically, B-tree and radix tree is a tradeoff between space and time.

In general, lookup into radix tree will touch less CPU cache lines.
In this terms Bitmapset is on most performant and memory-wasteful side: lookup into Bitmapset is always 1 cache line.
Performance of radix tree can be good in case of skewed distribution, while B-tree will be OK on uniform. I think that distribution of GiST inner pages is uniform, distribution of empty leafs is... I have no idea, let's consider uniform too.

I'd review this data structure ASAP. I just need to understand: do we aim to v12 or v13? (I did not solve concurrency issues in GiST VACUUM yet, but will fix them at weekend)

Also, maybe we should consider using RoaringBitmaps? [0]https://github.com/RoaringBitmap/CRoaring
As a side not I would add that while balanced trees are widely used for operations on external memory, there are more performant versions for main memory. Like AVL-tree and RB-tree.

Brest regards, Andrey Borodin.

[0]: https://github.com/RoaringBitmap/CRoaring

Heikki Linnakangas

hlinnaka@iki.fi

almost 7 years ago

In reply to: Andrey Borodin (#3)

Re: Sparse bit set data structure

On 14/03/2019 07:15, Andrey Borodin wrote:

14 марта 2019 г., в 0:18, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
<0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patch><0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patch>

That is very interesting idea. Basically, B-tree and radix tree is a tradeoff between space and time.

In general, lookup into radix tree will touch less CPU cache lines.
In this terms Bitmapset is on most performant and memory-wasteful side: lookup into Bitmapset is always 1 cache line.
Performance of radix tree can be good in case of skewed distribution, while B-tree will be OK on uniform. I think that distribution of GiST inner pages is uniform, distribution of empty leafs is... I have no idea, let's consider uniform too.

Yeah. In this implementation, the leaf nodes are packed into bitmaps
when possible, so it should perform quite well on skewed distributions, too.

I'd review this data structure ASAP. I just need to understand: do we aim to v12 or v13? (I did not solve concurrency issues in GiST VACUUM yet, but will fix them at weekend)

I'm aiming v12 with this. It's a fairly large patch, but it's very
isolated. I think the most pressing issue is getting the rest of the
GiST vacuum patch fixed. If you get that fixed over the weekend, I'll
take another look at it on Monday.

Also, maybe we should consider using RoaringBitmaps? [0]
As a side not I would add that while balanced trees are widely used for operations on external memory, there are more performant versions for main memory. Like AVL-tree and RB-tree.

Hmm. Yeah, this is quite similar to Roaring Bitmaps. Roaring bitmaps
also have a top-level, at which you binary search, and "leaf" nodes
which can be bitmaps or arrays. In a roaring bitmap, the key space is
divided into fixed-size chunks, like in a radix tree, but different from
a B-tree.

Even if we used AVL-trees or RB-trees or something else for the top
layers of the tree, I think at the bottom level, we'd still need to use
sorted arrays or bitmaps, to get the density we want. So I think the
implementation at the leaf level would look pretty much the same, in any
case. And the upper levels don't take very much space, regardless of the
implementation. So I don't think it matters much.

- Heikki

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Heikki Linnakangas (#1)

Re: Sparse bit set data structure

On Wed, Mar 13, 2019 at 8:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I started to consider rewriting the data structure into something more
like B-tree. Then I remembered that I wrote a data structure pretty much
like that last year already! We discussed that on the "Vacuum: allow
usage of more than 1GB of work mem" thread [2], to replace the current
huge array that holds the dead TIDs during vacuum.

So I dusted off that patch, and made it more general, so that it can be
used to store arbitrary 64-bit integers, rather than ItemPointers or
BlockNumbers. I then added a rudimentary form of compression to the leaf
pages, so that clusters of nearby values can be stored as an array of
32-bit integers, or as a bitmap. That would perhaps be overkill, if it
was just to conserve some memory in GiST vacuum, but I think this will
turn out to be a useful general-purpose facility.

I had a quick look at it, so I thought first comments could be helpful.

+ * If you change this, you must recalculate MAX_INTERVAL_LEVELS, too!
+ *   MAX_INTERNAL_ITEMS ^ MAX_INTERNAL_LEVELS >= 2^64.

I think that MAX_INTERVAL_LEVELS was a typo for MAX_INTERNAL_LEVELS,
which has probably been renamed to MAX_TREE_LEVELS in this patch.

+ * with varying levels of "compression".  Which one is used depending on the
+ * values stored.

depends on?

+       if (newitem <= sbs->last_item)
+           elog(ERROR, "cannot insert to sparse bitset out of order");

Is there any reason to disallow inserting duplicates? AFAICT nothing
prevents that in the current code. If that's intended, that probably
should be documented.

Nothing struck me other than that, that's a pretty nice new lib :)

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Julien Rouhaud (#5)

Re: Sparse bit set data structure

On Thu, Mar 14, 2019 at 4:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

+       if (newitem <= sbs->last_item)
+           elog(ERROR, "cannot insert to sparse bitset out of order");
Is there any reason to disallow inserting duplicates? AFAICT nothing
prevents that in the current code. If that's intended, that probably
should be documented.

That of course won't work well with SBS_LEAF_BITMAP. I'd still prefer
a more explicit error message.

Heikki Linnakangas

hlinnaka@iki.fi

almost 7 years ago

In reply to: Julien Rouhaud (#5)

1 attachment(s)

Re: Sparse bit set data structure

On 14/03/2019 17:37, Julien Rouhaud wrote:

On Wed, Mar 13, 2019 at 8:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I started to consider rewriting the data structure into something more
like B-tree. Then I remembered that I wrote a data structure pretty much
like that last year already! We discussed that on the "Vacuum: allow
usage of more than 1GB of work mem" thread [2], to replace the current
huge array that holds the dead TIDs during vacuum.

So I dusted off that patch, and made it more general, so that it can be
used to store arbitrary 64-bit integers, rather than ItemPointers or
BlockNumbers. I then added a rudimentary form of compression to the leaf
pages, so that clusters of nearby values can be stored as an array of
32-bit integers, or as a bitmap. That would perhaps be overkill, if it
was just to conserve some memory in GiST vacuum, but I think this will
turn out to be a useful general-purpose facility.

I had a quick look at it, so I thought first comments could be helpful.

Thanks!

+       if (newitem <= sbs->last_item)
+           elog(ERROR, "cannot insert to sparse bitset out of order");
Is there any reason to disallow inserting duplicates? AFAICT nothing
prevents that in the current code. If that's intended, that probably
should be documented.

Yeah, we could easily allow setting the last item again. It would be a
no-op, though, which doesn't seem very useful. It would be useful to
lift the limitation that the values have to be added in ascending order,
but current users that we're thinking of don't need it. Let's add that
later, if the need arises.

Or did you mean that the structure would be a "bag" rather than a "set",
so that it would keep the duplicates? I don't think that would be good.
I guess the vacuum code that this will be used in wouldn't care either
way, but "set" seems like a more clean concept.

On 13/03/2019 21:18, I wrote:

I'll do some more performance testing on this, to make sure it performs
well enough on random lookups, to also replace VACUUM's dead item
pointer array.

Turns out, it didn't perform very well for that use case. I tested with
distributions where you have clusters of 1-200 integers, at 2^16
intervals. That's close to the distribution of ItemPointers in a VACUUM,
where you have 1-200 (dead) items per page, and the offset number is
stored in the low 16 bits. It used slightly less memory than the plain
array of ItemPointers that we use today, but the code to use a bitmap at
the leaf level hardly ever kicks in, because there just isn't ever
enough set bits for that to win. In order to get the dense packing, it
needs to be done at a much more fine-grained fashion.

So I rewrote the way the leaf nodes work, so that the leaf nodes no
longer use a bitmap, but a simple array of items, like on internal
nodes. To still get the dense packing, the leaf items are packed using
an algorithm called Simple-8b, which can encode between 1-240 integers
in a single 64-bit word, depending on how far the integers are from each
other. That works much better, and actually makes the code simpler, too.

I renamed this thing to IntegerSet. That seems like a more accurate name
than the "sparse bitset" that I used call it. There aren't any "bits"
visible in the public interface of this, after all.

I improved the regression tests, so that it tests all the interface
functions, and covers various corner-cases. It tests the set with
different patterns of integers, and it can print the memory usage and
execution times of adding values to the set, probing random values, and
iterating through the set. That is a useful micro-benchmark. The speed
of all the operations seem to be in the same ballpark as with a simple
sorted array, but it uses much less memory.

I'm now pretty satisfied with this. Barring objections, I'll commit this
in the next few days. Please review, if you have a chance.

- Heikki

Attachments:

0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchtext/x-patch; name=0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchDownload

From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/1] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manully, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
---
 src/backend/lib/Makefile                      |    2 +-
 src/backend/lib/README                        |    2 +
 src/backend/lib/integerset.c                  | 1039 +++++++++++++++++
 src/include/lib/integerset.h                  |   25 +
 src/test/modules/Makefile                     |    1 +
 src/test/modules/test_integerset/.gitignore   |    4 +
 src/test/modules/test_integerset/Makefile     |   21 +
 src/test/modules/test_integerset/README       |    7 +
 .../expected/test_integerset.out              |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |    8 +
 .../modules/test_integerset/test_integerset.c |  622 ++++++++++
 .../test_integerset/test_integerset.control   |    4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+       ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -13,6 +13,8 @@ hyperloglog.c - a streaming cardinality estimator
 
 ilist.c - single and double-linked lists
 
+integerset.c - a data structure for holding large set of integers
+
 knapsack.c - knapsack problem solver
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
new file mode 100644
index 00000000000..c9172fa2005
--- /dev/null
+++ b/src/backend/lib/integerset.c
@@ -0,0 +1,1039 @@
+/*-------------------------------------------------------------------------
+ *
+ * integerset.c
+ *	  Data structure to hold a large set of 64-bit integers efficiently
+ *
+ * IntegerSet provides an in-memory data structure to hold a set of
+ * arbitrary 64-bit integers.  Internally, the values are stored in a
+ * B-tree, with a special packed representation at the leaf level using
+ * the Simple-8b algorithm, which can pack hold clusters of nearby values
+ * very tightly.
+ *
+ * Memory consumption depends on the number of values stored, but also
+ * on how far the values are from each other.  In the best case, with
+ * long runs of consecutive integers, memory consumption can be as low as
+ * 0.1 bytes per integer.  In the worst case, if integers are more than
+ * 2^32 apart, it uses about 8 bytes per integer.  In typical use, the
+ * consumption per integer is somewhere between those extremes, depending
+ * on the range of integers stored, and how "clustered" they are.
+ *
+ *
+ * Interface
+ * ---------
+ *
+ *	intset_create			- Create a new empty set.
+ *	intset_add_member		- Add an integer to the set.
+ *	intset_is_member		- Test if an integer is in the set
+ *	intset_begin_iterate	- Begin iterating through all integers in set
+ *	intset_iterate_next		- Return next integer
+ *
+ *
+ * Limitations
+ * -----------
+ *
+ * - Values must be added in order.  (Random insertions would require
+ *   splitting nodes, which hasn't been implemented.)
+ *
+ * - Values cannot be added while iteration is in progress.
+ *
+ * - No support for removing values.
+ *
+ * None of these limitations are fundamental to the data structure, and
+ * could be lifted if needed, by writing some new code.  But the current
+ * users of this facility don't need them.
+ *
+ *
+ * References
+ * ----------
+ *
+ * Simple-8b encoding is based on:
+ *
+ * Vo Ngoc Anh , Alistair Moffat, Index compression using 64-bit words,
+ *   Software - Practice & Experience, v.40 n.2, p.131-147, February 2010
+ *   (https://doi.org/10.1002/spe.948)
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/integerset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "lib/integerset.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Properties of Simple-8b encoding.  (These are needed here, before
+ * other definitions, so that we can size other arrays accordingly).
+ *
+ * SIMPLE8B_MAX_VALUE is the greatest integer that can be encoded.  Simple-8B
+ * uses 64-bit words, but uses four bits to indicate the "mode" of the
+ * codeword, leaving at most 60 bits for the actual integer.
+ *
+ * SIMPLE8B_MAX_VALUES_PER_CODEWORD is the maximum number of integers that
+ * can be encoded in a single codeword.
+ */
+#define SIMPLE8B_MAX_VALUE		((1L << 60) - 1)
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * Parameters for shape of the in-memory B-tree.
+ *
+ * These set the size of each internal and leaf node.  They don't necessarily
+ * need to be the same, because the tree is just an in-memory structure.
+ * With the default 64, each node is about 1 kb.
+ *
+ * If you change these, you must recalculate MAX_TREE_LEVELS, too!
+ */
+#define MAX_INTERNAL_ITEMS	64
+#define MAX_LEAF_ITEMS	64
+
+/*
+ * Maximum height of the tree.
+ *
+ * MAX_TREE_ITEMS is calculated from the "fan-out" of the B-tree.  The
+ * theoretical maximum number of items that we can store in a set is 2^64,
+ * so MAX_TREE_LEVELS should be set so that:
+ *
+ *   MAX_LEAF_ITEMS * MAX_INTERNAL_ITEMS ^ (MAX_TREE_LEVELS - 1) >= 2^64.
+ *
+ * In practice, we'll need far fewer levels, because you will run out of
+ * memory long before reaching that number, but let's be conservative.
+ */
+#define MAX_TREE_LEVELS		11
+
+/*
+ * Node structures, for the in-memory B-tree.
+ *
+ * An internal node holds a number of downlink pointers to leaf nodes, or
+ * to internal nodes on lower level.  For each downlink, the key value
+ * corresponding the lower level node is stored in a sorted array.  The
+ * stored key values are low keys.  In other words, if the downlink has value
+ * X, then all items stored on that child are >= X.
+ *
+ * Each leaf node holds a number of "items", with a varying number of
+ * integers packed into each item.  Each item consists of two 64-bit words:
+ * The first word holds first integer stored in the item, in plain format.
+ * The second word contains between 0 and 240 more integers, packed using
+ * Simple-8b encoding.  By storing the first integer in plain, unpacked,
+ * format, we can use binary search to quickly find an item that holds (or
+ * would hold) a particular integer.  And by storing the rest in packed form,
+ * we still get pretty good memory density, if there are clusters of integers
+ * with similar values.
+ *
+ * Each leaf node also has a pointer to the next leaf node, so that the leaf
+ * nodes can be easily walked from beginning to end, when iterating.
+ */
+typedef struct intset_node intset_node;
+typedef struct intset_leaf_node intset_leaf_node;
+typedef struct intset_internal_node intset_internal_node;
+
+/* Common structure of both leaf and internal nodes. */
+struct intset_node
+{
+	uint16		level;
+	uint16		num_items;
+};
+
+/* Internal node */
+struct intset_internal_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* >= 1 on internal nodes */
+	uint16		num_items;
+
+	/*
+	 * 'values' is an array of key values, and 'downlinks' are pointers
+	 * to lower-level nodes, corresponding to the key values.
+	 */
+	uint64		values[MAX_INTERNAL_ITEMS];
+	intset_node *downlinks[MAX_INTERNAL_ITEMS];
+};
+
+/* Leaf node */
+typedef struct
+{
+	uint64		first;		/* first integer in this item */
+	uint64		codeword;	/* simple8b encoded differences from 'first' */
+} leaf_item;
+
+#define MAX_VALUES_PER_LEAF_ITEM	(1 + SIMPLE8B_MAX_VALUES_PER_CODEWORD)
+
+struct intset_leaf_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* 0 on leafs */
+	uint16		num_items;
+
+	intset_leaf_node *next;	/* right sibling, if any */
+
+	leaf_item	items[MAX_LEAF_ITEMS];
+};
+
+/*
+ * We buffer insertions in a simple array, before packing and inserting them
+ * into the B-tree.  MAX_BUFFERED_VALUES sets the size of the buffer.  The
+ * encoder assumes that it is large enough, that we can always fill a leaf
+ * item with buffered new items.  In other words, MAX_BUFFERED_VALUES must be
+ * larger than MAX_VALUES_PER_LEAF_ITEM.
+ */
+#define MAX_BUFFERED_VALUES			(MAX_VALUES_PER_LEAF_ITEM * 2)
+
+/*
+ * IntegerSet is the top-level object representing the set.
+ *
+ * The integers are stored in an in-memory B-tree structure, and an array
+ * for newly-added integers.  IntegerSet also tracks information about memory
+ * usage, as well as the current position, when iterating the set with
+ * intset_begin_iterate / intset_iterate_next.
+ */
+struct IntegerSet
+{
+	/*
+	 * 'context' is a dedicated memory context, used to hold the IntegerSet
+	 * struct itself, as well as all the tree nodes.
+	 *
+	 * 'mem_used' tracks the amount of memory used.  We don't do anything with
+	 * it in integerset.c itself, but the callers can ask for it with
+	 * intset_memory_usage().
+	 */
+	MemoryContext context;		/* memory context holding everything */
+	uint64		mem_used;		/* amount of memory used */
+
+	uint64		num_entries;	/* total # of values in the set */
+	uint64		highest_value;	/* highest value stored in this set */
+
+	/*
+	 * B-tree to hold the packed values.
+	 *
+	 * 'rightmost_nodes' hold pointers to the rightmost node on each level.
+	 * rightmost_parent[0] is rightmost leaf, rightmost_parent[1] is its
+	 * parent, and so forth, all the way up to the root. These are needed when
+	 * adding new values. (Currently, we require that new values are added at
+	 * the end.)
+	 */
+	int			num_levels;		/* height of the tree */
+	intset_node *root;			/* root node */
+	intset_node *rightmost_nodes[MAX_TREE_LEVELS];
+	intset_leaf_node *leftmost_leaf;	/* leftmost leaf node */
+
+	/*
+	 * Holding area for new items that haven't been inserted to the tree yet.
+	 */
+	uint64		buffered_values[MAX_BUFFERED_VALUES];
+	int			num_buffered_values;
+
+	/*
+	 * Iterator support.
+	 *
+	 * 'iter_values' is an array of integers ready to be returned to the
+	 * caller.  'item_node' and 'item_itemno' point to the leaf node, and
+	 * item within the leaf node, to get the next batch of values from.
+	 *
+	 * Normally, 'iter_values' points 'iter_values_buf', which holds items
+	 * decoded from a leaf item. But after we have scanned the whole B-tree,
+	 * we iterate through all the unbuffered values, too, by pointing
+	 * iter_values to 'buffered_values'.
+	 */
+	uint64	   *iter_values;
+	int			iter_num_values;	/* number of elements in 'iter_values' */
+	int			iter_valueno;		/* index into 'iter_values' */
+	intset_leaf_node *iter_node;	/* current leaf node */
+	int			iter_itemno;		/* next item 'iter_node' to decode */
+
+	uint64		iter_values_buf[MAX_VALUES_PER_LEAF_ITEM];
+};
+
+/*
+ * prototypes for internal functions.
+ */
+static void intset_update_upper(IntegerSet *intset, int level,
+				 intset_node *new_node, uint64 new_node_item);
+static void intset_flush_buffered_values(IntegerSet *intset);
+
+static int intset_binsrch_uint64(uint64 value, uint64 *arr, int arr_elems,
+				bool nextkey);
+static int intset_binsrch_leaf(uint64 value, leaf_item *arr, int arr_elems,
+				bool nextkey);
+
+static uint64 simple8b_encode(uint64 *ints, int *num_encoded, uint64 base);
+static int simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base);
+static bool simple8b_contains(uint64 codeword, uint64 key, uint64 base);
+
+
+/*
+ * Create a new, initially empty, integer set.
+ */
+IntegerSet *
+intset_create(void)
+{
+	MemoryContext context;
+	IntegerSet *intset;
+
+	/*
+	 * Create a new memory context to hold everything.
+	 *
+	 * We never free any nodes, so the generational allocator works well for
+	 * us.
+	 *
+	 * Use a large block size, in the hopes that if we use a lot of memory,
+	 * the libc allocator will give it back to the OS when we free it, rather
+	 * than add it to a free-list. (On glibc, see M_MMAP_THRESHOLD.  As of this
+	 * writing, the effective threshold is somewhere between 128 kB and 4 MB.)
+	 */
+	context = GenerationContextCreate(CurrentMemoryContext,
+									  "integer set",
+									  SLAB_LARGE_BLOCK_SIZE);
+
+	/* Allocate the IntegerSet object itself, in the context. */
+	intset = (IntegerSet *) MemoryContextAlloc(context, sizeof(IntegerSet));
+	intset->context = context;
+	intset->mem_used = GetMemoryChunkSpace(intset);
+
+	intset->num_entries = 0;
+	intset->highest_value = 0;
+
+	intset->num_levels = 0;
+	intset->root = NULL;
+	memset(intset->rightmost_nodes, 0, sizeof(intset->rightmost_nodes));
+	intset->leftmost_leaf = NULL;
+
+	intset->num_buffered_values = 0;
+
+	intset->iter_node = NULL;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+
+	return intset;
+}
+
+/*
+ * Allocate a new node.
+ */
+static intset_internal_node *
+intset_new_internal_node(IntegerSet *intset)
+{
+	intset_internal_node *n;
+
+	n = (intset_internal_node *) MemoryContextAlloc(intset->context,
+													sizeof(intset_internal_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;		/* caller must set */
+	n->num_items = 0;
+
+	return n;
+}
+
+static intset_leaf_node *
+intset_new_leaf_node(IntegerSet *intset)
+{
+	intset_leaf_node *n;
+
+	n = (intset_leaf_node *) MemoryContextAlloc(intset->context,
+												sizeof(intset_leaf_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;
+	n->num_items = 0;
+	n->next = NULL;
+
+	return n;
+}
+
+/*
+ * Free the integer set
+ */
+void
+intset_free(IntegerSet *intset)
+{
+	/* everything is allocated in the memory context */
+	MemoryContextDelete(intset->context);
+}
+
+/*
+ * Return the number of entries in the integer set.
+ */
+uint64
+intset_num_entries(IntegerSet *intset)
+{
+	return intset->num_entries;
+}
+
+/*
+ * Return the amount of memory used by the integer set.
+ */
+uint64
+intset_memory_usage(IntegerSet *intset)
+{
+	return intset->mem_used;
+}
+
+/*
+ * Add a value to the set.
+ *
+ * Values must be added in order.
+ */
+void
+intset_add_member(IntegerSet *intset, uint64 x)
+{
+	if (intset->iter_node)
+		elog(ERROR, "cannot add new values to integer set when iteration is in progress");
+
+	if (x <= intset->highest_value && intset->num_entries > 0)
+		elog(ERROR, "cannot add value to integer set out of order");
+
+	if (intset->num_buffered_values >= MAX_BUFFERED_VALUES)
+	{
+		/* Time to flush our buffer */
+		intset_flush_buffered_values(intset);
+		Assert(intset->num_buffered_values < MAX_BUFFERED_VALUES);
+	}
+
+	/* Add it to the buffer of newly-added values */
+	intset->buffered_values[intset->num_buffered_values] = x;
+	intset->num_buffered_values++;
+	intset->num_entries++;
+	intset->highest_value = x;
+}
+
+/*
+ * Take a batch of buffered values, and pack them into the B-tree.
+ */
+static void
+intset_flush_buffered_values(IntegerSet *intset)
+{
+	uint64	   *values = intset->buffered_values;
+	uint64		num_values = intset->num_buffered_values;
+	int			num_packed = 0;
+	intset_leaf_node *leaf;
+
+	leaf = (intset_leaf_node *) intset->rightmost_nodes[0];
+
+	/*
+	 * If the tree is completely empty, create the first leaf page, which
+	 * is also the root.
+	 */
+	if (leaf == NULL)
+	{
+		/*
+		 * This is the very first item in the set.
+		 *
+		 * Allocate root node. It's also a leaf.
+		 */
+		leaf = intset_new_leaf_node(intset);
+
+		intset->root = (intset_node *) leaf;
+		intset->leftmost_leaf = leaf;
+		intset->rightmost_nodes[0] = (intset_node *) leaf;
+		intset->num_levels = 1;
+	}
+
+	/*
+	 * If there are less than MAX_VALUES_PER_LEAF_ITEM values in the
+	 * buffer, stop.  In most cases, we cannot encode that many values
+	 * in a single value, but this way, the encoder doesn't have to
+	 * worry about running out of input.
+	 */
+	while (num_values - num_packed >= MAX_VALUES_PER_LEAF_ITEM)
+	{
+		leaf_item	item;
+		int			num_encoded;
+
+		/*
+		 * Construct the next leaf item, packing as many buffered values
+		 * as possible.
+		 */
+		item.first = values[num_packed];
+		item.codeword = simple8b_encode(&values[num_packed + 1],
+										&num_encoded,
+										item.first);
+
+		/*
+		 * Add the item to the node, allocating a new node if the old one
+		 * is full.
+		 */
+		if (leaf->num_items >= MAX_LEAF_ITEMS)
+		{
+			/* Allocate new leaf and link it to the tree */
+			intset_leaf_node *old_leaf = leaf;
+
+			leaf = intset_new_leaf_node(intset);
+			old_leaf->next = leaf;
+			intset->rightmost_nodes[0] = (intset_node *) leaf;
+			intset_update_upper(intset, 1, (intset_node *) leaf, item.first);
+		}
+		leaf->items[leaf->num_items++] = item;
+
+		num_packed += 1 + num_encoded;
+	}
+
+	/*
+	 * Move any remaining buffered values to the beginning of the array.
+	 */
+	if (num_packed < intset->num_buffered_values)
+	{
+		memmove(&intset->buffered_values[0],
+				&intset->buffered_values[num_packed],
+				(intset->num_buffered_values - num_packed) * sizeof(uint64));
+	}
+	intset->num_buffered_values -= num_packed;
+}
+
+/*
+ * Insert a downlink into parent node, after creating a new node.
+ *
+ * Recurses if the parent node is full, too.
+ */
+static void
+intset_update_upper(IntegerSet *intset, int level, intset_node *child,
+					uint64 child_key)
+{
+	intset_internal_node *parent;
+
+	Assert(level > 0);
+
+	/*
+	 * Create a new root node, if necessary.
+	 */
+	if (level >= intset->num_levels)
+	{
+		intset_node *oldroot = intset->root;
+		uint64		downlink_key;
+
+		/* MAX_TREE_LEVELS should be more than enough, this shouldn't happen */
+		if (intset->num_levels == MAX_TREE_LEVELS)
+			elog(ERROR, "could not expand integer set, maximum number of levels reached");
+		intset->num_levels++;
+
+		/*
+		 * Get the first value on the old root page, to be used as the
+		 * downlink.
+		 */
+		if (intset->root->level == 0)
+			downlink_key = ((intset_leaf_node *) oldroot)->items[0].first;
+		else
+			downlink_key = ((intset_internal_node *) oldroot)->values[0];
+
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = downlink_key;
+		parent->downlinks[0] = oldroot;
+		parent->num_items = 1;
+
+		intset->root = (intset_node *) parent;
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+	}
+
+	/*
+	 * Place the downlink on the parent page.
+	 */
+	parent = (intset_internal_node *) intset->rightmost_nodes[level];
+
+	if (parent->num_items < MAX_INTERNAL_ITEMS)
+	{
+		parent->values[parent->num_items] = child_key;
+		parent->downlinks[parent->num_items] = child;
+		parent->num_items++;
+	}
+	else
+	{
+		/*
+		 * Doesn't fit.  Allocate new parent, with the downlink as the first
+		 * item on it, and recursively insert the downlink to the new parent
+		 * to the grandparent.
+		 */
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = child_key;
+		parent->downlinks[0] = child;
+		parent->num_items = 1;
+
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+
+		intset_update_upper(intset, level + 1, (intset_node *) parent, child_key);
+	}
+}
+
+/*
+ * Does the set contain the given value?
+ */
+bool
+intset_is_member(IntegerSet *intset, uint64 x)
+{
+	intset_node   *node;
+	intset_leaf_node *leaf;
+	int			level;
+	int			itemno;
+	leaf_item  *item;
+
+	/*
+	 * The value might be in the buffer of newly-added values.
+	 */
+	if (intset->num_buffered_values > 0 && x >= intset->buffered_values[0])
+	{
+		int			itemno;
+
+		itemno = intset_binsrch_uint64(x,
+									   intset->buffered_values,
+									   intset->num_buffered_values,
+									   false);
+		if (itemno >= intset->num_buffered_values)
+			return false;
+		else
+			return intset->buffered_values[itemno] == x;
+	}
+
+	/*
+	 * Start from the root, and walk down the B-tree to find the right leaf
+	 * node.
+	 */
+	if (!intset->root)
+		return false;
+	node = intset->root;
+	for (level = intset->num_levels - 1; level > 0; level--)
+	{
+		intset_internal_node *n = (intset_internal_node *) node;
+
+		Assert(node->level == level);
+
+		itemno = intset_binsrch_uint64(x, n->values, n->num_items, true);
+		if (itemno == 0)
+			return false;
+		node = n->downlinks[itemno - 1];
+	}
+	Assert(node->level == 0);
+	leaf = (intset_leaf_node *) node;
+
+	/*
+	 * Binary search the right item on the leaf page
+	 */
+	itemno = intset_binsrch_leaf(x, leaf->items, leaf->num_items, true);
+	if (itemno == 0)
+		return false;
+	item = &leaf->items[itemno - 1];
+
+	/* Is this a match to the first value on the item? */
+	if (item->first == x)
+		return true;
+	Assert(x > item->first);
+
+	/* Is it in the packed codeword? */
+	if (simple8b_contains(item->codeword, x, item->first))
+		return true;
+
+	return false;
+}
+
+/*
+ * Begin in-order scan through all the values.
+ *
+ * While the iteration is in-progress, you cannot add new values to the set.
+ */
+void
+intset_begin_iterate(IntegerSet *intset)
+{
+	intset->iter_node = intset->leftmost_leaf;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+	intset->iter_values = intset->iter_values_buf;
+}
+
+/*
+ * Returns the next integer, when iterating.
+ *
+ * intset_begin_iterate() must be called first.  intset_iterate_next() returns
+ * the next value in the set.  If there are no more values, *found is set
+ * to false.
+ */
+uint64
+intset_iterate_next(IntegerSet *intset, bool *found)
+{
+	for (;;)
+	{
+		if (intset->iter_valueno < intset->iter_num_values)
+		{
+			*found = true;
+			return intset->iter_values[intset->iter_valueno++];
+		}
+
+		/* Our queue is empty, decode next leaf item */
+		if (intset->iter_node && intset->iter_itemno < intset->iter_node->num_items)
+		{
+			/* We have reached end of this packed item.  Step to the next one. */
+			leaf_item  *item;
+			int			num_decoded;
+
+			item = &intset->iter_node->items[intset->iter_itemno++];
+
+			intset->iter_values[0] = item->first;
+			num_decoded = simple8b_decode(item->codeword, &intset->iter_values[1], item->first);
+			intset->iter_num_values = num_decoded + 1;
+
+			intset->iter_valueno = 0;
+			continue;
+		}
+
+		/* No more items on this leaf, step to next node */
+		if (intset->iter_node)
+		{
+			/* No more matches on this bucket. Step to the next node. */
+			intset->iter_node = intset->iter_node->next;
+			intset->iter_itemno = 0;
+			intset->iter_valueno = 0;
+			intset->iter_num_values = 0;
+			continue;
+		}
+
+		/*
+		 * We have reached the end of the B-tree.  But we might still have
+		 * some integers in the buffer of newly-added values.
+		 */
+		if (intset->iter_values == intset->iter_values_buf)
+		{
+			intset->iter_values = intset->buffered_values;
+			intset->iter_num_values = intset->num_buffered_values;
+			continue;
+		}
+
+		break;
+	}
+
+	/* No more results. */
+	*found = false;
+	return 0;
+}
+
+/*
+ * intset_binsrch_uint64() -- search a sorted array of uint64s
+ *
+ * Returns the first position with key equal or less than the given key.
+ * The returned position would be the "insert" location for the given key,
+ * that is, the position where the new key should be inserted to.
+ *
+ * 'nextkey' affects the behavior on equal keys.  If true, and there is an
+ * equal key in the array, this returns the position immediately after the
+ * equal key.  If false, this returns the position of the equal key itself.
+ */
+static int
+intset_binsrch_uint64(uint64 item, uint64 *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/* same, but for an array of leaf items */
+static int
+intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * Simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords".  The number of integers packed into a single codeword
+ * depends on the integers being packed: small integers are encoded using
+ * fewer bits than large integers.  A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers.  That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of the absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers
+ * packed into the remaining 60 bits.  The selector allows for 16 different
+ * ways of using the remaining 60 bits, "modes".  The number of integers
+ * packed into a single codeword is listed in the simple8b_modes table below.
+ * For example, consider the following codeword:
+ *
+ *      20-bit integer       20-bit integer       20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal.  From the modes table below, we see
+ * that it means that the codeword encodes three 12-bit integers.  In decimal,
+ * those integers are 18, 500000 and 20.  Because we encode deltas rather than
+ * absolute values, the actual values that they represent are 18,  500018 and
+ * 500038.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeros
+ * (which means 240 or 120 consecutive integers, since we're encoding the
+ * the deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits.  Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword.  If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items.  To represent that,
+ * we use a magic EMPTY_CODEWORD codeword.
+ */
+static const struct
+{
+	uint8		bits_per_int;
+	uint8		num_ints;
+} simple8b_modes[17] =
+{
+	{  0, 240 },	/* mode  0: 240 zeros */
+	{  0, 120 },	/* mode  1: 120 zeros */
+	{  1,  60 },	/* mode  2: sixty 1-bit integers */
+	{  2,  30 },	/* mode  3: thirty 2-bit integers */
+	{  3,  20 },	/* mode  4: twenty 3-bit integers */
+	{  4,  15 },	/* mode  5: fifteen 4-bit integers */
+	{  5,  12 },	/* mode  6: twelve 5-bit integers */
+	{  6,  10 },	/* mode  7: ten 6-bit integers */
+	{  7,   8 },	/* mode  8: eight 7-bit integers (four bits are wasted) */
+	{  8,   7 },	/* mode  9: seven 8-bit integers (four bits are wasted) */
+	{ 10,   6 },	/* mode 10: six 10-bit integers */
+	{ 12,   5 },	/* mode 11: five 12-bit integers */
+	{ 15,   4 },	/* mode 12: four 15-bit integers */
+	{ 20,   3 },	/* mode 13: three 20-bit integers */
+	{ 30,   2 },	/* mode 14: two 30-bit integers */
+	{ 60,   1 },	/* mode 15: one 60-bit integer */
+	{ 0,    0 }		/* sentinel value */
+};
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a 0-mode codeword,  but we check for it
+ * specifically.  (In a real 0-mode codeword, all the unused bits are zero.)
+ */
+#define EMPTY_CODEWORD		(0xFFFFFFFFFFFFFFF0)
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the number of integers that were encoded.
+ */
+static uint64
+simple8b_encode(uint64 *ints, int *num_encoded, uint64 base)
+{
+	int			selector;
+	int			nints;
+	int			bits;
+	uint64		diff;
+	uint64		last_val;
+	uint64		codeword;
+	uint64		diffs[60];
+	int			i;
+
+	Assert(ints[0] > base);
+
+	/*
+	 * Select the "mode" to use for the next codeword.
+	 *
+	 * In each iteration, check if the next value can be represented
+	 * in the current mode we're considering.  If it's too large, then
+	 * step up the mode to a wider one, and repeat.  If it fits, move
+	 * on to next integer.  Repeat until the codeword is full, given
+	 * the current mode.
+	 *
+	 * Note that we don't have any way to represent unused slots in the
+	 * codeword, so we require each codeword to be "full".
+	 */
+	selector = 0;
+	nints = simple8b_modes[0].num_ints;
+	bits = simple8b_modes[0].bits_per_int;
+	diff = ints[0] - base - 1;
+	last_val = ints[0];
+	i = 0;
+	for (;;)
+	{
+		if (diff >= (1L << bits))
+		{
+			/* too large, step up to next mode */
+			selector++;
+			nints = simple8b_modes[selector].num_ints;
+			bits = simple8b_modes[selector].bits_per_int;
+			if (i >= nints)
+				break;
+		}
+		else
+		{
+			if (i < 60)
+				diffs[i] = diff;
+			i++;
+			if (i >= nints)
+				break;
+
+			Assert(ints[i] > last_val);
+			diff = ints[i] - last_val - 1;
+			last_val = ints[i];
+		}
+	}
+
+	if (nints == 0)
+	{
+		/* The next value is too large and be encoded with Simple-8b */
+		Assert(i == 0);
+		*num_encoded = 0;
+		return EMPTY_CODEWORD;
+	}
+
+	/*
+	 * Encode the integers using the selected mode.  Note that we shift them
+	 * into the codeword in reverse order, so that they will come out in the
+	 * correct order in the decoder.
+	 */
+	codeword = 0;
+	if (bits > 0)
+	{
+		for (i = nints - 1; i >= 0; i--)
+		{
+			codeword <<= bits;
+			codeword |= diffs[i];
+		}
+	}
+
+	/* add selector to the codeword, and return */
+	codeword <<= 4;
+	codeword |= selector;
+
+	*num_encoded = nints;
+	return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ */
+static int
+simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	uint64		bits = simple8b_modes[selector].bits_per_int;
+	uint64		mask = (1L << bits) - 1;
+	uint64		prev_value;
+
+	if (codeword == EMPTY_CODEWORD)
+		return 0;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	prev_value = base;
+	for (int i = 0; i < nints; i++)
+	{
+		uint64		diff = codeword & mask;
+
+		decoded[i] = prev_value + 1L + diff;
+		prev_value = decoded[i];
+		codeword >>= bits;
+	}
+	return nints;
+}
+
+/*
+ * This is very similar to simple8b_decode(), but instead of decoding all
+ * the values to an array, it just checks if the given integer is part of
+ * the codeword.
+ */
+static bool
+simple8b_contains(uint64 codeword, uint64 key, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	int			bits = simple8b_modes[selector].bits_per_int;
+
+	if (codeword == EMPTY_CODEWORD)
+		return false;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	if (bits == 0)
+	{
+		/* Special handling for 0-bit cases. */
+		return key - base <= nints;
+	}
+	else
+	{
+		int			mask = (1L << bits) - 1;
+		uint64		prev_value;
+
+		prev_value = base;
+		for (int i = 0; i < nints; i++)
+		{
+			uint64		diff = codeword & mask;
+			uint64		curr_value;
+
+			curr_value = prev_value + 1L + diff;
+
+			if (curr_value >= key)
+			{
+				if (curr_value == key)
+					return true;
+				else
+					return false;
+			}
+
+			codeword >>= bits;
+			prev_value = curr_value;
+		}
+	}
+	return false;
+}
diff --git a/src/include/lib/integerset.h b/src/include/lib/integerset.h
new file mode 100644
index 00000000000..27aa3ee883c
--- /dev/null
+++ b/src/include/lib/integerset.h
@@ -0,0 +1,25 @@
+/*
+ * integerset.h
+ *	  In-memory data structure to hold a large set of integers efficiently
+ *
+ * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group
+ *
+ * src/include/lib/integerset.h
+ */
+#ifndef INTEGERSET_H
+#define INTEGERSET_H
+
+typedef struct IntegerSet IntegerSet;
+
+extern IntegerSet *intset_create(void);
+extern void intset_free(IntegerSet *intset);
+extern void intset_add_member(IntegerSet *intset, uint64 x);
+extern bool intset_is_member(IntegerSet *intset, uint64 x);
+
+extern uint64 intset_num_entries(IntegerSet *intset);
+extern uint64 intset_memory_usage(IntegerSet *intset);
+
+extern void intset_begin_iterate(IntegerSet *intset);
+extern uint64 intset_iterate_next(IntegerSet *intset, bool *found);
+
+#endif							/* INTEGERSET_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 19d60a506e1..dfd0956aee3 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  test_bloomfilter \
 		  test_ddl_deparse \
 		  test_extensions \
+		  test_integerset \
 		  test_parser \
 		  test_pg_dump \
 		  test_predtest \
diff --git a/src/test/modules/test_integerset/.gitignore b/src/test/modules/test_integerset/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_integerset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_integerset/Makefile b/src/test/modules/test_integerset/Makefile
new file mode 100644
index 00000000000..3b7c4999d6f
--- /dev/null
+++ b/src/test/modules/test_integerset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_integerset/Makefile
+
+MODULE_big = test_integerset
+OBJS = test_integerset.o $(WIN32RES)
+PGFILEDESC = "test_integerset - test code for src/backend/lib/integerset.c"
+
+EXTENSION = test_integerset
+DATA = test_integerset--1.0.sql
+
+REGRESS = test_integerset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_integerset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_integerset/README b/src/test/modules/test_integerset/README
new file mode 100644
index 00000000000..3e4226adb55
--- /dev/null
+++ b/src/test/modules/test_integerset/README
@@ -0,0 +1,7 @@
+test_integerset contains unit tests for testing the integer set implementation,
+in src/backend/lib/integerset.c
+
+The tests verify the correctness of the implemention, but they can also be
+as a micro-benchmark:  If you set the 'intset_tests_stats' flag in
+test_integerset.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_integerset/expected/test_integerset.out b/src/test/modules/test_integerset/expected/test_integerset.out
new file mode 100644
index 00000000000..d7c88ded092
--- /dev/null
+++ b/src/test/modules/test_integerset/expected/test_integerset.out
@@ -0,0 +1,14 @@
+CREATE EXTENSION test_integerset;
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as INFOs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+SELECT test_integerset();
+ test_integerset 
+-----------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_integerset/sql/test_integerset.sql b/src/test/modules/test_integerset/sql/test_integerset.sql
new file mode 100644
index 00000000000..34223afa885
--- /dev/null
+++ b/src/test/modules/test_integerset/sql/test_integerset.sql
@@ -0,0 +1,11 @@
+CREATE EXTENSION test_integerset;
+
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as INFOs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+
+SELECT test_integerset();
diff --git a/src/test/modules/test_integerset/test_integerset--1.0.sql b/src/test/modules/test_integerset/test_integerset--1.0.sql
new file mode 100644
index 00000000000..d6d5a3f6cf7
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_integerset/test_integerset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_integerset" to load this file. \quit
+
+CREATE FUNCTION test_integerset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_integerset/test_integerset.c b/src/test/modules/test_integerset/test_integerset.c
new file mode 100644
index 00000000000..24a2e08c0d1
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.c
@@ -0,0 +1,622 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_integerset.c
+ *		Test integer set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_integerset/test_integerset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/integerset.h"
+#include "nodes/bitmapset.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "miscadmin.h"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed.  That can be used as
+ * micro-benchmark of various operations and input patterns.
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool intset_test_stats = true;
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_integerset);
+
+/*
+ * A struct to define a pattern of integers, for use with test_pattern()
+ * function.
+ */
+typedef struct
+{
+	char	   *test_name;		/* short name of the test, for humans */
+	char	   *pattern_str;	/* a bit pattern */
+	uint64		spacing;		/* pattern repeats at this interval */
+	uint64		num_values;		/* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+	{
+		"all ones", "1111111111",
+		10, 100000000
+	},
+	{
+		"alternating bits", "0101010101",
+		10, 100000000
+	},
+	{
+		"clusters of ten", "1111111111",
+		10000, 10000000
+	},
+	{
+		"clusters of hundred",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"clusters of thousand",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"one-every-64k", "1",
+		65536, 10000000
+	},
+	{
+		"sparse", "100000000000000000000000000000001",
+		10000000, 10000000
+	},
+	{
+		"single values, distance > 2^32", "1",
+		10000000000L, 1000000
+	},
+	{
+		"clusters, distance > 2^32", "10101010",
+		10000000000L, 10000000
+	},
+	{
+		"clusters, distance > 2^60", "10101010",
+		2000000000000000000L, 23 /* can't be much higher than this, or we overflow uint64 */
+	}
+};
+
+static void test_pattern(const test_spec *spec);
+static void test_empty(void);
+static void test_single_value(uint64 value);
+static void check_with_filler(IntegerSet *intset, uint64 x, uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_huge_distances(void);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_integerset(PG_FUNCTION_ARGS)
+{
+	test_huge_distances();
+
+	test_empty();
+
+	test_single_value(0);
+	test_single_value(1);
+	test_single_value(PG_UINT64_MAX - 1);
+	test_single_value(PG_UINT64_MAX);
+
+	/* Same tests, but with some "filler" values, so that the B-tree gets created */
+	test_single_value_and_filler(0, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000000);
+	test_single_value_and_filler(PG_UINT64_MAX - 1, 1000, 2000);
+	test_single_value_and_filler(PG_UINT64_MAX, 1000, 2000);
+
+	test_huge_distances();
+
+	for (int i = 0; i < lengthof(test_specs); i++)
+		test_pattern(&test_specs[i]);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+	IntegerSet *intset;
+	MemoryContext test_cxt;
+	MemoryContext old_cxt;
+	TimestampTz starttime;
+	TimestampTz endtime;
+	uint64		n;
+	uint64		last_int;
+	int			patternlen;
+	uint64	   *pattern_values;
+	uint64		pattern_num_values;
+
+	elog(NOTICE, "testing intset with pattern \"%s\"", spec->test_name);
+	if (intset_test_stats)
+		fprintf(stderr, "-----\ntesting intset with pattern \"%s\"\n", spec->test_name);
+
+	/* Pre-process the pattern, creating an array of integers from it. */
+	patternlen = strlen(spec->pattern_str);
+	pattern_values = palloc(patternlen * sizeof(uint64));
+	pattern_num_values = 0;
+	for (int i = 0; i < patternlen; i++)
+	{
+		if (spec->pattern_str[i] == '1')
+			pattern_values[pattern_num_values++] = i;
+	}
+
+	/*
+	 * Allocate the integer set.
+	 *
+	 * Allocate it in a separate memory context, so that we can print its
+	 * memory usage easily.  (intset_create() creates a memory context of its
+	 * own, too, but we don't have direct access to it, so we cannot call
+	 * MemoryContextStats() on it directly).
+	 */
+	test_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									 "intset test",
+									 ALLOCSET_SMALL_SIZES);
+	MemoryContextSetIdentifier(test_cxt, spec->test_name);
+	old_cxt = MemoryContextSwitchTo(test_cxt);
+	intset = intset_create();
+	MemoryContextSwitchTo(old_cxt);
+
+	/*
+	 * Add values to the set.
+	 */
+	starttime = GetCurrentTimestamp();
+
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		uint64		x = 0;
+
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			x = last_int + pattern_values[i];
+
+			intset_add_member(intset, x);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+
+	endtime = GetCurrentTimestamp();
+
+	if (intset_test_stats)
+		fprintf(stderr, "added %lu values in %lu ms\n",
+				spec->num_values, (endtime - starttime) / 1000);
+
+	/*
+	 * Print stats on the amount of memory used.
+	 *
+	 * We print the usage reported by intset_memory_usage(), as well as the
+	 * stats from the memory context. They should be in the same ballpark,
+	 * but it's hard to automate testing that, so if you're making changes
+	 * to the implementation, just observe that manually.
+	 */
+	if (intset_test_stats)
+	{
+		uint64		mem_usage;
+
+		/*
+		 * Also print memory usage as reported by intset_memory_usage().
+		 * It should be in the same ballpark as the usage reported by
+		 * MemoryContextStats().
+		 */
+		mem_usage = intset_memory_usage(intset);
+		fprintf(stderr, "intset_memory_usage() reported %lu (%0.2f bytes / integer)\n",
+				mem_usage, (double) mem_usage / spec->num_values);
+
+		MemoryContextStats(test_cxt);
+	}
+
+	/* Check that intset_get_num_entries works */
+	n = intset_num_entries(intset);
+	if (n != spec->num_values)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", n, spec->num_values);
+
+	/*
+	 * Test random-access probes with intset_is_member()
+	 */
+	starttime = GetCurrentTimestamp();
+
+	for (n = 0; n < 1000000; n++)
+	{
+		bool		b;
+		bool		expected;
+		uint64		x;
+
+		/*
+		 * Pick next value to probe at random.  We limit the probes to the
+		 * last integer that we added to the set, plus an arbitrary constant
+		 * (1000).  There's no point in probing the whole 0 - 2^64 range, if
+		 * only a small part of the integer space is used.  We would very
+		 * rarely hit hit values that are actually in the set.
+		 */
+		x = (pg_lrand48() << 31) | pg_lrand48();
+		x = x % (last_int + 1000);
+
+		/* Do we expect this value to be present in the set? */
+		if (x >= last_int)
+			expected = false;
+		else
+		{
+			uint64		idx = x % spec->spacing;
+
+			if (idx >= patternlen)
+				expected = false;
+			else if (spec->pattern_str[idx] == '1')
+				expected = true;
+			else
+				expected = false;
+		}
+
+		/* Is it present according to intset_is_member() ? */
+		b = intset_is_member(intset, x);
+
+		if (b != expected)
+			elog(ERROR, "mismatch at %lu: %d vs %d", x, b, expected);
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "probed %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	/*
+	 * Test iterator
+	 */
+	starttime = GetCurrentTimestamp();
+
+	intset_begin_iterate(intset);
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			uint64		expected = last_int + pattern_values[i];
+			uint64		x;
+			bool		found;
+
+			x = intset_iterate_next(intset, &found);
+			if (!found)
+				break;
+
+			if (x != expected)
+				elog(ERROR, "iterate returned wrong value; got %lu, expected %lu", x, expected);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "iterated %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	if (n < spec->num_values)
+		elog(ERROR, "iterator stopped short after %lu entries, expected %lu", n, spec->num_values);
+	if (n > spec->num_values)
+		elog(ERROR, "iterator returned %lu entries, %lu was expected", n, spec->num_values);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with a set containing a single integer.
+ */
+static void
+test_single_value(uint64 value)
+{
+	IntegerSet *intset;
+	uint64		x;
+	uint64		num_entries;
+	bool		found;
+
+	elog(NOTICE, "testing intset with single value %lu", value);
+
+	/* Create the set. */
+	intset = intset_create();
+	intset_add_member(intset, value);
+
+	/* Test intset_get_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != 1)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, 1L);
+
+	/*
+	 * Test intset_is_member() at various special values, like 0 and and maximum
+	 * possible 64-bit integer, as well as the value itself.
+	 */
+	if (intset_is_member(intset, 0) != (value == 0))
+		elog(ERROR, "intset_is_member failed for 0");
+	if (intset_is_member(intset, 1) != (value == 1))
+		elog(ERROR, "intset_is_member failed for 1");
+	if (intset_is_member(intset, PG_UINT64_MAX) != (value == PG_UINT64_MAX))
+		elog(ERROR, "intset_is_member failed for PG_UINT64_MAX");
+	if (intset_is_member(intset, value) != true)
+		elog(ERROR, "intset_is_member failed for the tested value");
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (!found || x != value)
+		elog(ERROR, "intset_iterate_next failed for %lu", x);
+
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with an integer set that contains:
+ *
+ * - a given single 'value', and
+ * - all integers between 'filler_min' and 'filler_max'.
+ *
+ * This exercises different codepaths than testing just with a single value,
+ * because the implementation buffers newly-added values.  If we add just
+ * single value to the set, we won't test the internal B-tree code at all,
+ * just the code that deals with the buffer.
+ */
+static void
+test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	IntegerSet *intset;
+	uint64		x;
+	bool		found;
+	uint64	   *iter_expected;
+	uint64		n = 0;
+	uint64		num_entries = 0;
+	uint64		mem_usage;
+
+	elog(NOTICE, "testing intset with value %lu, and all between %lu and %lu",
+		 value, filler_min, filler_max);
+
+	intset = intset_create();
+
+	iter_expected = palloc(sizeof(uint64) * (filler_max - filler_min + 1));
+	if (value < filler_min)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	for (x = filler_min; x < filler_max; x++)
+	{
+		intset_add_member(intset, x);
+		iter_expected[n++] = x;
+	}
+
+	if (value >= filler_max)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	/* Test intset_get_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != n)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, n);
+
+	/*
+	 * Test intset_is_member() at various spots, at and around the values that we
+	 * expect to be set, as well as 0 and the maximum possible value.
+	 */
+	check_with_filler(intset, 0,                 value, filler_min, filler_max);
+	check_with_filler(intset, 1,                 value, filler_min, filler_max);
+	check_with_filler(intset, filler_min - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_min,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_min + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, value - 1,         value, filler_min, filler_max);
+	check_with_filler(intset, value,             value, filler_min, filler_max);
+	check_with_filler(intset, value + 1,         value, filler_min, filler_max);
+	check_with_filler(intset, filler_max - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_max,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_max + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX - 1, value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX,     value, filler_min, filler_max);
+
+	intset_begin_iterate(intset);
+	for (uint64 i = 0; i < n; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != iter_expected[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	mem_usage = intset_memory_usage(intset);
+	if (mem_usage < 5000 || mem_usage > 500000000)
+		elog(ERROR, "intset_memory_usage() reported suspicous value: %lu", mem_usage);
+
+	intset_free(intset);
+}
+
+/*
+ * Helper function for test_single_value_and_filler.
+ *
+ * Calls intset_is_member() for value 'x', and checks that the result is what
+ * we expect.
+ */
+static void
+check_with_filler(IntegerSet *intset, uint64 x,
+				  uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	bool		expected;
+	bool		actual;
+
+	expected = (x == value || (filler_min <= x && x < filler_max));
+
+	actual = intset_is_member(intset, x);
+
+	if (actual != expected)
+		elog(ERROR, "intset_is_member failed for %lu", x);
+}
+
+/*
+ * Test empty set
+ */
+static void
+test_empty(void)
+{
+	IntegerSet *intset;
+	bool		found = true;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with empty set");
+
+	intset = intset_create();
+
+	/* Test intset_is_member() */
+	if (intset_is_member(intset, 0) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, 1) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, PG_UINT64_MAX) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+
+	/* Test iterator */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next on empty set returned a value (%lu)", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with integers that are more than 2^60 apart.
+ *
+ * The Simple-8b encoding used by the set implementation can only encode
+ * values up to 2^60.  That makes large differences like this interesting
+ * to test.
+ */
+static void
+test_huge_distances(void)
+{
+	IntegerSet *intset;
+	uint64		values[1000];
+	int			num_values = 0;
+	uint64		val = 0;
+	bool		found;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with distances > 2^60 between values");
+
+	val = 0;
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1*/
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1*/
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	/* we're now very close to 2^64, so can't add large values anymore */
+
+	intset = intset_create();
+
+	/*
+	 * Add many more values to the end, to make sure that all the above
+	 * values get flushed and packed into the tree structure.
+	 */
+	while (num_values < 1000)
+	{
+		val += pg_lrand48();
+		values[num_values++] = val;
+	}
+
+	/* Add these numbers to the set */
+	for (int i = 0; i < num_values; i++)
+		intset_add_member(intset, values[i]);
+
+	/*
+	 * Test iterset_is_member() around each of these values
+	 */
+	for (int i = 0; i < num_values; i++)
+	{
+		uint64		x = values[i];
+		bool		result;
+
+		if (x > 0)
+		{
+			result = intset_is_member(intset, x - 1);
+			if (result != false)
+				elog(ERROR, "intset_is_member failed for %lu", x - 1);
+		}
+
+		result = intset_is_member(intset, x);
+		if (result != true)
+			elog(ERROR, "intset_is_member failed for %lu", x);
+
+		result = intset_is_member(intset, x + 1);
+		if (result != false)
+			elog(ERROR, "intset_is_member failed for %lu", x + 1);
+	}
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	for (int i = 0; i < num_values; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != values[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+}
diff --git a/src/test/modules/test_integerset/test_integerset.control b/src/test/modules/test_integerset/test_integerset.control
new file mode 100644
index 00000000000..7d20c2d7b88
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for integerset'
+default_version = '1.0'
+module_pathname = '$libdir/test_integerset'
+relocatable = true
-- 
2.20.1

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: Heikki Linnakangas (#7)

Re: Sparse bit set data structure

Hi!

Great job!

20 марта 2019 г., в 9:10, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):

Please review, if you have a chance.

- Heikki
<0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patch>

I'm looking into the code and have few questions:
1. I'm not sure it is the best interface for iteration
uint64
intset_iterate_next(IntegerSet *intset, bool *found)

we will use it like

while
{
bool found;
BlockNumber x = (BlockNumber) intset_iterate_next(is, &found);
if (!found)
break;
// do stuff
}

we could use it like

BlockNumber x;
while(intset_iterate_next(is, &x))
{
// do stuff
}

But that's not a big difference.

2.
* Limitations
* -----------
*
* - Values must be added in order. (Random insertions would require
* splitting nodes, which hasn't been implemented.)
*
* - Values cannot be added while iteration is in progress.

You check for violation of these limitation in code, but there is not tests for this checks.
Should we add these tests?

Best regards, Andrey Borodin.

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Heikki Linnakangas (#7)

Re: Sparse bit set data structure

On Wed, Mar 20, 2019 at 2:10 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 14/03/2019 17:37, Julien Rouhaud wrote:
+       if (newitem <= sbs->last_item)
+           elog(ERROR, "cannot insert to sparse bitset out of order");
Is there any reason to disallow inserting duplicates? AFAICT nothing
prevents that in the current code. If that's intended, that probably
should be documented.
Yeah, we could easily allow setting the last item again. It would be a
no-op, though, which doesn't seem very useful. It would be useful to
lift the limitation that the values have to be added in ascending order,
but current users that we're thinking of don't need it. Let's add that
later, if the need arises.

Or did you mean that the structure would be a "bag" rather than a "set",
so that it would keep the duplicates? I don't think that would be good.
I guess the vacuum code that this will be used in wouldn't care either
way, but "set" seems like a more clean concept.

Yes, I was thinking about "bag". For a set, allowing inserting
duplicates is indeed a no-op and should be pretty cheap with almost no
extra code for that. Maybe VACUUM can't have duplicate, but is it
that unlikely that other would need it? I'm wondering if just
requiring to merge multiple such structure isn't going to be needed
soon for instance. If that's not wanted, I'm still thinking that a
less ambiguous error should be raised.

I'm now pretty satisfied with this. Barring objections, I'll commit this
in the next few days. Please review, if you have a chance.

You're defining SIMPLE8B_MAX_VALUE but never use it. Maybe you wanted
to add an assert / explicit test in intset_add_member()?

/*
* We buffer insertions in a simple array, before packing and inserting them
* into the B-tree. MAX_BUFFERED_VALUES sets the size of the buffer. The
* encoder assumes that it is large enough, that we can always fill a leaf
* item with buffered new items. In other words, MAX_BUFFERED_VALUES must be
* larger than MAX_VALUES_PER_LEAF_ITEM.
*/
#define MAX_BUFFERED_VALUES (MAX_VALUES_PER_LEAF_ITEM * 2)

The *2 is not immediately obvious here (at least it wasn't to me),
maybe explaining intset_flush_buffered_values() main loop rationale
here could be worthwhile.

Otherwise, everything looks just fine!

#10

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Julien Rouhaud (#9)

Re: Sparse bit set data structure

On Wed, Mar 20, 2019 at 5:20 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Mar 20, 2019 at 2:10 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm now pretty satisfied with this. Barring objections, I'll commit this
in the next few days. Please review, if you have a chance.

You're defining SIMPLE8B_MAX_VALUE but never use it. Maybe you wanted
to add an assert / explicit test in intset_add_member()?

/*
* We buffer insertions in a simple array, before packing and inserting them
* into the B-tree. MAX_BUFFERED_VALUES sets the size of the buffer. The
* encoder assumes that it is large enough, that we can always fill a leaf
* item with buffered new items. In other words, MAX_BUFFERED_VALUES must be
* larger than MAX_VALUES_PER_LEAF_ITEM.
*/
#define MAX_BUFFERED_VALUES (MAX_VALUES_PER_LEAF_ITEM * 2)

The *2 is not immediately obvious here (at least it wasn't to me),
maybe explaining intset_flush_buffered_values() main loop rationale
here could be worthwhile.

Otherwise, everything looks just fine!

I forgot to mention a minor gripe about the intset_binsrch_uint64 /
intset_binsrch_leaf function, which are 99% duplicates. But I don't
know if fixing that (something like passing the array as a void * and
passing a getter function?) is worth the trouble.

#11

Heikki Linnakangas

hlinnaka@iki.fi

almost 7 years ago

In reply to: Julien Rouhaud (#10)

Re: Sparse bit set data structure

Committed, thanks for the review!

I made one last-minute change: Instead of allocating a memory context in
intset_create(), it is left to the caller. The GiST vacuum code created
two integer sets, and it made more sense for it to use the same context
for both. Also, the VACUUM tid list patch would like to use a memory
context with very large defaults, so that malloc() will decide to
mmap() it, so that the memory can be returned to the OS. Because the
desired memory allocation varies between callers, so it's best to leave
it to the caller.

On 20/03/2019 19:50, Julien Rouhaud wrote:

I forgot to mention a minor gripe about the intset_binsrch_uint64 /
intset_binsrch_leaf function, which are 99% duplicates. But I don't
know if fixing that (something like passing the array as a void * and
passing a getter function?) is worth the trouble.

Yeah. I felt that it's simpler to have two almost-identical functions,
even though it's a bit repetitive, than try to have one function serve
both uses. The 'nextkey' argument is always passed as false to
intset_binsrch_leaf() function, but I kept it anyway, to keep the
functions identical. The compiler will surely optimize it away, so it
makes no difference to performance.

On 20/03/2019 04:48, Andrey Borodin wrote:

1. I'm not sure it is the best interface for iteration

uint64
intset_iterate_next(IntegerSet *intset, bool *found)

we will use it like

while
{
bool found;
BlockNumber x = (BlockNumber) intset_iterate_next(is, &found);
if (!found)
break;
// do stuff
}

we could use it like

BlockNumber x;
while(intset_iterate_next(is, &x))
{
// do stuff
}

But that's not a big difference.

Agreed, I changed it that way.

- Heikki

#12

Adrien NAYRAT

adrien.nayrat@anayrat.info

almost 7 years ago

In reply to: Heikki Linnakangas (#11)

1 attachment(s)

Re: Sparse bit set data structure

Hello,

According to the draw and simple8b_mode struct comment, it seems there
is a typo:

* 20-bit integer 20-bit integer 20-bit integer
* 1101 00000000000000010010 01111010000100100000 00000000000000010100
* ^
* selector
*
* The selector 1101 is 13 in decimal. From the modes table below, we see
* that it means that the codeword encodes three 12-bit integers. In decimal,
* those integers are 18, 500000 and 20. Because we encode deltas rather than
* absolute values, the actual values that they represent are 18, 500018 and
* 500038.

[...]

{20, 3}, /* mode 13: three 20-bit integers */

The comment should be "the codeword encodes three *20-bit* integers" ?

Patch attached.

Regards,

Attachments:

fix-typo.patchtext/x-patch; name=fix-typo.patchDownload

diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
index 28b4a38609..9984fd55e8 100644
--- a/src/backend/lib/integerset.c
+++ b/src/backend/lib/integerset.c
@@ -805,7 +805,7 @@ intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
  * selector
  *
  * The selector 1101 is 13 in decimal.  From the modes table below, we see
- * that it means that the codeword encodes three 12-bit integers.  In decimal,
+ * that it means that the codeword encodes three 20-bit integers.  In decimal,
  * those integers are 18, 500000 and 20.  Because we encode deltas rather than
  * absolute values, the actual values that they represent are 18, 500018 and
  * 500038.

#13

Bruce Momjian

bruce@momjian.us

almost 7 years ago

In reply to: Adrien NAYRAT (#12)

Re: Sparse bit set data structure

Uh, should this be applied?

---------------------------------------------------------------------------

On Thu, Mar 28, 2019 at 03:46:03PM +0100, Adrien NAYRAT wrote:

Hello,

According to the draw and simple8b_mode struct comment, it seems there is a
typo:

* 20-bit integer 20-bit integer 20-bit integer
* 1101 00000000000000010010 01111010000100100000 00000000000000010100
* ^
* selector
*
* The selector 1101 is 13 in decimal. From the modes table below, we see
* that it means that the codeword encodes three 12-bit integers. In decimal,
* those integers are 18, 500000 and 20. Because we encode deltas rather than
* absolute values, the actual values that they represent are 18, 500018 and
* 500038.

[...]

{20, 3}, /* mode 13: three 20-bit integers */

The comment should be "the codeword encodes three *20-bit* integers" ?

Patch attached.

Regards,

diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
index 28b4a38609..9984fd55e8 100644
--- a/src/backend/lib/integerset.c
+++ b/src/backend/lib/integerset.c
@@ -805,7 +805,7 @@ intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
* selector
*
* The selector 1101 is 13 in decimal.  From the modes table below, we see
- * that it means that the codeword encodes three 12-bit integers.  In decimal,
+ * that it means that the codeword encodes three 20-bit integers.  In decimal,
* those integers are 18, 500000 and 20.  Because we encode deltas rather than
* absolute values, the actual values that they represent are 18, 500018 and
* 500038.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#14

Alvaro Herrera

alvherre@2ndquadrant.com

almost 7 years ago

In reply to: Bruce Momjian (#13)

Re: Sparse bit set data structure

On 2019-Apr-08, Bruce Momjian wrote:

Uh, should this be applied?

Yes, it's a pretty obvious typo methinks.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Heikki Linnakangas

hlinnaka@iki.fi

almost 7 years ago

In reply to: Alvaro Herrera (#14)

Re: Sparse bit set data structure

On 09/04/2019 03:45, Alvaro Herrera wrote:

On 2019-Apr-08, Bruce Momjian wrote:

Uh, should this be applied?

Yes, it's a pretty obvious typo methinks.

Pushed, thanks, and sorry for the delay.

- Heikki