contrib/cache_scan (Re: What's needed for cache-only table scan?)
Hello,
The attached patch is what we discussed just before the commit-fest:Nov.
It implements an alternative way to scan a particular table using on-memory
cache instead of the usual heap access method. Unlike buffer cache, this
mechanism caches a limited number of columns on the memory, so memory
consumption per tuple is much smaller than the regular heap access method,
thus it allows much larger number of tuples on the memory.
I'd like to extend this idea to implement a feature to cache data according to
column-oriented data structure to utilize parallel calculation processors like
CPU's SIMD operations or simple GPU cores. (Probably, it makes sense to
evaluate multiple records with a single vector instruction if contents of
a particular column is put as a large array.)
However, this patch still keeps all the tuples in row-oriented data format,
because row <=> column translation makes this patch bigger than the
current form (about 2KL), and GPU integration needs to link proprietary
library (cuda or opencl) thus I thought it is not preferable for the upstream
code.
Also note that this patch needs part-1 ~ part-3 patches of CustomScan
APIs as prerequisites because it is implemented on top of the APIs.
One thing I have to apologize is, lack of documentation and source code
comments around the contrib/ code. Please give me a couple of days to
clean-up the code.
Aside from the extension code, I put two enhancement on the core code
as follows. I'd like to have a discussion about adequacy of these enhancement.
The first enhancement is a hook on heap_page_prune() to synchronize
internal state of extension with changes of heap image on the disk.
It is not avoidable to hold garbage, increasing time by time, on the cache,
thus needs to clean up as vacuum process doing. The best timing to do
is when dead tuples are reclaimed because it is certain nobody will
reference the tuples any more.
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
bool marked[MaxHeapTuplesPerPage + 1];
} PruneState;
+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
/* Local functions */
static int heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, Transacti
onId OldestXmin,
* and update FSM with the remaining space.
*/
+ /*
+ * This callback allows extensions to synchronize their own status with
+ * heap image on the disk, when this buffer page is vacuumed.
+ */
+ if (heap_page_prune_hook)
+ (*heap_page_prune_hook)(relation,
+ buffer,
+ ndeleted,
+ OldestXmin,
+ prstate.latestRemovedXid);
return ndeleted;
}
The second enhancement makes SetHintBits() accepts InvalidBuffer to
ignore all the jobs. We need to check visibility of cached tuples when
custom-scan node scans cached table instead of the heap.
Even though we can use MVCC snapshot to check tuple's visibility,
it may internally set hint bit of tuples thus we always needs to give
a valid buffer pointer to HeapTupleSatisfiesVisibility(). Unfortunately,
it kills all the benefit of table cache if it takes to load the heap buffer
being associated with the cached tuple.
So, I'd like to have a special case handling on the SetHintBits() for
dry-run when InvalidBuffer is given.
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid,
Snapshot snapshot);
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
+ *
+ * In case when the supplied HeapTuple is not associated with a particular
+ * buffer, it just returns without any jobs. It may happen when an extension
+ * caches tuple with their own way.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
+ if (BufferIsInvalid(buffer))
+ return;
+
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
Thanks,
2013/11/13 Kohei KaiGai <kaigai@kaigai.gr.jp>:
2013/11/12 Tom Lane <tgl@sss.pgh.pa.us>:
Kohei KaiGai <kaigai@kaigai.gr.jp> writes:
So, are you thinking it is a feasible approach to focus on custom-scan
APIs during the upcoming CF3, then table-caching feature as use-case
of this APIs on CF4?Sure. If you work on this extension after CF3, and it reveals that the
custom scan stuff needs some adjustments, there would be time to do that
in CF4. The policy about what can be submitted in CF4 is that we don't
want new major features that no one has seen before, not that you can't
make fixes to previously submitted stuff. Something like a new hook
in vacuum wouldn't be a "major feature", anyway.Thanks for this clarification.
3 days are too short to write a patch, however, 2 month may be sufficient
to develop a feature on top of the scheme being discussed in the previous
comitfest.Best regards,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
Attachments:
pgsql-v9.4-custom-scan.part-4.v5.patchapplication/octet-stream; name=pgsql-v9.4-custom-scan.part-4.v5.patchDownload
contrib/cache_scan/Makefile | 19 +
contrib/cache_scan/cache_scan--1.0.sql | 26 +
contrib/cache_scan/cache_scan--unpackaged--1.0.sql | 2 +
contrib/cache_scan/cache_scan.control | 5 +
contrib/cache_scan/cache_scan.h | 83 ++
contrib/cache_scan/ccache.c | 1395 ++++++++++++++++++++
contrib/cache_scan/cscan.c | 668 ++++++++++
src/backend/access/heap/pruneheap.c | 13 +
src/backend/utils/time/tqual.c | 7 +
src/include/access/heapam.h | 7 +
10 files changed, 2225 insertions(+)
diff --git a/contrib/cache_scan/Makefile b/contrib/cache_scan/Makefile
new file mode 100644
index 0000000..4e68b68
--- /dev/null
+++ b/contrib/cache_scan/Makefile
@@ -0,0 +1,19 @@
+# contrib/dbcache/Makefile
+
+MODULE_big = cache_scan
+OBJS = cscan.o ccache.o
+
+EXTENSION = cache_scan
+DATA = cache_scan--1.0.sql cache_scan--unpackaged--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/cache_scan
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
diff --git a/contrib/cache_scan/cache_scan--1.0.sql b/contrib/cache_scan/cache_scan--1.0.sql
new file mode 100644
index 0000000..43567e2
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--1.0.sql
@@ -0,0 +1,26 @@
+CREATE FUNCTION public.cache_scan_invalidation_trigger()
+RETURNS trigger
+AS 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE STRICT;
+
+CREATE TYPE public.__cache_scan_debuginfo AS
+(
+ tableoid oid,
+ status text,
+ chunk text,
+ upper text,
+ l_depth int4,
+ l_chunk text,
+ r_depth int4,
+ r_chunk text,
+ ntuples int4,
+ usage int4,
+ min_ctid tid,
+ max_ctid tid
+);
+CREATE FUNCTION public.cache_scan_debuginfo()
+ RETURNS SETOF public.__cache_scan_debuginfo
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT;
+
+
diff --git a/contrib/cache_scan/cache_scan--unpackaged--1.0.sql b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
new file mode 100644
index 0000000..04b53ef
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
@@ -0,0 +1,2 @@
+DROP FUNCTION public.cache_scan_invalidation_trigger() CASCADE;
+
diff --git a/contrib/cache_scan/cache_scan.control b/contrib/cache_scan/cache_scan.control
new file mode 100644
index 0000000..77946da
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.control
@@ -0,0 +1,5 @@
+# cache_scan extension
+comment = 'custom scan provider for cache-only scan'
+default_version = '1.0'
+module_pathname = '$libdir/cache_scan'
+relocatable = false
diff --git a/contrib/cache_scan/cache_scan.h b/contrib/cache_scan/cache_scan.h
new file mode 100644
index 0000000..79b9f1e
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.h
@@ -0,0 +1,83 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cache_scan.h
+ *
+ * Definitions for the cache_scan extension
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#ifndef CACHE_SCAN_H
+#define CACHE_SCAN_H
+#include "access/htup_details.h"
+#include "lib/ilist.h"
+#include "nodes/bitmapset.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+
+typedef struct ccache_chunk {
+ struct ccache_chunk *upper; /* link to the upper node */
+ struct ccache_chunk *right; /* link to the greaternode, if exist */
+ struct ccache_chunk *left; /* link to the less node, if exist */
+ int r_depth; /* max depth in right branch */
+ int l_depth; /* max depth in left branch */
+ uint32 ntups; /* number of tuples being cached */
+ uint32 usage; /* usage counter of this chunk */
+ HeapTuple tuples[FLEXIBLE_ARRAY_MEMBER];
+} ccache_chunk;
+
+#define CCACHE_STATUS_INITIALIZED 1
+#define CCACHE_STATUS_IN_PROGRESS 2
+#define CCACHE_STATUS_CONSTRUCTED 3
+
+typedef struct {
+ LWLockId lock; /* used to protect ttree links */
+ volatile int refcnt;
+ int status;
+
+ dlist_node hash_chain; /* linked to ccache_hash->slots[] */
+ dlist_node lru_chain; /* linked to ccache_hash->lru_list */
+
+ Oid tableoid;
+ ccache_chunk *root_chunk;
+ Bitmapset attrs_used; /* !Bitmapset is variable length! */
+} ccache_head;
+
+extern int ccache_max_attribute_number(void);
+extern ccache_head *cs_get_ccache(Oid tableoid, Bitmapset *attrs_used,
+ bool create_on_demand);
+extern void cs_put_ccache(ccache_head *ccache);
+
+extern bool ccache_insert_tuple(ccache_head *ccache,
+ Relation rel, HeapTuple tuple);
+extern bool ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup);
+
+extern void ccache_vacuum_page(ccache_head *ccache, Buffer buffer);
+
+extern HeapTuple ccache_find_tuple(ccache_chunk *cchunk,
+ ItemPointer ctid,
+ ScanDirection direction);
+extern void ccache_init(void);
+
+extern Datum cache_scan_invalidation_trigger(PG_FUNCTION_ARGS);
+extern Datum cache_scan_debuginfo(PG_FUNCTION_ARGS);
+
+extern void _PG_init(void);
+
+
+#define CS_DEBUG(fmt,...) \
+ elog(INFO, "%s:%d " fmt, __FUNCTION__, __LINE__, __VA_ARGS__)
+
+static inline const char *
+ctid_to_cstring(ItemPointer ctid)
+{
+ char buf[1024];
+
+ snprintf(buf, sizeof(buf), "(%u,%u)",
+ ctid->ip_blkid.bi_hi << 16 | ctid->ip_blkid.bi_lo,
+ ctid->ip_posid);
+ return pstrdup(buf);
+}
+
+#endif /* CACHE_SCAN_H */
diff --git a/contrib/cache_scan/ccache.c b/contrib/cache_scan/ccache.c
new file mode 100644
index 0000000..bf70266
--- /dev/null
+++ b/contrib/cache_scan/ccache.c
@@ -0,0 +1,1395 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/ccache.c
+ *
+ * Routines for columns-culled cache implementation
+ *
+ * Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/heapam.h"
+#include "access/sysattr.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "storage/ipc.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+#include "cache_scan.h"
+
+/*
+ * Hash table to manage all the ccache_head
+ */
+typedef struct {
+ slock_t lock; /* lock of the hash table */
+ dlist_head lru_list; /* list of recently used cache */
+ dlist_head free_list; /* list of free ccache_head */
+ volatile int lwlocks_usage;
+ LWLockId *lwlocks;
+ dlist_head *slots;
+} ccache_hash;
+
+/*
+ * Data structure to manage blocks on the shared memory segment.
+ * This extension acquires (shmseg_blocksize) x (shmseg_num_blocks) bytes of
+ * shared memory, then it shall be split into the fixed-length memory blocks.
+ * All the memory allocation and relase are done by block, to avoid memory
+ * fragmentation that eventually makes implementation complicated.
+ *
+ * The shmseg_head has a spinlock and global free_list to link free blocks.
+ * Its blocks[] array contains shmseg_block structures that points a particular
+ * address of the associated memory block.
+ * The shmseg_block being chained in the free_list of shmseg_head are available
+ * to allocate. Elsewhere, this block is already allocated on somewhere.
+ */
+typedef struct {
+ dlist_node chain;
+ Size address;
+} shmseg_block;
+
+typedef struct {
+ slock_t lock;
+ dlist_head free_list;
+ Size base_address;
+ shmseg_block blocks[FLEXIBLE_ARRAY_MEMBER];
+} shmseg_head;
+
+/*
+ * ccache_entry is used to track ccache_head being acquired by this backend.
+ */
+typedef struct {
+ dlist_node chain;
+ ResourceOwner owner;
+ ccache_head *ccache;
+} ccache_entry;
+
+static dlist_head ccache_local_list;
+static dlist_head ccache_free_list;
+
+/* Static variables */
+static shmem_startup_hook_type shmem_startup_next = NULL;
+
+static ccache_hash *cs_ccache_hash = NULL;
+static shmseg_head *cs_shmseg_head = NULL;
+
+/* GUC variables */
+static int ccache_hash_size;
+static int shmseg_blocksize;
+static int shmseg_num_blocks;
+static int max_cached_attnum;
+
+/* Static functions */
+static void *cs_alloc_shmblock(void);
+static void cs_free_shmblock(void *address);
+
+int
+ccache_max_attribute_number(void)
+{
+ return (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+}
+
+/*
+ * ccache_on_resource_release
+ *
+ * It is a callback to put ccache_head being acquired locally, to keep
+ * consistency of reference counter.
+ */
+static void
+ccache_on_resource_release(ResourceReleasePhase phase,
+ bool isCommit,
+ bool isTopLevel,
+ void *arg)
+{
+ dlist_mutable_iter iter;
+
+ if (phase != RESOURCE_RELEASE_AFTER_LOCKS)
+ return;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+
+ if (isCommit)
+ elog(WARNING, "cache reference leak (tableoid=%u, refcnt=%d)",
+ entry->ccache->tableoid, entry->ccache->refcnt);
+ cs_put_ccache(entry->ccache);
+
+ entry->ccache = NULL;
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+}
+
+static ccache_chunk *
+ccache_alloc_chunk(ccache_head *ccache, ccache_chunk *upper)
+{
+ ccache_chunk *cchunk = cs_alloc_shmblock();
+
+ if (cchunk)
+ {
+ cchunk->upper = upper;
+ cchunk->right = NULL;
+ cchunk->left = NULL;
+ cchunk->r_depth = 0;
+ cchunk->l_depth = 0;
+ cchunk->ntups = 0;
+ cchunk->usage = shmseg_blocksize;
+ }
+ return cchunk;
+}
+
+/*
+ * ccache_rebalance_tree
+ *
+ * It keeps the balance of ccache tree if the supplied chunk has
+ * unbalanced subtrees.
+ */
+#define MAX_DEPTH(chunk) Max((chunk)->l_depth, (chunk)->r_depth)
+
+static void
+ccache_rebalance_tree(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ Assert(cchunk->upper != NULL
+ ? (cchunk->upper->left == cchunk || cchunk->upper->right == cchunk)
+ : (ccache->root_chunk == cchunk));
+
+ if (cchunk->l_depth + 1 < cchunk->r_depth)
+ {
+ /* anticlockwise rotation */
+ ccache_chunk *rchunk = cchunk->right;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->right = rchunk->left;
+ cchunk->r_depth = MAX_DEPTH(cchunk->right) + 1;
+ cchunk->upper = rchunk;
+
+ rchunk->left = cchunk;
+ rchunk->l_depth = MAX_DEPTH(rchunk->left) + 1;
+ rchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = rchunk;
+ else if (upper->left == cchunk)
+ {
+ upper->left = rchunk;
+ upper->l_depth = MAX_DEPTH(rchunk) + 1;
+ }
+ else
+ {
+ upper->right = rchunk;
+ upper->r_depth = MAX_DEPTH(rchunk) + 1;
+ }
+ }
+ else if (cchunk->l_depth > cchunk->r_depth + 1)
+ {
+ /* clockwise rotation */
+ ccache_chunk *lchunk = cchunk->left;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->left = lchunk->right;
+ cchunk->l_depth = MAX_DEPTH(cchunk->left) + 1;
+ cchunk->upper = lchunk;
+
+ lchunk->right = cchunk;
+ lchunk->l_depth = MAX_DEPTH(lchunk->right) + 1;
+ lchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = lchunk;
+ else if (upper->right == cchunk)
+ {
+ upper->right = lchunk;
+ upper->r_depth = MAX_DEPTH(lchunk) + 1;
+ }
+ else
+ {
+ upper->left = lchunk;
+ upper->l_depth = MAX_DEPTH(lchunk) + 1;
+ }
+ }
+}
+
+/*
+ * ccache_insert_tuple
+ *
+ * It inserts the supplied tuple, but uncached columns are dropped off,
+ * onto the ccache_head. If no space is left, it expands the t-tree
+ * structure with a chunk newly allocated. If no shared memory space was
+ * left, it returns false.
+ */
+#define cchunk_freespace(cchunk) \
+ ((cchunk)->usage - offsetof(ccache_chunk, tuples[(cchunk)->ntups + 1]))
+
+static void
+do_insert_tuple(ccache_head *ccache, ccache_chunk *cchunk, HeapTuple tuple)
+{
+ HeapTuple newtup;
+ ItemPointer ctid = &tuple->t_self;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+ int i, required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
+
+ Assert(required <= cchunk_freespace(cchunk));
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+
+ if (i_min < cchunk->ntups)
+ {
+ HeapTuple movtup = cchunk->tuples[i_min];
+ Size movlen = HEAPTUPLESIZE + MAXALIGN(movtup->t_len);
+ char *destaddr = (char *)movtup + movlen - required;
+
+ Assert(ItemPointerCompare(&tuple->t_self, &movtup->t_self) < 0);
+
+ memmove((char *)cchunk + cchunk->usage - required,
+ (char *)cchunk + cchunk->usage,
+ ((Size)movtup + movlen) - ((Size)cchunk + cchunk->usage));
+ for (i=cchunk->ntups; i > i_min; i--)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[i-1] - required);
+ cchunk->tuples[i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp->t_data - required);
+ }
+ cchunk->tuples[i_min] = newtup = (HeapTuple)destaddr;
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+ cchunk->usage -= required;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+ else
+ {
+ cchunk->usage -= required;
+ newtup = (HeapTuple)(((char *)cchunk) + cchunk->usage);
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+
+ cchunk->tuples[i_min] = newtup;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+ Assert(cchunk->ntups < 10000);
+}
+
+static void
+copy_tuple_properties(HeapTuple newtup, HeapTuple oldtup)
+{
+ ItemPointerCopy(&oldtup->t_self, &newtup->t_self);
+ newtup->t_tableOid = oldtup->t_tableOid;
+ memcpy(&newtup->t_data->t_choice.t_heap,
+ &oldtup->t_data->t_choice.t_heap,
+ sizeof(HeapTupleFields));
+ ItemPointerCopy(&oldtup->t_data->t_ctid,
+ &newtup->t_data->t_ctid);
+ newtup->t_data->t_infomask
+ = ((newtup->t_data->t_infomask & ~HEAP_XACT_MASK) |
+ (oldtup->t_data->t_infomask & HEAP_XACT_MASK));
+ newtup->t_data->t_infomask2
+ = ((newtup->t_data->t_infomask2 & ~HEAP2_XACT_MASK) |
+ (oldtup->t_data->t_infomask2 & HEAP2_XACT_MASK));
+}
+
+static bool
+ccache_insert_tuple_internal(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ HeapTuple newtup)
+{
+ ItemPointer ctid = &newtup->t_self;
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int required = MAXALIGN(HEAPTUPLESIZE + newtup->t_len);
+
+ Assert(cchunk->ntups > 0);
+retry:
+ min_ctid = &cchunk->tuples[0]->t_self;
+ max_ctid = &cchunk->tuples[cchunk->ntups - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (!cchunk->left && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->left)
+ {
+ cchunk->left = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->left)
+ return false;
+ cchunk->l_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->left, newtup))
+ return false;
+ }
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (!cchunk->right && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, newtup))
+ return false;
+ }
+ }
+ else
+ {
+ if (required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ HeapTuple movtup;
+
+ /* push out largest ctid until we get enough space */
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ movtup = cchunk->tuples[cchunk->ntups - 1];
+
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, movtup))
+ return false;
+
+ cchunk->ntups--;
+ cchunk->usage += MAXALIGN(HEAPTUPLESIZE + movtup->t_len);
+
+ goto retry;
+ }
+ }
+ /* Rebalance the tree, if needed */
+ ccache_rebalance_tree(ccache, cchunk);
+
+ return true;
+}
+
+bool
+ccache_insert_tuple(ccache_head *ccache, Relation rel, HeapTuple tuple)
+{
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ HeapTuple newtup;
+ Datum *cs_values = alloca(sizeof(Datum) * tupdesc->natts);
+ bool *cs_isnull = alloca(sizeof(bool) * tupdesc->natts);
+ ccache_chunk *cchunk;
+ int required;
+ int i, j;
+ bool rc;
+
+ /* remove unreferenced columns */
+ heap_deform_tuple(tuple, tupdesc, cs_values, cs_isnull);
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ j = i + 1 - FirstLowInvalidHeapAttributeNumber;
+
+ if (!bms_is_member(j, &ccache->attrs_used))
+ cs_isnull[i] = true;
+ }
+ newtup = heap_form_tuple(tupdesc, cs_values, cs_isnull);
+ copy_tuple_properties(newtup, tuple);
+
+ required = MAXALIGN(HEAPTUPLESIZE + newtup->t_len);
+
+ cchunk = ccache->root_chunk;
+ if (cchunk->ntups == 0)
+ {
+ HeapTuple tup;
+
+ cchunk->usage -= required;
+ cchunk->tuples[0] = tup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(tup, newtup, HEAPTUPLESIZE);
+ tup->t_data = (HeapTupleHeader)((char *)tup + HEAPTUPLESIZE);
+ memcpy(tup->t_data, newtup->t_data, newtup->t_len);
+ cchunk->ntups++;
+ rc = true;
+ }
+ else
+ rc = ccache_insert_tuple_internal(ccache, ccache->root_chunk, newtup);
+
+ return rc;
+}
+
+/*
+ * ccache_find_tuple
+ *
+ * It find a tuple that satisfies the supplied ItemPointer according to
+ * the ScanDirection. If NoMovementScanDirection, it returns a tuple that
+ * has strictly same ItemPointer. On the other hand, it returns a tuple
+ * that has the least ItemPointer greater than the supplied one if
+ * ForwardScanDirection, and also returns a tuple with the greatest
+ * ItemPointer smaller than the supplied one if BackwardScanDirection.
+ */
+HeapTuple
+ccache_find_tuple(ccache_chunk *cchunk, ItemPointer ctid,
+ ScanDirection direction)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ HeapTuple tuple = NULL;
+ int i_min = 0;
+ int i_max = cchunk->ntups - 1;
+ int rc;
+
+ if (cchunk->ntups == 0)
+ return false;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max]->t_self;
+
+ if ((rc = ItemPointerCompare(ctid, min_ctid)) <= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == ForwardScanDirection))
+ {
+ if (cchunk->ntups > direction)
+ return cchunk->tuples[direction];
+ }
+ else
+ {
+ if (cchunk->left)
+ tuple = ccache_find_tuple(cchunk->left, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == ForwardScanDirection)
+ return cchunk->tuples[0];
+ return tuple;
+ }
+ }
+
+ if ((rc = ItemPointerCompare(ctid, max_ctid)) >= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == BackwardScanDirection))
+ {
+ if (i_max + direction >= 0)
+ return cchunk->tuples[i_max + direction];
+ }
+ else
+ {
+ if (cchunk->right)
+ tuple = ccache_find_tuple(cchunk->right, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == BackwardScanDirection)
+ return cchunk->tuples[i_max];
+ return tuple;
+ }
+ }
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == NoMovementScanDirection)
+ return cchunk->tuples[i_min];
+ else if (direction == ForwardScanDirection)
+ {
+ Assert(i_min + 1 < cchunk->ntups);
+ return cchunk->tuples[i_min + 1];
+ }
+ }
+ else
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == ForwardScanDirection)
+ return cchunk->tuples[i_min];
+ }
+ return NULL;
+}
+
+/*
+ * ccache_delete_tuple
+ *
+ * It synchronizes the properties of tuple being already cached, usually
+ * for deletion.
+ */
+bool
+ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup)
+{
+ HeapTuple tuple;
+
+ tuple = ccache_find_tuple(ccache->root_chunk, &oldtup->t_self,
+ NoMovementScanDirection);
+ if (!tuple)
+ return false;
+
+ copy_tuple_properties(tuple, oldtup);
+
+ return true;
+}
+
+/*
+ * ccache_merge_chunk
+ *
+ * It merges two chunks if these have enough free space to consolidate
+ * its contents into one.
+ */
+static void
+ccache_merge_chunk(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ ccache_chunk *curr;
+ ccache_chunk **upper;
+ int *p_depth;
+ int i;
+ bool needs_rebalance = false;
+
+ /* find the least right node that has no left node */
+ upper = &cchunk->right;
+ p_depth = &cchunk->r_depth;
+ curr = cchunk->right;
+ while (curr != NULL)
+ {
+ if (!curr->left)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ Size total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if (offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *rchunk = curr->right;
+
+ /* merge contents */
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ cchunk->usage -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+
+ cchunk->tuples[cchunk->ntups++] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->right;
+ *p_depth = curr->r_depth;
+ if (rchunk)
+ rchunk->upper = curr->upper;
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->left;
+ p_depth = &curr->l_depth;
+ curr = cchunk->left;
+ }
+
+ /* find the greatest left node that has no right node */
+ upper = &cchunk->left;
+ p_depth = &cchunk->l_depth;
+ curr = cchunk->left;
+ while (curr != NULL)
+ {
+ if (!curr->right)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ Size total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if (offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *lchunk = curr->left;
+ Size offset;
+
+ /* merge contents */
+ memmove((char *)cchunk + cchunk->usage - shift,
+ (char *)cchunk + cchunk->usage,
+ shmseg_blocksize - cchunk->usage);
+ for (i=cchunk->ntups - 1; i >= 0; i--)
+ {
+ HeapTuple temp
+ = (HeapTuple)((char *)cchunk->tuples[i] - shift);
+
+ cchunk->tuples[curr->ntups + i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp +
+ HEAPTUPLESIZE);
+ }
+ cchunk->usage -= shift;
+ cchunk->ntups += curr->ntups;
+
+ /* merge contents */
+ offset = shmseg_blocksize;
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ offset -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + offset);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+ cchunk->tuples[i] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->left;
+ *p_depth = curr->l_depth;
+ if (lchunk)
+ lchunk->upper = curr->upper;
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->right;
+ p_depth = &curr->r_depth;
+ curr = cchunk->right;
+ }
+ /* Rebalance the tree, if needed */
+ if (needs_rebalance)
+ ccache_rebalance_tree(ccache, cchunk);
+}
+
+/*
+ * ccache_vacuum_page
+ *
+ * It reclaims the tuples being already vacuumed. It shall be kicked on
+ * the callback function of heap_page_prune_hook to synchronize contents
+ * of the cache with on-disk image.
+ */
+static void
+ccache_vacuum_tuple(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ ItemPointer ctid)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+
+ if (cchunk->ntups == 0)
+ return;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (cchunk->left)
+ ccache_vacuum_tuple(ccache, cchunk->left, ctid);
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (cchunk->right)
+ ccache_vacuum_tuple(ccache, cchunk->right, ctid);
+ }
+ else
+ {
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ HeapTuple tuple = cchunk->tuples[i_min];
+ int length = MAXALIGN(HEAPTUPLESIZE + tuple->t_len);
+
+ if (i_min < cchunk->ntups - 1)
+ {
+ int j;
+
+ memmove((char *)cchunk + cchunk->usage + length,
+ (char *)cchunk + cchunk->usage,
+ (Size)tuple - ((Size)cchunk + cchunk->usage));
+ for (j=i_min + 1; j < cchunk->ntups; j++)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[j] + length);
+ cchunk->tuples[j-1] = temp;
+ temp->t_data
+ = (HeapTupleHeader)((char *)temp->t_data + length);
+ }
+ }
+ cchunk->usage += length;
+ cchunk->ntups--;
+ }
+ }
+ /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);
+}
+
+void
+ccache_vacuum_page(ccache_head *ccache, Buffer buffer)
+{
+ /* XXX it needs buffer is valid and pinned */
+ BlockNumber blknum = BufferGetBlockNumber(buffer);
+ Page page = BufferGetPage(buffer);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber offnum;
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemPointerData ctid;
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsNormal(itemid))
+ continue;
+
+ ItemPointerSetBlockNumber(&ctid, blknum);
+ ItemPointerSetOffsetNumber(&ctid, offnum);
+
+ ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
+ }
+}
+
+static void
+ccache_release_all_chunks(ccache_chunk *cchunk)
+{
+ if (cchunk->left)
+ ccache_release_all_chunks(cchunk->left);
+ if (cchunk->right)
+ ccache_release_all_chunks(cchunk->right);
+ cs_free_shmblock(cchunk);
+}
+
+static void
+track_ccache_locally(ccache_head *ccache)
+{
+ ccache_entry *entry;
+ dlist_node *dnode;
+
+ if (dlist_is_empty(&ccache_free_list))
+ {
+ int i;
+
+ PG_TRY();
+ {
+ for (i=0; i < 20; i++)
+ {
+ entry = MemoryContextAlloc(TopMemoryContext,
+ sizeof(ccache_entry));
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+ PG_CATCH();
+ {
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+ dnode = dlist_pop_head_node(&ccache_free_list);
+ entry = dlist_container(ccache_entry, chain, dnode);
+ entry->owner = CurrentResourceOwner;
+ entry->ccache = ccache;
+ dlist_push_tail(&ccache_local_list, &entry->chain);
+}
+
+static void
+untrack_ccache_locally(ccache_head *ccache)
+{
+ dlist_mutable_iter iter;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->ccache == ccache &&
+ entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ return;
+ }
+ }
+}
+
+static void
+cs_put_ccache_nolock(ccache_head *ccache)
+{
+ Assert(ccache->refcnt > 0);
+ if (--ccache->refcnt == 0)
+ {
+ ccache_release_all_chunks(ccache->root_chunk);
+ dlist_delete(&ccache->hash_chain);
+ dlist_delete(&ccache->lru_chain);
+ dlist_push_head(&cs_ccache_hash->free_list, &ccache->hash_chain);
+ }
+ untrack_ccache_locally(ccache);
+}
+
+void
+cs_put_ccache(ccache_head *cache)
+{
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(cache);
+ SpinLockRelease(&cs_ccache_hash->lock);
+}
+
+static ccache_head *
+cs_create_ccache(Oid tableoid, Bitmapset *attrs_used)
+{
+ ccache_head *temp;
+ ccache_head *new_cache;
+ dlist_node *dnode;
+ int i;
+
+ /*
+ * Here is no columnar cache of this relation or cache attributes are
+ * not enough to run the required query. So, it tries to create a new
+ * ccache_head for the upcoming cache-scan.
+ * Also allocate ones, if we have no free ccache_head any more.
+ */
+ if (dlist_is_empty(&cs_ccache_hash->free_list))
+ {
+ char *buffer;
+ int offset;
+ int nwords, size;
+
+ buffer = cs_alloc_shmblock();
+ if (!buffer)
+ return NULL;
+
+ nwords = (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+ size = MAXALIGN(offsetof(ccache_head,
+ attrs_used.words[nwords + 1]));
+ for (offset = 0; offset <= shmseg_blocksize - size; offset += size)
+ {
+ temp = (ccache_head *)(buffer + offset);
+
+ dlist_push_tail(&cs_ccache_hash->free_list, &temp->hash_chain);
+ }
+ }
+ dnode = dlist_pop_head_node(&cs_ccache_hash->free_list);
+ new_cache = dlist_container(ccache_head, hash_chain, dnode);
+
+ i = cs_ccache_hash->lwlocks_usage++ % ccache_hash_size;
+ new_cache->lock = cs_ccache_hash->lwlocks[i];
+ new_cache->refcnt = 2;
+ new_cache->status = CCACHE_STATUS_INITIALIZED;
+
+ new_cache->tableoid = tableoid;
+ new_cache->root_chunk = ccache_alloc_chunk(new_cache, NULL);
+ if (!new_cache->root_chunk)
+ {
+ dlist_push_head(&cs_ccache_hash->free_list, &new_cache->hash_chain);
+ return NULL;
+ }
+
+ if (attrs_used)
+ memcpy(&new_cache->attrs_used, attrs_used,
+ offsetof(Bitmapset, words[attrs_used->nwords]));
+ else
+ {
+ new_cache->attrs_used.nwords = 1;
+ new_cache->attrs_used.words[0] = 0;
+ }
+ return new_cache;
+}
+
+ccache_head *
+cs_get_ccache(Oid tableoid, Bitmapset *attrs_used, bool create_on_demand)
+{
+ Datum hash = hash_any((unsigned char *)&tableoid, sizeof(Oid));
+ Index i = hash % ccache_hash_size;
+ dlist_iter iter;
+ ccache_head *old_cache = NULL;
+ ccache_head *new_cache = NULL;
+ ccache_head *temp;
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ PG_TRY();
+ {
+ /*
+ * Try to find out existing ccache that has all the columns being
+ * referenced in this query.
+ */
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ temp = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ if (tableoid != temp->tableoid)
+ continue;
+
+ if (bms_is_subset(attrs_used, &temp->attrs_used))
+ {
+ temp->refcnt++;
+ if (create_on_demand)
+ dlist_move_head(&cs_ccache_hash->lru_list,
+ &temp->lru_chain);
+ new_cache = temp;
+ goto out_unlock;
+ }
+ old_cache = temp;
+ break;
+ }
+
+ if (create_on_demand)
+ {
+ if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
+
+ new_cache = cs_create_ccache(tableoid, attrs_used);
+ if (!new_cache)
+ goto out_unlock;
+
+ dlist_push_head(&cs_ccache_hash->slots[i], &new_cache->hash_chain);
+ dlist_push_head(&cs_ccache_hash->lru_list, &new_cache->lru_chain);
+ if (old_cache)
+ cs_put_ccache_nolock(old_cache);
+ }
+ }
+ PG_CATCH();
+ {
+ SpinLockRelease(&cs_ccache_hash->lock);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+out_unlock:
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ if (new_cache)
+ track_ccache_locally(new_cache);
+
+ return new_cache;
+}
+
+typedef struct {
+ Oid tableoid;
+ int status;
+ ccache_chunk *cchunk;
+ ccache_chunk *upper;
+ ccache_chunk *right;
+ ccache_chunk *left;
+ int r_depth;
+ int l_depth;
+ uint32 ntups;
+ uint32 usage;
+ ItemPointerData min_ctid;
+ ItemPointerData max_ctid;
+} ccache_status;
+
+static List *
+cache_scan_debuginfo_internal(ccache_head *ccache,
+ ccache_chunk *cchunk, List *result)
+{
+ ccache_status *cstatus = palloc0(sizeof(ccache_status));
+ List *temp;
+
+ if (cchunk->left)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->left, NIL);
+ result = list_concat(result, temp);
+ }
+ cstatus->tableoid = ccache->tableoid;
+ cstatus->status = ccache->status;
+ cstatus->cchunk = cchunk;
+ cstatus->upper = cchunk->upper;
+ cstatus->right = cchunk->right;
+ cstatus->left = cchunk->left;
+ cstatus->r_depth = cchunk->r_depth;
+ cstatus->l_depth = cchunk->l_depth;
+ cstatus->ntups = cchunk->ntups;
+ cstatus->usage = cchunk->usage;
+ if (cchunk->ntups > 0)
+ {
+ ItemPointerCopy(&cchunk->tuples[0]->t_self,
+ &cstatus->min_ctid);
+ ItemPointerCopy(&cchunk->tuples[cchunk->ntups - 1]->t_self,
+ &cstatus->max_ctid);
+ }
+ else
+ {
+ ItemPointerSet(&cstatus->min_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ ItemPointerSet(&cstatus->max_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ }
+ result = lappend(result, cstatus);
+
+ if (cchunk->right)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->right, NIL);
+ result = list_concat(result, temp);
+ }
+ return result;
+}
+
+/*
+ * cache_scan_debuginfo
+ *
+ * It shows the current status of ccache_chunks being allocated.
+ */
+Datum
+cache_scan_debuginfo(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *fncxt;
+ List *cstatus_list;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ TupleDesc tupdesc;
+ MemoryContext oldcxt;
+ int i;
+ dlist_iter iter;
+ List *result = NIL;
+
+ fncxt = SRF_FIRSTCALL_INIT();
+ oldcxt = MemoryContextSwitchTo(fncxt->multi_call_memory_ctx);
+
+ /* make definition of tuple-descriptor */
+ tupdesc = CreateTemplateTupleDesc(12, false);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "tableoid",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 2, "status",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 3, "chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 4, "upper",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 5, "l_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 6, "l_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 7, "r_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 8, "r_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 9, "ntuples",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)10, "usage",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)11, "min_ctid",
+ TIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)12, "max_ctid",
+ TIDOID, -1, 0);
+ fncxt->tuple_desc = BlessTupleDesc(tupdesc);
+
+ /* make a snapshot of the current table cache */
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ for (i=0; i < ccache_hash_size; i++)
+ {
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ ccache_head *ccache
+ = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ ccache->refcnt++;
+ SpinLockRelease(&cs_ccache_hash->lock);
+ track_ccache_locally(ccache);
+
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ result = cache_scan_debuginfo_internal(ccache,
+ ccache->root_chunk,
+ result);
+ LWLockRelease(ccache->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(ccache);
+ }
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ fncxt->user_fctx = result;
+ MemoryContextSwitchTo(oldcxt);
+ }
+ fncxt = SRF_PERCALL_SETUP();
+
+ cstatus_list = (List *)fncxt->user_fctx;
+ if (cstatus_list != NIL &&
+ fncxt->call_cntr < cstatus_list->length)
+ {
+ ccache_status *cstatus = list_nth(cstatus_list, fncxt->call_cntr);
+ Datum values[12];
+ bool isnull[12];
+ HeapTuple tuple;
+
+ memset(isnull, false, sizeof(isnull));
+ values[0] = ObjectIdGetDatum(cstatus->tableoid);
+ if (cstatus->status == CCACHE_STATUS_INITIALIZED)
+ values[1] = CStringGetTextDatum("initialized");
+ else if (cstatus->status == CCACHE_STATUS_IN_PROGRESS)
+ values[1] = CStringGetTextDatum("in-progress");
+ else if (cstatus->status == CCACHE_STATUS_CONSTRUCTED)
+ values[1] = CStringGetTextDatum("constructed");
+ else
+ values[1] = CStringGetTextDatum("unknown");
+ values[2] = CStringGetTextDatum(psprintf("%p", cstatus->cchunk));
+ values[3] = CStringGetTextDatum(psprintf("%p", cstatus->upper));
+ values[4] = Int32GetDatum(cstatus->l_depth);
+ values[5] = CStringGetTextDatum(psprintf("%p", cstatus->left));
+ values[6] = Int32GetDatum(cstatus->r_depth);
+ values[7] = CStringGetTextDatum(psprintf("%p", cstatus->right));
+ values[8] = Int32GetDatum(cstatus->ntups);
+ values[9] = Int32GetDatum(cstatus->usage);
+
+ if (ItemPointerIsValid(&cstatus->min_ctid))
+ values[10] = PointerGetDatum(&cstatus->min_ctid);
+ else
+ isnull[10] = true;
+ if (ItemPointerIsValid(&cstatus->max_ctid))
+ values[11] = PointerGetDatum(&cstatus->max_ctid);
+ else
+ isnull[11] = true;
+
+ tuple = heap_form_tuple(fncxt->tuple_desc, values, isnull);
+
+ SRF_RETURN_NEXT(fncxt, HeapTupleGetDatum(tuple));
+ }
+ SRF_RETURN_DONE(fncxt);
+}
+PG_FUNCTION_INFO_V1(cache_scan_debuginfo);
+
+/*
+ * cs_alloc_shmblock
+ *
+ * It allocates a fixed-length block. The reason why this routine does not
+ * support variable length allocation is to simplify the logic for its purpose.
+ */
+static void *
+cs_alloc_shmblock(void)
+{
+ ccache_head *ccache;
+ dlist_node *dnode;
+ shmseg_block *block;
+ void *address = NULL;
+ int retry = 2;
+
+do_retry:
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ if (dlist_is_empty(&cs_shmseg_head->free_list) && retry-- > 0)
+ {
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ if (!dlist_is_empty(&cs_ccache_hash->lru_list))
+ {
+ dnode = dlist_tail_node(&cs_ccache_hash->lru_list);
+ ccache = dlist_container(ccache_head, lru_chain, dnode);
+
+ cs_put_ccache_nolock(ccache);
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ goto do_retry;
+ }
+
+ if (!dlist_is_empty(&cs_shmseg_head->free_list))
+ {
+ dnode = dlist_pop_head_node(&cs_shmseg_head->free_list);
+ block = dlist_container(shmseg_block, chain, dnode);
+
+ memset(&block->chain, 0, sizeof(dlist_node));
+
+ address = (void *) block->address;
+ }
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ return address;
+}
+
+/*
+ * cs_free_shmblock
+ *
+ * It release a block being allocated by cs_alloc_shmblock
+ */
+static void
+cs_free_shmblock(void *address)
+{
+ Size curr = (Size) address;
+ Size base = cs_shmseg_head->base_address;
+ ulong index;
+ shmseg_block *block;
+
+ Assert((curr - base) % shmseg_blocksize == 0);
+ Assert(curr >= base && curr < base + shmseg_num_blocks * shmseg_blocksize);
+ index = (curr - base) / shmseg_blocksize;
+
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ block = &cs_shmseg_head->blocks[index];
+
+ dlist_push_head(&cs_shmseg_head->free_list, &block->chain);
+
+ SpinLockRelease(&cs_shmseg_head->lock);
+}
+
+static void
+ccache_setup(void)
+{
+ Size curr_address;
+ ulong i;
+ bool found;
+
+ /* allocation of a shared memory segment for table's hash */
+ cs_ccache_hash = ShmemInitStruct("cache_scan: hash of columnar cache",
+ MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(LWLockId) *
+ ccache_hash_size) +
+ MAXALIGN(sizeof(dlist_node) *
+ ccache_hash_size),
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_ccache_hash->lock);
+ dlist_init(&cs_ccache_hash->lru_list);
+ dlist_init(&cs_ccache_hash->free_list);
+ cs_ccache_hash->lwlocks = (void *)(&cs_ccache_hash[1]);
+ cs_ccache_hash->slots
+ = (void *)(&cs_ccache_hash->lwlocks[ccache_hash_size]);
+
+ for (i=0; i < ccache_hash_size; i++)
+ cs_ccache_hash->lwlocks[i] = LWLockAssign();
+ for (i=0; i < ccache_hash_size; i++)
+ dlist_init(&cs_ccache_hash->slots[i]);
+
+ /* allocation of a shared memory segment for columnar cache */
+ cs_shmseg_head = ShmemInitStruct("cache_scan: columnar cache",
+ offsetof(shmseg_head,
+ blocks[shmseg_num_blocks]) +
+ shmseg_num_blocks * shmseg_blocksize,
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_shmseg_head->lock);
+ dlist_init(&cs_shmseg_head->free_list);
+
+ curr_address = MAXALIGN(&cs_shmseg_head->blocks[shmseg_num_blocks]);
+
+ cs_shmseg_head->base_address = curr_address;
+ for (i=0; i < shmseg_num_blocks; i++)
+ {
+ shmseg_block *block = &cs_shmseg_head->blocks[i];
+
+ block->address = curr_address;
+ dlist_push_tail(&cs_shmseg_head->free_list, &block->chain);
+
+ curr_address += shmseg_blocksize;
+ }
+}
+
+void
+ccache_init(void)
+{
+ /* setup GUC variables */
+ DefineCustomIntVariable("cache_scan.block_size",
+ "block size of in-memory columnar cache",
+ NULL,
+ &shmseg_blocksize,
+ 2048 * 1024, /* 2MB */
+ 1024 * 1024, /* 1MB */
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+ if ((shmseg_blocksize & (shmseg_blocksize - 1)) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cache_scan.block_size must be power of 2")));
+
+ DefineCustomIntVariable("cache_scan.num_blocks",
+ "number of in-memory columnar cache blocks",
+ NULL,
+ &shmseg_num_blocks,
+ 64,
+ 64,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.hash_size",
+ "number of hash slots for columnar cache",
+ NULL,
+ &ccache_hash_size,
+ 128,
+ 128,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.max_cached_attnum",
+ "max attribute number we can cache",
+ NULL,
+ &max_cached_attnum,
+ 256,
+ sizeof(bitmapword) * BITS_PER_BYTE,
+ 2048,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* request shared memory segment for table's cache */
+ RequestAddinShmemSpace(MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(dlist_head) * ccache_hash_size) +
+ MAXALIGN(sizeof(LWLockId) * ccache_hash_size) +
+ MAXALIGN(offsetof(shmseg_head,
+ blocks[shmseg_num_blocks])) +
+ shmseg_num_blocks * shmseg_blocksize);
+ RequestAddinLWLocks(ccache_hash_size);
+
+ shmem_startup_next = shmem_startup_hook;
+ shmem_startup_hook = ccache_setup;
+
+ /* register resource-release callback */
+ dlist_init(&ccache_local_list);
+ dlist_init(&ccache_free_list);
+ RegisterResourceReleaseCallback(ccache_on_resource_release, NULL);
+}
diff --git a/contrib/cache_scan/cscan.c b/contrib/cache_scan/cscan.c
new file mode 100644
index 0000000..6da6b4a
--- /dev/null
+++ b/contrib/cache_scan/cscan.c
@@ -0,0 +1,668 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cscan.c
+ *
+ * An extension that offers an alternative way to scan a table utilizing column
+ * oriented database cache.
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "access/heapam.h"
+#include "access/relscan.h"
+#include "access/sysattr.h"
+#include "catalog/objectaccess.h"
+#include "catalog/pg_language.h"
+#include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
+#include "commands/trigger.h"
+#include "executor/nodeCustom.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/var.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/guc.h"
+#include "utils/spccache.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+#include "cache_scan.h"
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+/* Static variables */
+static add_scan_path_hook_type add_scan_path_next = NULL;
+static object_access_hook_type object_access_next = NULL;
+static heap_page_prune_hook_type heap_page_prune_next = NULL;
+
+static bool cache_scan_disabled;
+
+static bool
+cs_estimate_costs(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Relation rel,
+ CustomPath *cpath,
+ Bitmapset **attrs_used)
+{
+ ListCell *lc;
+ ccache_head *ccache;
+ Oid tableoid = RelationGetRelid(rel);
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ int total_width = 0;
+ int tuple_width = 0;
+ double hit_ratio;
+ Cost run_cost = 0.0;
+ Cost startup_cost = 0.0;
+ double tablespace_page_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ int i;
+
+ /* Mark the path with the correct row estimate */
+ if (cpath->path.param_info)
+ cpath->path.rows = cpath->path.param_info->ppi_rows;
+ else
+ cpath->path.rows = baserel->rows;
+
+ /* List up all the columns being in-use */
+ pull_varattnos((Node *) baserel->reltargetlist,
+ baserel->relid,
+ attrs_used);
+ foreach(lc, baserel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+ pull_varattnos((Node *) rinfo->clause,
+ baserel->relid,
+ attrs_used);
+ }
+
+ for (i=FirstLowInvalidHeapAttributeNumber + 1; i <= 0; i++)
+ {
+ int attidx = i - FirstLowInvalidHeapAttributeNumber;
+
+ if (bms_is_member(attidx, *attrs_used))
+ {
+ /* oid and whole-row reference is not supported */
+ if (i == ObjectIdAttributeNumber || i == InvalidAttrNumber)
+ return false;
+
+ /* clear system attributes from the bitmap */
+ *attrs_used = bms_del_member(*attrs_used, attidx);
+ }
+ }
+
+ /*
+ * Because of layout on the shared memory segment, we have to restrict
+ * the largest attribute number in use to prevent overrun by growth of
+ * Bitmapset.
+ */
+ if (*attrs_used &&
+ (*attrs_used)->nwords > ccache_max_attribute_number())
+ return false;
+
+ /*
+ * Estimation of average width of cached tuples - it does not make
+ * sense to construct a new cache if its average width is more than
+ * 30% of the raw data.
+ */
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ Form_pg_attribute attr = tupdesc->attrs[i];
+ int attidx = i + 1 - FirstLowInvalidHeapAttributeNumber;
+ int width;
+
+ if (attr->attlen > 0)
+ width = attr->attlen;
+ else
+ width = get_attavgwidth(tableoid, attr->attnum);
+
+ total_width += width;
+ if (bms_is_member(attidx, *attrs_used))
+ tuple_width += width;
+ }
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), *attrs_used, false);
+ if (!ccache)
+ {
+ if ((double)tuple_width / (double)total_width > 0.3)
+ return false;
+ hit_ratio = 0.05;
+ }
+ else
+ {
+ hit_ratio = 0.95;
+ cs_put_ccache(ccache);
+ }
+
+ get_tablespace_page_costs(baserel->reltablespace,
+ NULL,
+ &tablespace_page_cost);
+ /* Disk costs */
+ run_cost += (1.0 - hit_ratio) * tablespace_page_cost * baserel->pages;
+
+ /* CPU costs */
+ get_restriction_qual_cost(root, baserel,
+ cpath->path.param_info,
+ &qpqual_cost);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+ run_cost += cpu_per_tuple * baserel->tuples;
+
+ cpath->path.startup_cost = startup_cost;
+ cpath->path.total_cost = startup_cost + run_cost;
+
+ return true;
+}
+
+static bool
+cs_relation_has_invalidator(Relation rel)
+{
+ int i, numtriggers;
+ bool has_on_insert_invalidator = false;
+ bool has_on_update_invalidator = false;
+ bool has_on_delete_invalidator = false;
+
+ /*
+ * Cacheable tables must have invalidation trigger on UPDATE and DELETE.
+ */
+ if (!rel->trigdesc)
+ return false;
+
+ numtriggers = rel->trigdesc->numtriggers;
+ for (i=0; i < numtriggers; i++)
+ {
+ Trigger *trig = rel->trigdesc->triggers + i;
+ HeapTuple tup;
+
+ if (!trig->tgenabled)
+ continue;
+
+ tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(trig->tgfoid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for function %u", trig->tgfoid);
+
+ if (((Form_pg_proc) GETSTRUCT(tup))->prolang == ClanguageId)
+ {
+ Datum value;
+ bool isnull;
+ char *prosrc;
+ char *probin;
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_prosrc, &isnull);
+ if (isnull)
+ elog(ERROR, "null prosrc for C function %u", trig->tgoid);
+ prosrc = TextDatumGetCString(value);
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_probin, &isnull);
+ if (isnull)
+ elog(ERROR, "null probin for C function %u", trig->tgoid);
+ probin = TextDatumGetCString(value);
+
+ if (strcmp(prosrc, "cache_scan_invalidation_trigger") == 0 &&
+ strcmp(probin, "$libdir/cache_scan") == 0 &&
+ (trig->tgtype & (TRIGGER_TYPE_ROW | TRIGGER_TYPE_AFTER))
+ == (TRIGGER_TYPE_ROW | TRIGGER_TYPE_AFTER))
+ {
+ if ((trig->tgtype & TRIGGER_TYPE_INSERT) != 0)
+ has_on_insert_invalidator = true;
+ if ((trig->tgtype & TRIGGER_TYPE_UPDATE) != 0)
+ has_on_update_invalidator = true;
+ if ((trig->tgtype & TRIGGER_TYPE_DELETE) != 0)
+ has_on_delete_invalidator = true;
+ }
+ pfree(prosrc);
+ pfree(probin);
+ }
+ ReleaseSysCache(tup);
+ }
+ if (has_on_insert_invalidator &&
+ has_on_update_invalidator &&
+ has_on_delete_invalidator)
+ return true;
+ return false;
+}
+
+
+static void
+cs_add_scan_path(PlannerInfo *root,
+ RelOptInfo *baserel,
+ RangeTblEntry *rte)
+{
+ Relation rel;
+
+ /* call the secondary hook if exist */
+ if (add_scan_path_next)
+ (*add_scan_path_next)(root, baserel, rte);
+
+ /* Is this feature available now? */
+ if (cache_scan_disabled)
+ return;
+
+ /* Only regular tables can be cached */
+ if (baserel->reloptkind != RELOPT_BASEREL ||
+ rte->rtekind != RTE_RELATION)
+ return;
+
+ /* Core code should already acquire an appropriate lock */
+ rel = heap_open(rte->relid, NoLock);
+
+ if (cs_relation_has_invalidator(rel))
+ {
+ CustomPath *cpath = makeNode(CustomPath);
+ Relids required_outer;
+ Bitmapset *attrs_used = NULL;
+
+ /*
+ * We don't support pushing join clauses into the quals of a ctidscan,
+ * but it could still have required parameterization due to LATERAL
+ * refs in its tlist.
+ */
+ required_outer = baserel->lateral_relids;
+
+ cpath->path.pathtype = T_CustomScan;
+ cpath->path.parent = baserel;
+ cpath->path.param_info = get_baserel_parampathinfo(root, baserel,
+ required_outer);
+ if (cs_estimate_costs(root, baserel, rel, cpath, &attrs_used))
+ {
+ cpath->custom_name = pstrdup("cache scan");
+ cpath->custom_flags = 0;
+ cpath->custom_private
+ = list_make1(makeString(bms_to_string(attrs_used)));
+
+ add_path(baserel, &cpath->path);
+ }
+ }
+ heap_close(rel, NoLock);
+}
+
+static void
+cs_init_custom_scan_plan(PlannerInfo *root,
+ CustomScan *cscan_plan,
+ CustomPath *cscan_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ List *quals = NIL;
+ ListCell *lc;
+
+ /* should be a base relation */
+ Assert(cscan_path->path.parent->relid > 0);
+ Assert(cscan_path->path.parent->rtekind == RTE_RELATION);
+
+ /* extract the supplied RestrictInfo */
+ foreach (lc, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst(lc);
+ quals = lappend(quals, rinfo->clause);
+ }
+
+ /* do nothing something special pushing-down */
+ cscan_plan->scan.plan.targetlist = tlist;
+ cscan_plan->scan.plan.qual = quals;
+ cscan_plan->custom_private = cscan_path->custom_private;
+}
+
+typedef struct
+{
+ ccache_head *ccache;
+ ItemPointerData curr_ctid;
+ bool normal_seqscan;
+ bool with_construction;
+} cs_state;
+
+static void
+cs_begin_custom_scan(CustomScanState *node, int eflags)
+{
+ CustomScan *cscan = (CustomScan *)node->ss.ps.plan;
+ Relation rel = node->ss.ss_currentRelation;
+ EState *estate = node->ss.ps.state;
+ HeapScanDesc scandesc = NULL;
+ cs_state *csstate;
+ Bitmapset *attrs_used;
+ ccache_head *ccache;
+
+ csstate = palloc0(sizeof(cs_state));
+
+ attrs_used = bms_from_string(strVal(linitial(cscan->custom_private)));
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), attrs_used, true);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ if (ccache->status != CCACHE_STATUS_CONSTRUCTED)
+ {
+ LWLockRelease(ccache->lock);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status == CCACHE_STATUS_INITIALIZED)
+ {
+ ccache->status = CCACHE_STATUS_IN_PROGRESS;
+ csstate->with_construction = true;
+ scandesc = heap_beginscan(rel, SnapshotAny, 0, NULL);
+ }
+ else if (ccache->status == CCACHE_STATUS_IN_PROGRESS)
+ {
+ csstate->normal_seqscan = true;
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ }
+ }
+ LWLockRelease(ccache->lock);
+ csstate->ccache = ccache;
+
+ /* seek to the first position */
+ if (estate->es_direction == ForwardScanDirection)
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, 0);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, 0);
+ }
+ else
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, MaxBlockNumber);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, MaxOffsetNumber);
+ }
+ }
+ else
+ {
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ csstate->normal_seqscan = true;
+ }
+ node->ss.ss_currentScanDesc = scandesc;
+
+ node->custom_state = csstate;
+}
+
+static bool
+cache_scan_needs_next(HeapTuple tuple, Snapshot snapshot, Buffer buffer)
+{
+ bool visibility;
+
+ /* end of the scan */
+ if (!HeapTupleIsValid(tuple))
+ return false;
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ visibility = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ return !visibility ? true : false;
+}
+
+static TupleTableSlot *
+cache_scan_next(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+ Relation rel = node->ss.ss_currentRelation;
+ HeapScanDesc scan = node->ss.ss_currentScanDesc;
+ TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
+ EState *estate = node->ss.ps.state;
+ Snapshot snapshot = estate->es_snapshot;
+ HeapTuple tuple;
+ Buffer buffer;
+
+ if (csstate->normal_seqscan)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, scan->rs_cbuf, false);
+ else
+ ExecClearTuple(slot);
+ return slot;
+ }
+
+ do {
+ if (csstate->ccache)
+ {
+ ccache_head *ccache = csstate->ccache;
+
+ if (csstate->with_construction)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (HeapTupleIsValid(tuple))
+ {
+ if (ccache_insert_tuple(ccache, rel, tuple))
+ LWLockRelease(ccache->lock);
+ else
+ {
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ cs_put_ccache(ccache);
+ csstate->ccache = NULL;
+ }
+ }
+ else
+ {
+ ccache->status = CCACHE_STATUS_CONSTRUCTED;
+ LWLockRelease(ccache->lock);
+ }
+ buffer = scan->rs_cbuf;
+ }
+ else
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ tuple = ccache_find_tuple(ccache->root_chunk,
+ &csstate->curr_ctid,
+ estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ {
+ ItemPointerCopy(&tuple->t_self, &csstate->curr_ctid);
+ tuple = heap_copytuple(tuple);
+ }
+ LWLockRelease(ccache->lock);
+ buffer = InvalidBuffer;
+ }
+ }
+ else
+ {
+ Assert(scan != NULL);
+ tuple = heap_getnext(scan, estate->es_direction);
+ buffer = scan->rs_cbuf;
+ }
+ } while (cache_scan_needs_next(tuple, snapshot, buffer));
+
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, buffer, buffer == InvalidBuffer);
+ else
+ ExecClearTuple(slot);
+
+ return slot;
+}
+
+static bool
+cache_scan_recheck(CustomScanState *node, TupleTableSlot *slot)
+{
+ return true;
+}
+
+static TupleTableSlot *
+cs_exec_custom_scan(CustomScanState *node)
+{
+ return ExecScan((ScanState *) node,
+ (ExecScanAccessMtd) cache_scan_next,
+ (ExecScanRecheckMtd) cache_scan_recheck);
+}
+
+static void
+cs_end_custom_scan(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+
+ if (csstate->ccache)
+ cs_put_ccache(csstate->ccache);
+ if (node->ss.ss_currentScanDesc)
+ heap_endscan(node->ss.ss_currentScanDesc);
+}
+
+static void
+cs_rescan_custom_scan(CustomScanState *node)
+{
+ elog(ERROR, "not implemented yet");
+}
+
+Datum
+cache_scan_invalidation_trigger(PG_FUNCTION_ARGS)
+{
+ TriggerData *trigdata = (TriggerData *) fcinfo->context;
+ TriggerEvent tg_event = trigdata->tg_event;
+ Relation rel = trigdata->tg_relation;
+ HeapTuple tuple = trigdata->tg_trigtuple;
+ HeapTuple newtup = trigdata->tg_newtuple;
+ HeapTuple result = NULL;
+ const char *tg_name = trigdata->tg_trigger->tgname;
+ ccache_head *ccache;
+
+ if (!CALLED_AS_TRIGGER(fcinfo))
+ elog(ERROR, "%s: not fired by trigger manager", tg_name);
+
+ if (!TRIGGER_FIRED_AFTER(tg_event) ||
+ !TRIGGER_FIRED_FOR_ROW(tg_event))
+ elog(ERROR, "%s: not fired by AFTER FOR EACH ROW event", tg_name);
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), NULL, false);
+ if (!ccache)
+ return PointerGetDatum(newtup);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ PG_TRY();
+ {
+ if (TRIGGER_FIRED_BY_INSERT(trigdata->tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, tuple);
+ result = tuple;
+ }
+ else if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, newtup);
+ ccache_delete_tuple(ccache, tuple);
+ result = newtup;
+ }
+ else if (TRIGGER_FIRED_BY_DELETE(trigdata->tg_event))
+ {
+ ccache_delete_tuple(ccache, tuple);
+ result = tuple;
+ }
+ else
+ elog(ERROR, "%s: fired by unsupported event", tg_name);
+ }
+ PG_CATCH();
+ {
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+
+ PG_RETURN_POINTER(result);
+}
+PG_FUNCTION_INFO_V1(cache_scan_invalidation_trigger);
+
+static void
+ccache_on_object_access(ObjectAccessType access,
+ Oid classId,
+ Oid objectId,
+ int subId,
+ void *arg)
+{
+ ccache_head *ccache;
+
+ /* ALTER TABLE and DROP TABLE needs cache invalidation */
+ if (access != OAT_DROP && access != OAT_POST_ALTER)
+ return;
+ if (classId != RelationRelationId)
+ return;
+
+ ccache = cs_get_ccache(objectId, NULL, false);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+}
+
+static void
+ccache_on_page_prune(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid)
+{
+ ccache_head *ccache;
+
+ /* call the secondary hook */
+ if (heap_page_prune_next)
+ (*heap_page_prune_next)(relation, buffer, ndeleted,
+ OldestXmin, latestRemovedXid);
+
+ ccache = cs_get_ccache(RelationGetRelid(relation), NULL, false);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ ccache_vacuum_page(ccache, buffer);
+
+ LWLockRelease(ccache->lock);
+
+ cs_put_ccache(ccache);
+ }
+}
+
+void
+_PG_init(void)
+{
+ CustomProvider provider;
+
+ if (IsUnderPostmaster)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cache_scan must be loaded via shared_preload_libraries")));
+
+ DefineCustomBoolVariable("cache_scan.disabled",
+ "turn on/off cache_scan feature on run-time",
+ NULL,
+ &cache_scan_disabled,
+ false,
+ PGC_USERSET,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* initialization of cache subsystem */
+ ccache_init();
+
+ /* callbacks for cache invalidation */
+ object_access_next = object_access_hook;
+ object_access_hook = ccache_on_object_access;
+
+ heap_page_prune_next = heap_page_prune_hook;
+ heap_page_prune_hook = ccache_on_page_prune;
+
+ /* registration of custom scan provider */
+ add_scan_path_next = add_scan_path_hook;
+ add_scan_path_hook = cs_add_scan_path;
+
+ memset(&provider, 0, sizeof(provider));
+ strncpy(provider.name, "cache scan", sizeof(provider.name));
+ provider.InitCustomScanPlan = cs_init_custom_scan_plan;
+ provider.BeginCustomScan = cs_begin_custom_scan;
+ provider.ExecCustomScan = cs_exec_custom_scan;
+ provider.EndCustomScan = cs_end_custom_scan;
+ provider.ReScanCustomScan = cs_rescan_custom_scan;
+
+ register_custom_provider(&provider);
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 27cbac8..1fb5f4a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -42,6 +42,9 @@ typedef struct
bool marked[MaxHeapTuplesPerPage + 1];
} PruneState;
+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
/* Local functions */
static int heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
* and update FSM with the remaining space.
*/
+ /*
+ * This callback allows extensions to synchronize their own status with
+ * heap image on the disk, when this buffer page is vacuumed.
+ */
+ if (heap_page_prune_hook)
+ (*heap_page_prune_hook)(relation,
+ buffer,
+ ndeleted,
+ OldestXmin,
+ prstate.latestRemovedXid);
return ndeleted;
}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
+ *
+ * In case when the supplied HeapTuple is not associated with a particular
+ * buffer, it just returns without any jobs. It may happen when an extension
+ * caches tuple with their own way.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
+ if (BufferIsInvalid(buffer))
+ return;
+
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bfdadc3..9775aad 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -164,6 +164,13 @@ extern void heap_restrpos(HeapScanDesc scan);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
+typedef void (*heap_page_prune_hook_type)(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid);
+extern heap_page_prune_hook_type heap_page_prune_hook;
+
extern void heap_page_prune_opt(Relation relation, Buffer buffer,
TransactionId OldestXmin);
extern int heap_page_prune(Relation relation, Buffer buffer,
Hello,
I revisited the patch for contrib/cache_scan extension.
The previous one had a problem when T-tree node shall be rebalanced
then crashed on merging the node.
Even though contrib/cache_scan portion has more than 2KL code,
things I'd like to have a discussion first is a portion of the
core enhancements to run MVCCsnapshot on the cached tuple, and
to get callback on vacuumed pages for cache synchronization.
Any comments please.
Thanks,
(2014/01/15 0:06), Kohei KaiGai wrote:
Hello,
The attached patch is what we discussed just before the commit-fest:Nov.
It implements an alternative way to scan a particular table using on-memory
cache instead of the usual heap access method. Unlike buffer cache, this
mechanism caches a limited number of columns on the memory, so memory
consumption per tuple is much smaller than the regular heap access method,
thus it allows much larger number of tuples on the memory.I'd like to extend this idea to implement a feature to cache data according to
column-oriented data structure to utilize parallel calculation processors like
CPU's SIMD operations or simple GPU cores. (Probably, it makes sense to
evaluate multiple records with a single vector instruction if contents of
a particular column is put as a large array.)
However, this patch still keeps all the tuples in row-oriented data format,
because row <=> column translation makes this patch bigger than the
current form (about 2KL), and GPU integration needs to link proprietary
library (cuda or opencl) thus I thought it is not preferable for the upstream
code.Also note that this patch needs part-1 ~ part-3 patches of CustomScan
APIs as prerequisites because it is implemented on top of the APIs.One thing I have to apologize is, lack of documentation and source code
comments around the contrib/ code. Please give me a couple of days to
clean-up the code.
Aside from the extension code, I put two enhancement on the core code
as follows. I'd like to have a discussion about adequacy of these enhancement.The first enhancement is a hook on heap_page_prune() to synchronize
internal state of extension with changes of heap image on the disk.
It is not avoidable to hold garbage, increasing time by time, on the cache,
thus needs to clean up as vacuum process doing. The best timing to do
is when dead tuples are reclaimed because it is certain nobody will
reference the tuples any more.diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index f626755..023f78e 100644 --- a/src/backend/utils/time/tqual.c bool marked[MaxHeapTuplesPerPage + 1]; } PruneState;+/* Callback for each page pruning */ +heap_page_prune_hook_type heap_page_prune_hook = NULL; + /* Local functions */ static int heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, @@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, Transacti onId OldestXmin, * and update FSM with the remaining space. */+ /* + * This callback allows extensions to synchronize their own status with + * heap image on the disk, when this buffer page is vacuumed. + */ + if (heap_page_prune_hook) + (*heap_page_prune_hook)(relation, + buffer, + ndeleted, + OldestXmin, + prstate.latestRemovedXid); return ndeleted; }The second enhancement makes SetHintBits() accepts InvalidBuffer to
ignore all the jobs. We need to check visibility of cached tuples when
custom-scan node scans cached table instead of the heap.
Even though we can use MVCC snapshot to check tuple's visibility,
it may internally set hint bit of tuples thus we always needs to give
a valid buffer pointer to HeapTupleSatisfiesVisibility(). Unfortunately,
it kills all the benefit of table cache if it takes to load the heap buffer
being associated with the cached tuple.
So, I'd like to have a special case handling on the SetHintBits() for
dry-run when InvalidBuffer is given.diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index f626755..023f78e 100644 --- a/src/backend/utils/time/tqual.c +++ b/src/backend/utils/time/tqual.c @@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot); * * The caller should pass xid as the XID of the transaction to check, or * InvalidTransactionId if no check is needed. + * + * In case when the supplied HeapTuple is not associated with a particular + * buffer, it just returns without any jobs. It may happen when an extension + * caches tuple with their own way. */ static inline void SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { + if (BufferIsInvalid(buffer)) + return; + if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here! */Thanks,
2013/11/13 Kohei KaiGai <kaigai@kaigai.gr.jp>:
2013/11/12 Tom Lane <tgl@sss.pgh.pa.us>:
Kohei KaiGai <kaigai@kaigai.gr.jp> writes:
So, are you thinking it is a feasible approach to focus on custom-scan
APIs during the upcoming CF3, then table-caching feature as use-case
of this APIs on CF4?Sure. If you work on this extension after CF3, and it reveals that the
custom scan stuff needs some adjustments, there would be time to do that
in CF4. The policy about what can be submitted in CF4 is that we don't
want new major features that no one has seen before, not that you can't
make fixes to previously submitted stuff. Something like a new hook
in vacuum wouldn't be a "major feature", anyway.Thanks for this clarification.
3 days are too short to write a patch, however, 2 month may be sufficient
to develop a feature on top of the scheme being discussed in the previous
comitfest.Best regards,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Attachments:
pgsql-v9.4-custom-scan.part-4.v5.patchtext/plain; charset=Shift_JIS; name=pgsql-v9.4-custom-scan.part-4.v5.patchDownload
contrib/cache_scan/Makefile | 19 +
contrib/cache_scan/cache_scan--1.0.sql | 26 +
contrib/cache_scan/cache_scan--unpackaged--1.0.sql | 3 +
contrib/cache_scan/cache_scan.control | 5 +
contrib/cache_scan/cache_scan.h | 68 +
contrib/cache_scan/ccache.c | 1410 ++++++++++++++++++++
contrib/cache_scan/cscan.c | 761 +++++++++++
doc/src/sgml/cache-scan.sgml | 224 ++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/custom-scan.sgml | 14 +
doc/src/sgml/filelist.sgml | 1 +
src/backend/access/heap/pruneheap.c | 13 +
src/backend/utils/time/tqual.c | 7 +
src/include/access/heapam.h | 7 +
14 files changed, 2559 insertions(+)
diff --git a/contrib/cache_scan/Makefile b/contrib/cache_scan/Makefile
new file mode 100644
index 0000000..4e68b68
--- /dev/null
+++ b/contrib/cache_scan/Makefile
@@ -0,0 +1,19 @@
+# contrib/dbcache/Makefile
+
+MODULE_big = cache_scan
+OBJS = cscan.o ccache.o
+
+EXTENSION = cache_scan
+DATA = cache_scan--1.0.sql cache_scan--unpackaged--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/cache_scan
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
diff --git a/contrib/cache_scan/cache_scan--1.0.sql b/contrib/cache_scan/cache_scan--1.0.sql
new file mode 100644
index 0000000..4bd04d1
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--1.0.sql
@@ -0,0 +1,26 @@
+CREATE FUNCTION public.cache_scan_synchronizer()
+RETURNS trigger
+AS 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE STRICT;
+
+CREATE TYPE public.__cache_scan_debuginfo AS
+(
+ tableoid oid,
+ status text,
+ chunk text,
+ upper text,
+ l_depth int4,
+ l_chunk text,
+ r_depth int4,
+ r_chunk text,
+ ntuples int4,
+ usage int4,
+ min_ctid tid,
+ max_ctid tid
+);
+CREATE FUNCTION public.cache_scan_debuginfo()
+ RETURNS SETOF public.__cache_scan_debuginfo
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT;
+
+
diff --git a/contrib/cache_scan/cache_scan--unpackaged--1.0.sql b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
new file mode 100644
index 0000000..718a2de
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
@@ -0,0 +1,3 @@
+DROP FUNCTION public.cache_scan_synchronizer() CASCADE;
+DROP FUNCTION public.cache_scan_debuginfo() CASCADE;
+DROP TYPE public.__cache_scan_debuginfo;
diff --git a/contrib/cache_scan/cache_scan.control b/contrib/cache_scan/cache_scan.control
new file mode 100644
index 0000000..77946da
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.control
@@ -0,0 +1,5 @@
+# cache_scan extension
+comment = 'custom scan provider for cache-only scan'
+default_version = '1.0'
+module_pathname = '$libdir/cache_scan'
+relocatable = false
diff --git a/contrib/cache_scan/cache_scan.h b/contrib/cache_scan/cache_scan.h
new file mode 100644
index 0000000..d06156e
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.h
@@ -0,0 +1,68 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cache_scan.h
+ *
+ * Definitions for the cache_scan extension
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#ifndef CACHE_SCAN_H
+#define CACHE_SCAN_H
+#include "access/htup_details.h"
+#include "lib/ilist.h"
+#include "nodes/bitmapset.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+
+typedef struct ccache_chunk {
+ struct ccache_chunk *upper; /* link to the upper node */
+ struct ccache_chunk *right; /* link to the greaternode, if exist */
+ struct ccache_chunk *left; /* link to the less node, if exist */
+ int r_depth; /* max depth in right branch */
+ int l_depth; /* max depth in left branch */
+ uint32 ntups; /* number of tuples being cached */
+ uint32 usage; /* usage counter of this chunk */
+ HeapTuple tuples[FLEXIBLE_ARRAY_MEMBER];
+} ccache_chunk;
+
+#define CCACHE_STATUS_INITIALIZED 1
+#define CCACHE_STATUS_IN_PROGRESS 2
+#define CCACHE_STATUS_CONSTRUCTED 3
+
+typedef struct {
+ LWLockId lock; /* used to protect ttree links */
+ volatile int refcnt;
+ int status;
+
+ dlist_node hash_chain; /* linked to ccache_hash->slots[] */
+ dlist_node lru_chain; /* linked to ccache_hash->lru_list */
+
+ Oid tableoid;
+ ccache_chunk *root_chunk;
+ Bitmapset attrs_used; /* !Bitmapset is variable length! */
+} ccache_head;
+
+extern int ccache_max_attribute_number(void);
+extern ccache_head *cs_get_ccache(Oid tableoid, Bitmapset *attrs_used,
+ bool create_on_demand);
+extern void cs_put_ccache(ccache_head *ccache);
+
+extern bool ccache_insert_tuple(ccache_head *ccache,
+ Relation rel, HeapTuple tuple);
+extern bool ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup);
+
+extern void ccache_vacuum_page(ccache_head *ccache, Buffer buffer);
+
+extern HeapTuple ccache_find_tuple(ccache_chunk *cchunk,
+ ItemPointer ctid,
+ ScanDirection direction);
+extern void ccache_init(void);
+
+extern Datum cache_scan_synchronizer(PG_FUNCTION_ARGS);
+extern Datum cache_scan_debuginfo(PG_FUNCTION_ARGS);
+
+extern void _PG_init(void);
+
+#endif /* CACHE_SCAN_H */
diff --git a/contrib/cache_scan/ccache.c b/contrib/cache_scan/ccache.c
new file mode 100644
index 0000000..0bb9ff4
--- /dev/null
+++ b/contrib/cache_scan/ccache.c
@@ -0,0 +1,1410 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/ccache.c
+ *
+ * Routines for columns-culled cache implementation
+ *
+ * Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/heapam.h"
+#include "access/sysattr.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "storage/ipc.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+#include "cache_scan.h"
+
+/*
+ * Hash table to manage all the ccache_head
+ */
+typedef struct {
+ slock_t lock; /* lock of the hash table */
+ dlist_head lru_list; /* list of recently used cache */
+ dlist_head free_list; /* list of free ccache_head */
+ volatile int lwlocks_usage;
+ LWLockId *lwlocks;
+ dlist_head *slots;
+} ccache_hash;
+
+/*
+ * Data structure to manage blocks on the shared memory segment.
+ * This extension acquires (shmseg_blocksize) x (shmseg_num_blocks) bytes of
+ * shared memory, then it shall be split into the fixed-length memory blocks.
+ * All the memory allocation and relase are done by block, to avoid memory
+ * fragmentation that eventually makes implementation complicated.
+ *
+ * The shmseg_head has a spinlock and global free_list to link free blocks.
+ * Its blocks[] array contains shmseg_block structures that points a particular
+ * address of the associated memory block.
+ * The shmseg_block being chained in the free_list of shmseg_head are available
+ * to allocate. Elsewhere, this block is already allocated on somewhere.
+ */
+typedef struct {
+ dlist_node chain;
+ Size address;
+} shmseg_block;
+
+typedef struct {
+ slock_t lock;
+ dlist_head free_list;
+ Size base_address;
+ shmseg_block blocks[FLEXIBLE_ARRAY_MEMBER];
+} shmseg_head;
+
+/*
+ * ccache_entry is used to track ccache_head being acquired by this backend.
+ */
+typedef struct {
+ dlist_node chain;
+ ResourceOwner owner;
+ ccache_head *ccache;
+} ccache_entry;
+
+static dlist_head ccache_local_list;
+static dlist_head ccache_free_list;
+
+/* Static variables */
+static shmem_startup_hook_type shmem_startup_next = NULL;
+
+static ccache_hash *cs_ccache_hash = NULL;
+static shmseg_head *cs_shmseg_head = NULL;
+
+/* GUC variables */
+static int ccache_hash_size;
+static int shmseg_blocksize;
+static int shmseg_num_blocks;
+static int max_cached_attnum;
+
+/* Static functions */
+static void *cs_alloc_shmblock(void);
+static void cs_free_shmblock(void *address);
+
+int
+ccache_max_attribute_number(void)
+{
+ return (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+}
+
+/*
+ * ccache_on_resource_release
+ *
+ * It is a callback to put ccache_head being acquired locally, to keep
+ * consistency of reference counter.
+ */
+static void
+ccache_on_resource_release(ResourceReleasePhase phase,
+ bool isCommit,
+ bool isTopLevel,
+ void *arg)
+{
+ dlist_mutable_iter iter;
+
+ if (phase != RESOURCE_RELEASE_AFTER_LOCKS)
+ return;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+
+ if (isCommit)
+ elog(WARNING, "cache reference leak (tableoid=%u, refcnt=%d)",
+ entry->ccache->tableoid, entry->ccache->refcnt);
+ cs_put_ccache(entry->ccache);
+
+ entry->ccache = NULL;
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+}
+
+static ccache_chunk *
+ccache_alloc_chunk(ccache_head *ccache, ccache_chunk *upper)
+{
+ ccache_chunk *cchunk = cs_alloc_shmblock();
+
+ if (cchunk)
+ {
+ cchunk->upper = upper;
+ cchunk->right = NULL;
+ cchunk->left = NULL;
+ cchunk->r_depth = 0;
+ cchunk->l_depth = 0;
+ cchunk->ntups = 0;
+ cchunk->usage = shmseg_blocksize;
+ }
+ return cchunk;
+}
+
+/*
+ * ccache_rebalance_tree
+ *
+ * It keeps the balance of ccache tree if the supplied chunk has
+ * unbalanced subtrees.
+ */
+#define AssertIfNotShmem(addr) \
+ Assert((addr) == NULL || \
+ (((Size)(addr)) >= cs_shmseg_head->base_address && \
+ ((Size)(addr)) < (cs_shmseg_head->base_address + \
+ shmseg_num_blocks * shmseg_blocksize)))
+
+#define TTREE_DEPTH(chunk) \
+ ((chunk) == 0 ? 0 : Max((chunk)->l_depth, (chunk)->r_depth) + 1)
+
+static void
+ccache_rebalance_tree(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ Assert(cchunk->upper != NULL
+ ? (cchunk->upper->left == cchunk || cchunk->upper->right == cchunk)
+ : (ccache->root_chunk == cchunk));
+
+ if (cchunk->l_depth + 1 < cchunk->r_depth)
+ {
+ /* anticlockwise rotation */
+ ccache_chunk *rchunk = cchunk->right;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->right = rchunk->left;
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+ cchunk->upper = rchunk;
+
+ rchunk->left = cchunk;
+ rchunk->l_depth = TTREE_DEPTH(rchunk->left);
+ rchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = rchunk;
+ else if (upper->left == cchunk)
+ {
+ upper->left = rchunk;
+ upper->l_depth = TTREE_DEPTH(rchunk);
+ }
+ else
+ {
+ upper->right = rchunk;
+ upper->r_depth = TTREE_DEPTH(rchunk);
+ }
+ AssertIfNotShmem(cchunk->right);
+ AssertIfNotShmem(cchunk->left);
+ AssertIfNotShmem(cchunk->upper);
+ AssertIfNotShmem(rchunk->left);
+ AssertIfNotShmem(rchunk->right);
+ AssertIfNotShmem(rchunk->upper);
+ }
+ else if (cchunk->l_depth > cchunk->r_depth + 1)
+ {
+ /* clockwise rotation */
+ ccache_chunk *lchunk = cchunk->left;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->left = lchunk->right;
+ cchunk->l_depth = TTREE_DEPTH(cchunk->left);
+ cchunk->upper = lchunk;
+
+ lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);
+ lchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = lchunk;
+ else if (upper->right == cchunk)
+ {
+ upper->right = lchunk;
+ upper->r_depth = TTREE_DEPTH(lchunk) + 1;
+ }
+ else
+ {
+ upper->left = lchunk;
+ upper->l_depth = TTREE_DEPTH(lchunk) + 1;
+ }
+ AssertIfNotShmem(cchunk->right);
+ AssertIfNotShmem(cchunk->left);
+ AssertIfNotShmem(cchunk->upper);
+ AssertIfNotShmem(lchunk->left);
+ AssertIfNotShmem(lchunk->right);
+ AssertIfNotShmem(lchunk->upper);
+ }
+}
+
+/*
+ * ccache_insert_tuple
+ *
+ * It inserts the supplied tuple, but uncached columns are dropped off,
+ * onto the ccache_head. If no space is left, it expands the t-tree
+ * structure with a chunk newly allocated. If no shared memory space was
+ * left, it returns false.
+ */
+#define cchunk_freespace(cchunk) \
+ ((cchunk)->usage - offsetof(ccache_chunk, tuples[(cchunk)->ntups + 1]))
+
+static void
+do_insert_tuple(ccache_head *ccache, ccache_chunk *cchunk, HeapTuple tuple)
+{
+ HeapTuple newtup;
+ ItemPointer ctid = &tuple->t_self;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+ int i, required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
+
+ Assert(required <= cchunk_freespace(cchunk));
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+
+ if (i_min < cchunk->ntups)
+ {
+ HeapTuple movtup = cchunk->tuples[i_min];
+ Size movlen = HEAPTUPLESIZE + MAXALIGN(movtup->t_len);
+ char *destaddr = (char *)movtup + movlen - required;
+
+ Assert(ItemPointerCompare(&tuple->t_self, &movtup->t_self) < 0);
+
+ memmove((char *)cchunk + cchunk->usage - required,
+ (char *)cchunk + cchunk->usage,
+ ((Size)movtup + movlen) - ((Size)cchunk + cchunk->usage));
+ for (i=cchunk->ntups; i > i_min; i--)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[i-1] - required);
+ cchunk->tuples[i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp->t_data - required);
+ }
+ cchunk->tuples[i_min] = newtup = (HeapTuple)destaddr;
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+ cchunk->usage -= required;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+ else
+ {
+ cchunk->usage -= required;
+ newtup = (HeapTuple)(((char *)cchunk) + cchunk->usage);
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+
+ cchunk->tuples[i_min] = newtup;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+}
+
+static void
+copy_tuple_properties(HeapTuple newtup, HeapTuple oldtup)
+{
+ ItemPointerCopy(&oldtup->t_self, &newtup->t_self);
+ newtup->t_tableOid = oldtup->t_tableOid;
+ memcpy(&newtup->t_data->t_choice.t_heap,
+ &oldtup->t_data->t_choice.t_heap,
+ sizeof(HeapTupleFields));
+ ItemPointerCopy(&oldtup->t_data->t_ctid,
+ &newtup->t_data->t_ctid);
+ newtup->t_data->t_infomask
+ = ((newtup->t_data->t_infomask & ~HEAP_XACT_MASK) |
+ (oldtup->t_data->t_infomask & HEAP_XACT_MASK));
+ newtup->t_data->t_infomask2
+ = ((newtup->t_data->t_infomask2 & ~HEAP2_XACT_MASK) |
+ (oldtup->t_data->t_infomask2 & HEAP2_XACT_MASK));
+}
+
+static bool
+ccache_insert_tuple_internal(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ HeapTuple newtup)
+{
+ ItemPointer ctid = &newtup->t_self;
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int required = MAXALIGN(HEAPTUPLESIZE + newtup->t_len);
+
+ if (cchunk->ntups == 0)
+ {
+ HeapTuple tup;
+
+ cchunk->usage -= required;
+ cchunk->tuples[0] = tup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(tup, newtup, HEAPTUPLESIZE);
+ tup->t_data = (HeapTupleHeader)((char *)tup + HEAPTUPLESIZE);
+ memcpy(tup->t_data, newtup->t_data, newtup->t_len);
+ cchunk->ntups++;
+
+ return true;
+ }
+
+retry:
+ min_ctid = &cchunk->tuples[0]->t_self;
+ max_ctid = &cchunk->tuples[cchunk->ntups - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (!cchunk->left && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->left)
+ {
+ cchunk->left = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->left)
+ return false;
+ cchunk->l_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->left, newtup))
+ return false;
+ cchunk->l_depth = TTREE_DEPTH(cchunk->left);
+ }
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (!cchunk->right && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, newtup))
+ return false;
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+ }
+ }
+ else
+ {
+ if (required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ HeapTuple movtup;
+
+ /* push out largest ctid until we get enough space */
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ movtup = cchunk->tuples[cchunk->ntups - 1];
+
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, movtup))
+ return false;
+
+ cchunk->ntups--;
+ cchunk->usage += MAXALIGN(HEAPTUPLESIZE + movtup->t_len);
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+
+ goto retry;
+ }
+ }
+ /* Rebalance the tree, if needed */
+ ccache_rebalance_tree(ccache, cchunk);
+
+ return true;
+}
+
+bool
+ccache_insert_tuple(ccache_head *ccache, Relation rel, HeapTuple tuple)
+{
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ HeapTuple newtup;
+ Datum *cs_values = alloca(sizeof(Datum) * tupdesc->natts);
+ bool *cs_isnull = alloca(sizeof(bool) * tupdesc->natts);
+ int i, j;
+
+ /* remove unreferenced columns */
+ heap_deform_tuple(tuple, tupdesc, cs_values, cs_isnull);
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ j = i + 1 - FirstLowInvalidHeapAttributeNumber;
+
+ if (!bms_is_member(j, &ccache->attrs_used))
+ cs_isnull[i] = true;
+ }
+ newtup = heap_form_tuple(tupdesc, cs_values, cs_isnull);
+ copy_tuple_properties(newtup, tuple);
+
+ return ccache_insert_tuple_internal(ccache, ccache->root_chunk, newtup);
+}
+
+/*
+ * ccache_find_tuple
+ *
+ * It find a tuple that satisfies the supplied ItemPointer according to
+ * the ScanDirection. If NoMovementScanDirection, it returns a tuple that
+ * has strictly same ItemPointer. On the other hand, it returns a tuple
+ * that has the least ItemPointer greater than the supplied one if
+ * ForwardScanDirection, and also returns a tuple with the greatest
+ * ItemPointer smaller than the supplied one if BackwardScanDirection.
+ */
+HeapTuple
+ccache_find_tuple(ccache_chunk *cchunk, ItemPointer ctid,
+ ScanDirection direction)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ HeapTuple tuple = NULL;
+ int i_min = 0;
+ int i_max = cchunk->ntups - 1;
+ int rc;
+
+ if (cchunk->ntups == 0)
+ return false;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max]->t_self;
+
+ if ((rc = ItemPointerCompare(ctid, min_ctid)) <= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == ForwardScanDirection))
+ {
+ if (cchunk->ntups > direction)
+ return cchunk->tuples[direction];
+ }
+ else
+ {
+ if (cchunk->left)
+ tuple = ccache_find_tuple(cchunk->left, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == ForwardScanDirection)
+ return cchunk->tuples[0];
+ return tuple;
+ }
+ }
+
+ if ((rc = ItemPointerCompare(ctid, max_ctid)) >= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == BackwardScanDirection))
+ {
+ if (i_max + direction >= 0)
+ return cchunk->tuples[i_max + direction];
+ }
+ else
+ {
+ if (cchunk->right)
+ tuple = ccache_find_tuple(cchunk->right, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == BackwardScanDirection)
+ return cchunk->tuples[i_max];
+ return tuple;
+ }
+ }
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == NoMovementScanDirection)
+ return cchunk->tuples[i_min];
+ else if (direction == ForwardScanDirection)
+ {
+ Assert(i_min + 1 < cchunk->ntups);
+ return cchunk->tuples[i_min + 1];
+ }
+ }
+ else
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == ForwardScanDirection)
+ return cchunk->tuples[i_min];
+ }
+ return NULL;
+}
+
+/*
+ * ccache_delete_tuple
+ *
+ * It synchronizes the properties of tuple being already cached, usually
+ * for deletion.
+ */
+bool
+ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup)
+{
+ HeapTuple tuple;
+
+ tuple = ccache_find_tuple(ccache->root_chunk, &oldtup->t_self,
+ NoMovementScanDirection);
+ if (!tuple)
+ return false;
+
+ copy_tuple_properties(tuple, oldtup);
+
+ return true;
+}
+
+/*
+ * ccache_merge_chunk
+ *
+ * It merges two chunks if these have enough free space to consolidate
+ * its contents into one.
+ */
+static void
+ccache_merge_chunk(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ ccache_chunk *curr;
+ ccache_chunk **upper;
+ int *p_depth;
+ int i;
+ bool needs_rebalance = false;
+
+ /* find the least right node that has no left node */
+ upper = &cchunk->right;
+ p_depth = &cchunk->r_depth;
+ curr = cchunk->right;
+ while (curr != NULL)
+ {
+ if (!curr->left)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ long total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if ((long)offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *rchunk = curr->right;
+
+ /* merge contents */
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ cchunk->usage -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+
+ cchunk->tuples[cchunk->ntups++] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->right;
+ *p_depth = curr->r_depth;
+ if (rchunk)
+ rchunk->upper = curr->upper;
+
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->left;
+ p_depth = &curr->l_depth;
+ curr = curr->left;
+ }
+
+ /* find the greatest left node that has no right node */
+ upper = &cchunk->left;
+ p_depth = &cchunk->l_depth;
+ curr = cchunk->left;
+
+ while (curr != NULL)
+ {
+ if (!curr->right)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ long total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if ((long)offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *lchunk = curr->left;
+ Size offset;
+
+ /* merge contents */
+ memmove((char *)cchunk + cchunk->usage - shift,
+ (char *)cchunk + cchunk->usage,
+ shmseg_blocksize - cchunk->usage);
+ for (i=cchunk->ntups - 1; i >= 0; i--)
+ {
+ HeapTuple temp
+ = (HeapTuple)((char *)cchunk->tuples[i] - shift);
+
+ cchunk->tuples[curr->ntups + i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp +
+ HEAPTUPLESIZE);
+ }
+ cchunk->usage -= shift;
+ cchunk->ntups += curr->ntups;
+
+ /* merge contents */
+ offset = shmseg_blocksize;
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ offset -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + offset);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+ cchunk->tuples[i] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->left;
+ *p_depth = curr->l_depth;
+ if (lchunk)
+ lchunk->upper = curr->upper;
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->right;
+ p_depth = &curr->r_depth;
+ curr = curr->right;
+ }
+ /* Rebalance the tree, if needed */
+ if (needs_rebalance)
+ ccache_rebalance_tree(ccache, cchunk);
+}
+
+/*
+ * ccache_vacuum_page
+ *
+ * It reclaims the tuples being already vacuumed. It shall be kicked on
+ * the callback function of heap_page_prune_hook to synchronize contents
+ * of the cache with on-disk image.
+ */
+static void
+ccache_vacuum_tuple(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ ItemPointer ctid)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+
+ if (cchunk->ntups == 0)
+ return;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (cchunk->left)
+ ccache_vacuum_tuple(ccache, cchunk->left, ctid);
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (cchunk->right)
+ ccache_vacuum_tuple(ccache, cchunk->right, ctid);
+ }
+ else
+ {
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ HeapTuple tuple = cchunk->tuples[i_min];
+ int length = MAXALIGN(HEAPTUPLESIZE + tuple->t_len);
+
+ if (i_min < cchunk->ntups - 1)
+ {
+ int j;
+
+ memmove((char *)cchunk + cchunk->usage + length,
+ (char *)cchunk + cchunk->usage,
+ (Size)tuple - ((Size)cchunk + cchunk->usage));
+ for (j=i_min + 1; j < cchunk->ntups; j++)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[j] + length);
+ cchunk->tuples[j-1] = temp;
+ temp->t_data
+ = (HeapTupleHeader)((char *)temp->t_data + length);
+ }
+ }
+ cchunk->usage += length;
+ cchunk->ntups--;
+ }
+ }
+ /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);
+}
+
+void
+ccache_vacuum_page(ccache_head *ccache, Buffer buffer)
+{
+ /* XXX it needs buffer is valid and pinned */
+ BlockNumber blknum = BufferGetBlockNumber(buffer);
+ Page page = BufferGetPage(buffer);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber offnum;
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemPointerData ctid;
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsNormal(itemid))
+ continue;
+
+ ItemPointerSetBlockNumber(&ctid, blknum);
+ ItemPointerSetOffsetNumber(&ctid, offnum);
+
+ ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
+ }
+}
+
+static void
+ccache_release_all_chunks(ccache_chunk *cchunk)
+{
+ if (cchunk->left)
+ ccache_release_all_chunks(cchunk->left);
+ if (cchunk->right)
+ ccache_release_all_chunks(cchunk->right);
+ cs_free_shmblock(cchunk);
+}
+
+static void
+track_ccache_locally(ccache_head *ccache)
+{
+ ccache_entry *entry;
+ dlist_node *dnode;
+
+ if (dlist_is_empty(&ccache_free_list))
+ {
+ int i;
+
+ PG_TRY();
+ {
+ for (i=0; i < 20; i++)
+ {
+ entry = MemoryContextAlloc(TopMemoryContext,
+ sizeof(ccache_entry));
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+ PG_CATCH();
+ {
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+ dnode = dlist_pop_head_node(&ccache_free_list);
+ entry = dlist_container(ccache_entry, chain, dnode);
+ entry->owner = CurrentResourceOwner;
+ entry->ccache = ccache;
+ dlist_push_tail(&ccache_local_list, &entry->chain);
+}
+
+static void
+untrack_ccache_locally(ccache_head *ccache)
+{
+ dlist_mutable_iter iter;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->ccache == ccache &&
+ entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ return;
+ }
+ }
+}
+
+static void
+cs_put_ccache_nolock(ccache_head *ccache)
+{
+ Assert(ccache->refcnt > 0);
+ if (--ccache->refcnt == 0)
+ {
+ ccache_release_all_chunks(ccache->root_chunk);
+ dlist_delete(&ccache->hash_chain);
+ dlist_delete(&ccache->lru_chain);
+ dlist_push_head(&cs_ccache_hash->free_list, &ccache->hash_chain);
+ }
+ untrack_ccache_locally(ccache);
+}
+
+void
+cs_put_ccache(ccache_head *cache)
+{
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(cache);
+ SpinLockRelease(&cs_ccache_hash->lock);
+}
+
+static ccache_head *
+cs_create_ccache(Oid tableoid, Bitmapset *attrs_used)
+{
+ ccache_head *temp;
+ ccache_head *new_cache;
+ dlist_node *dnode;
+ int i;
+
+ /*
+ * Here is no columnar cache of this relation or cache attributes are
+ * not enough to run the required query. So, it tries to create a new
+ * ccache_head for the upcoming cache-scan.
+ * Also allocate ones, if we have no free ccache_head any more.
+ */
+ if (dlist_is_empty(&cs_ccache_hash->free_list))
+ {
+ char *buffer;
+ int offset;
+ int nwords, size;
+
+ buffer = cs_alloc_shmblock();
+ if (!buffer)
+ return NULL;
+
+ nwords = (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+ size = MAXALIGN(offsetof(ccache_head,
+ attrs_used.words[nwords + 1]));
+ for (offset = 0; offset <= shmseg_blocksize - size; offset += size)
+ {
+ temp = (ccache_head *)(buffer + offset);
+
+ dlist_push_tail(&cs_ccache_hash->free_list, &temp->hash_chain);
+ }
+ }
+ dnode = dlist_pop_head_node(&cs_ccache_hash->free_list);
+ new_cache = dlist_container(ccache_head, hash_chain, dnode);
+
+ i = cs_ccache_hash->lwlocks_usage++ % ccache_hash_size;
+ new_cache->lock = cs_ccache_hash->lwlocks[i];
+ new_cache->refcnt = 2;
+ new_cache->status = CCACHE_STATUS_INITIALIZED;
+
+ new_cache->tableoid = tableoid;
+ new_cache->root_chunk = ccache_alloc_chunk(new_cache, NULL);
+ if (!new_cache->root_chunk)
+ {
+ dlist_push_head(&cs_ccache_hash->free_list, &new_cache->hash_chain);
+ return NULL;
+ }
+
+ if (attrs_used)
+ memcpy(&new_cache->attrs_used, attrs_used,
+ offsetof(Bitmapset, words[attrs_used->nwords]));
+ else
+ {
+ new_cache->attrs_used.nwords = 1;
+ new_cache->attrs_used.words[0] = 0;
+ }
+ return new_cache;
+}
+
+ccache_head *
+cs_get_ccache(Oid tableoid, Bitmapset *attrs_used, bool create_on_demand)
+{
+ Datum hash = hash_any((unsigned char *)&tableoid, sizeof(Oid));
+ Index i = hash % ccache_hash_size;
+ dlist_iter iter;
+ ccache_head *old_cache = NULL;
+ ccache_head *new_cache = NULL;
+ ccache_head *temp;
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ PG_TRY();
+ {
+ /*
+ * Try to find out existing ccache that has all the columns being
+ * referenced in this query.
+ */
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ temp = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ if (tableoid != temp->tableoid)
+ continue;
+
+ if (bms_is_subset(attrs_used, &temp->attrs_used))
+ {
+ temp->refcnt++;
+ if (create_on_demand)
+ dlist_move_head(&cs_ccache_hash->lru_list,
+ &temp->lru_chain);
+ new_cache = temp;
+ goto out_unlock;
+ }
+ old_cache = temp;
+ break;
+ }
+
+ if (create_on_demand)
+ {
+ if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
+
+ new_cache = cs_create_ccache(tableoid, attrs_used);
+ if (!new_cache)
+ goto out_unlock;
+
+ dlist_push_head(&cs_ccache_hash->slots[i], &new_cache->hash_chain);
+ dlist_push_head(&cs_ccache_hash->lru_list, &new_cache->lru_chain);
+ if (old_cache)
+ cs_put_ccache_nolock(old_cache);
+ }
+ }
+ PG_CATCH();
+ {
+ SpinLockRelease(&cs_ccache_hash->lock);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+out_unlock:
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ if (new_cache)
+ track_ccache_locally(new_cache);
+
+ return new_cache;
+}
+
+typedef struct {
+ Oid tableoid;
+ int status;
+ ccache_chunk *cchunk;
+ ccache_chunk *upper;
+ ccache_chunk *right;
+ ccache_chunk *left;
+ int r_depth;
+ int l_depth;
+ uint32 ntups;
+ uint32 usage;
+ ItemPointerData min_ctid;
+ ItemPointerData max_ctid;
+} ccache_status;
+
+static List *
+cache_scan_debuginfo_internal(ccache_head *ccache,
+ ccache_chunk *cchunk, List *result)
+{
+ ccache_status *cstatus = palloc0(sizeof(ccache_status));
+ List *temp;
+
+ if (cchunk->left)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->left, NIL);
+ result = list_concat(result, temp);
+ }
+ cstatus->tableoid = ccache->tableoid;
+ cstatus->status = ccache->status;
+ cstatus->cchunk = cchunk;
+ cstatus->upper = cchunk->upper;
+ cstatus->right = cchunk->right;
+ cstatus->left = cchunk->left;
+ cstatus->r_depth = cchunk->r_depth;
+ cstatus->l_depth = cchunk->l_depth;
+ cstatus->ntups = cchunk->ntups;
+ cstatus->usage = cchunk->usage;
+ if (cchunk->ntups > 0)
+ {
+ ItemPointerCopy(&cchunk->tuples[0]->t_self,
+ &cstatus->min_ctid);
+ ItemPointerCopy(&cchunk->tuples[cchunk->ntups - 1]->t_self,
+ &cstatus->max_ctid);
+ }
+ else
+ {
+ ItemPointerSet(&cstatus->min_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ ItemPointerSet(&cstatus->max_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ }
+ result = lappend(result, cstatus);
+
+ if (cchunk->right)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->right, NIL);
+ result = list_concat(result, temp);
+ }
+ return result;
+}
+
+/*
+ * cache_scan_debuginfo
+ *
+ * It shows the current status of ccache_chunks being allocated.
+ */
+Datum
+cache_scan_debuginfo(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *fncxt;
+ List *cstatus_list;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ TupleDesc tupdesc;
+ MemoryContext oldcxt;
+ int i;
+ dlist_iter iter;
+ List *result = NIL;
+
+ fncxt = SRF_FIRSTCALL_INIT();
+ oldcxt = MemoryContextSwitchTo(fncxt->multi_call_memory_ctx);
+
+ /* make definition of tuple-descriptor */
+ tupdesc = CreateTemplateTupleDesc(12, false);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "tableoid",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 2, "status",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 3, "chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 4, "upper",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 5, "l_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 6, "l_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 7, "r_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 8, "r_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 9, "ntuples",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)10, "usage",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)11, "min_ctid",
+ TIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)12, "max_ctid",
+ TIDOID, -1, 0);
+ fncxt->tuple_desc = BlessTupleDesc(tupdesc);
+
+ /* make a snapshot of the current table cache */
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ for (i=0; i < ccache_hash_size; i++)
+ {
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ ccache_head *ccache
+ = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ ccache->refcnt++;
+ SpinLockRelease(&cs_ccache_hash->lock);
+ track_ccache_locally(ccache);
+
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ result = cache_scan_debuginfo_internal(ccache,
+ ccache->root_chunk,
+ result);
+ LWLockRelease(ccache->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(ccache);
+ }
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ fncxt->user_fctx = result;
+ MemoryContextSwitchTo(oldcxt);
+ }
+ fncxt = SRF_PERCALL_SETUP();
+
+ cstatus_list = (List *)fncxt->user_fctx;
+ if (cstatus_list != NIL &&
+ fncxt->call_cntr < cstatus_list->length)
+ {
+ ccache_status *cstatus = list_nth(cstatus_list, fncxt->call_cntr);
+ Datum values[12];
+ bool isnull[12];
+ HeapTuple tuple;
+
+ memset(isnull, false, sizeof(isnull));
+ values[0] = ObjectIdGetDatum(cstatus->tableoid);
+ if (cstatus->status == CCACHE_STATUS_INITIALIZED)
+ values[1] = CStringGetTextDatum("initialized");
+ else if (cstatus->status == CCACHE_STATUS_IN_PROGRESS)
+ values[1] = CStringGetTextDatum("in-progress");
+ else if (cstatus->status == CCACHE_STATUS_CONSTRUCTED)
+ values[1] = CStringGetTextDatum("constructed");
+ else
+ values[1] = CStringGetTextDatum("unknown");
+ values[2] = CStringGetTextDatum(psprintf("%p", cstatus->cchunk));
+ values[3] = CStringGetTextDatum(psprintf("%p", cstatus->upper));
+ values[4] = Int32GetDatum(cstatus->l_depth);
+ values[5] = CStringGetTextDatum(psprintf("%p", cstatus->left));
+ values[6] = Int32GetDatum(cstatus->r_depth);
+ values[7] = CStringGetTextDatum(psprintf("%p", cstatus->right));
+ values[8] = Int32GetDatum(cstatus->ntups);
+ values[9] = Int32GetDatum(cstatus->usage);
+
+ if (ItemPointerIsValid(&cstatus->min_ctid))
+ values[10] = PointerGetDatum(&cstatus->min_ctid);
+ else
+ isnull[10] = true;
+ if (ItemPointerIsValid(&cstatus->max_ctid))
+ values[11] = PointerGetDatum(&cstatus->max_ctid);
+ else
+ isnull[11] = true;
+
+ tuple = heap_form_tuple(fncxt->tuple_desc, values, isnull);
+
+ SRF_RETURN_NEXT(fncxt, HeapTupleGetDatum(tuple));
+ }
+ SRF_RETURN_DONE(fncxt);
+}
+PG_FUNCTION_INFO_V1(cache_scan_debuginfo);
+
+/*
+ * cs_alloc_shmblock
+ *
+ * It allocates a fixed-length block. The reason why this routine does not
+ * support variable length allocation is to simplify the logic for its purpose.
+ */
+static void *
+cs_alloc_shmblock(void)
+{
+ ccache_head *ccache;
+ dlist_node *dnode;
+ shmseg_block *block;
+ void *address = NULL;
+ int retry = 2;
+
+do_retry:
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ if (dlist_is_empty(&cs_shmseg_head->free_list) && retry-- > 0)
+ {
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ if (!dlist_is_empty(&cs_ccache_hash->lru_list))
+ {
+ dnode = dlist_tail_node(&cs_ccache_hash->lru_list);
+ ccache = dlist_container(ccache_head, lru_chain, dnode);
+
+ cs_put_ccache_nolock(ccache);
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ goto do_retry;
+ }
+
+ if (!dlist_is_empty(&cs_shmseg_head->free_list))
+ {
+ dnode = dlist_pop_head_node(&cs_shmseg_head->free_list);
+ block = dlist_container(shmseg_block, chain, dnode);
+
+ memset(&block->chain, 0, sizeof(dlist_node));
+
+ address = (void *) block->address;
+ }
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ return address;
+}
+
+/*
+ * cs_free_shmblock
+ *
+ * It release a block being allocated by cs_alloc_shmblock
+ */
+static void
+cs_free_shmblock(void *address)
+{
+ Size curr = (Size) address;
+ Size base = cs_shmseg_head->base_address;
+ ulong index;
+ shmseg_block *block;
+
+ Assert((curr - base) % shmseg_blocksize == 0);
+ Assert(curr >= base && curr < base + shmseg_num_blocks * shmseg_blocksize);
+ index = (curr - base) / shmseg_blocksize;
+
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ block = &cs_shmseg_head->blocks[index];
+
+ dlist_push_head(&cs_shmseg_head->free_list, &block->chain);
+
+ SpinLockRelease(&cs_shmseg_head->lock);
+}
+
+static void
+ccache_setup(void)
+{
+ Size curr_address;
+ ulong i;
+ bool found;
+
+ /* allocation of a shared memory segment for table's hash */
+ cs_ccache_hash = ShmemInitStruct("cache_scan: hash of columnar cache",
+ MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(LWLockId) *
+ ccache_hash_size) +
+ MAXALIGN(sizeof(dlist_node) *
+ ccache_hash_size),
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_ccache_hash->lock);
+ dlist_init(&cs_ccache_hash->lru_list);
+ dlist_init(&cs_ccache_hash->free_list);
+ cs_ccache_hash->lwlocks = (void *)(&cs_ccache_hash[1]);
+ cs_ccache_hash->slots
+ = (void *)(&cs_ccache_hash->lwlocks[ccache_hash_size]);
+
+ for (i=0; i < ccache_hash_size; i++)
+ cs_ccache_hash->lwlocks[i] = LWLockAssign();
+ for (i=0; i < ccache_hash_size; i++)
+ dlist_init(&cs_ccache_hash->slots[i]);
+
+ /* allocation of a shared memory segment for columnar cache */
+ cs_shmseg_head = ShmemInitStruct("cache_scan: columnar cache",
+ offsetof(shmseg_head,
+ blocks[shmseg_num_blocks]) +
+ shmseg_num_blocks * shmseg_blocksize,
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_shmseg_head->lock);
+ dlist_init(&cs_shmseg_head->free_list);
+
+ curr_address = MAXALIGN(&cs_shmseg_head->blocks[shmseg_num_blocks]);
+
+ cs_shmseg_head->base_address = curr_address;
+ for (i=0; i < shmseg_num_blocks; i++)
+ {
+ shmseg_block *block = &cs_shmseg_head->blocks[i];
+
+ block->address = curr_address;
+ dlist_push_tail(&cs_shmseg_head->free_list, &block->chain);
+
+ curr_address += shmseg_blocksize;
+ }
+}
+
+void
+ccache_init(void)
+{
+ /* setup GUC variables */
+ DefineCustomIntVariable("cache_scan.block_size",
+ "block size of in-memory columnar cache",
+ NULL,
+ &shmseg_blocksize,
+ 2048 * 1024, /* 2MB */
+ 1024 * 1024, /* 1MB */
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+ if ((shmseg_blocksize & (shmseg_blocksize - 1)) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cache_scan.block_size must be power of 2")));
+
+ DefineCustomIntVariable("cache_scan.num_blocks",
+ "number of in-memory columnar cache blocks",
+ NULL,
+ &shmseg_num_blocks,
+ 64,
+ 64,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.hash_size",
+ "number of hash slots for columnar cache",
+ NULL,
+ &ccache_hash_size,
+ 128,
+ 128,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.max_cached_attnum",
+ "max attribute number we can cache",
+ NULL,
+ &max_cached_attnum,
+ 256,
+ sizeof(bitmapword) * BITS_PER_BYTE,
+ 2048,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* request shared memory segment for table's cache */
+ RequestAddinShmemSpace(MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(dlist_head) * ccache_hash_size) +
+ MAXALIGN(sizeof(LWLockId) * ccache_hash_size) +
+ MAXALIGN(offsetof(shmseg_head,
+ blocks[shmseg_num_blocks])) +
+ shmseg_num_blocks * shmseg_blocksize);
+ RequestAddinLWLocks(ccache_hash_size);
+
+ shmem_startup_next = shmem_startup_hook;
+ shmem_startup_hook = ccache_setup;
+
+ /* register resource-release callback */
+ dlist_init(&ccache_local_list);
+ dlist_init(&ccache_free_list);
+ RegisterResourceReleaseCallback(ccache_on_resource_release, NULL);
+}
diff --git a/contrib/cache_scan/cscan.c b/contrib/cache_scan/cscan.c
new file mode 100644
index 0000000..0a63c2e
--- /dev/null
+++ b/contrib/cache_scan/cscan.c
@@ -0,0 +1,761 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cscan.c
+ *
+ * An extension that offers an alternative way to scan a table utilizing column
+ * oriented database cache.
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "access/heapam.h"
+#include "access/relscan.h"
+#include "access/sysattr.h"
+#include "catalog/objectaccess.h"
+#include "catalog/pg_language.h"
+#include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
+#include "commands/trigger.h"
+#include "executor/nodeCustom.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/var.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/guc.h"
+#include "utils/spccache.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+#include "cache_scan.h"
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+/* Static variables */
+static add_scan_path_hook_type add_scan_path_next = NULL;
+static object_access_hook_type object_access_next = NULL;
+static heap_page_prune_hook_type heap_page_prune_next = NULL;
+
+static bool cache_scan_disabled;
+
+static bool
+cs_estimate_costs(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Relation rel,
+ CustomPath *cpath,
+ Bitmapset **attrs_used)
+{
+ ListCell *lc;
+ ccache_head *ccache;
+ Oid tableoid = RelationGetRelid(rel);
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ int total_width = 0;
+ int tuple_width = 0;
+ double hit_ratio;
+ Cost run_cost = 0.0;
+ Cost startup_cost = 0.0;
+ double tablespace_page_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ int i;
+
+ /* Mark the path with the correct row estimate */
+ if (cpath->path.param_info)
+ cpath->path.rows = cpath->path.param_info->ppi_rows;
+ else
+ cpath->path.rows = baserel->rows;
+
+ /* List up all the columns being in-use */
+ pull_varattnos((Node *) baserel->reltargetlist,
+ baserel->relid,
+ attrs_used);
+ foreach(lc, baserel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+ pull_varattnos((Node *) rinfo->clause,
+ baserel->relid,
+ attrs_used);
+ }
+
+ for (i=FirstLowInvalidHeapAttributeNumber + 1; i <= 0; i++)
+ {
+ int attidx = i - FirstLowInvalidHeapAttributeNumber;
+
+ if (bms_is_member(attidx, *attrs_used))
+ {
+ /* oid and whole-row reference is not supported */
+ if (i == ObjectIdAttributeNumber || i == InvalidAttrNumber)
+ return false;
+
+ /* clear system attributes from the bitmap */
+ *attrs_used = bms_del_member(*attrs_used, attidx);
+ }
+ }
+
+ /*
+ * Because of layout on the shared memory segment, we have to restrict
+ * the largest attribute number in use to prevent overrun by growth of
+ * Bitmapset.
+ */
+ if (*attrs_used &&
+ (*attrs_used)->nwords > ccache_max_attribute_number())
+ return false;
+
+ /*
+ * Estimation of average width of cached tuples - it does not make
+ * sense to construct a new cache if its average width is more than
+ * 30% of the raw data.
+ */
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ Form_pg_attribute attr = tupdesc->attrs[i];
+ int attidx = i + 1 - FirstLowInvalidHeapAttributeNumber;
+ int width;
+
+ if (attr->attlen > 0)
+ width = attr->attlen;
+ else
+ width = get_attavgwidth(tableoid, attr->attnum);
+
+ total_width += width;
+ if (bms_is_member(attidx, *attrs_used))
+ tuple_width += width;
+ }
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), *attrs_used, false);
+ if (!ccache)
+ {
+ if ((double)tuple_width / (double)total_width > 0.3)
+ return false;
+ hit_ratio = 0.05;
+ }
+ else
+ {
+ hit_ratio = 0.95;
+ cs_put_ccache(ccache);
+ }
+
+ get_tablespace_page_costs(baserel->reltablespace,
+ NULL,
+ &tablespace_page_cost);
+ /* Disk costs */
+ run_cost += (1.0 - hit_ratio) * tablespace_page_cost * baserel->pages;
+
+ /* CPU costs */
+ get_restriction_qual_cost(root, baserel,
+ cpath->path.param_info,
+ &qpqual_cost);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+ run_cost += cpu_per_tuple * baserel->tuples;
+
+ cpath->path.startup_cost = startup_cost;
+ cpath->path.total_cost = startup_cost + run_cost;
+
+ return true;
+}
+
+/*
+ * cs_relation_has_synchronizer
+ *
+ * A table that can have columner-cache also needs to have trigger for
+ * synchronization, to ensure the on-memory cache keeps the latest contents
+ * of the heap. It returns TRUE, if supplied relation has triggers that
+ * invokes cache_scan_synchronizer on appropriate context. Elsewhere, FALSE
+ * shall be returned.
+ */
+static bool
+cs_relation_has_synchronizer(Relation rel)
+{
+ int i, numtriggers;
+ bool has_on_insert_synchronizer = false;
+ bool has_on_update_synchronizer = false;
+ bool has_on_delete_synchronizer = false;
+ bool has_on_truncate_synchronizer = false;
+
+ if (!rel->trigdesc)
+ return false;
+
+ numtriggers = rel->trigdesc->numtriggers;
+ for (i=0; i < numtriggers; i++)
+ {
+ Trigger *trig = rel->trigdesc->triggers + i;
+ HeapTuple tup;
+
+ if (!trig->tgenabled)
+ continue;
+
+ tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(trig->tgfoid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for function %u", trig->tgfoid);
+
+ if (((Form_pg_proc) GETSTRUCT(tup))->prolang == ClanguageId)
+ {
+ Datum value;
+ bool isnull;
+ char *prosrc;
+ char *probin;
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_prosrc, &isnull);
+ if (isnull)
+ elog(ERROR, "null prosrc for C function %u", trig->tgoid);
+ prosrc = TextDatumGetCString(value);
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_probin, &isnull);
+ if (isnull)
+ elog(ERROR, "null probin for C function %u", trig->tgoid);
+ probin = TextDatumGetCString(value);
+
+ if (strcmp(prosrc, "cache_scan_synchronizer") == 0 &&
+ strcmp(probin, "$libdir/cache_scan") == 0)
+ {
+ int16 tgtype = trig->tgtype;
+
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_INSERT))
+ has_on_insert_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_UPDATE))
+ has_on_update_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_DELETE))
+ has_on_delete_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_STATEMENT,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_TRUNCATE))
+ has_on_truncate_synchronizer = true;
+ }
+ pfree(prosrc);
+ pfree(probin);
+ }
+ ReleaseSysCache(tup);
+ }
+
+ if (has_on_insert_synchronizer &&
+ has_on_update_synchronizer &&
+ has_on_delete_synchronizer &&
+ has_on_truncate_synchronizer)
+ return true;
+ return false;
+}
+
+
+static void
+cs_add_scan_path(PlannerInfo *root,
+ RelOptInfo *baserel,
+ RangeTblEntry *rte)
+{
+ Relation rel;
+
+ /* call the secondary hook if exist */
+ if (add_scan_path_next)
+ (*add_scan_path_next)(root, baserel, rte);
+
+ /* Is this feature available now? */
+ if (cache_scan_disabled)
+ return;
+
+ /* Only regular tables can be cached */
+ if (baserel->reloptkind != RELOPT_BASEREL ||
+ rte->rtekind != RTE_RELATION)
+ return;
+
+ /* Core code should already acquire an appropriate lock */
+ rel = heap_open(rte->relid, NoLock);
+
+ if (cs_relation_has_synchronizer(rel))
+ {
+ CustomPath *cpath = makeNode(CustomPath);
+ Relids required_outer;
+ Bitmapset *attrs_used = NULL;
+
+ /*
+ * We don't support pushing join clauses into the quals of a ctidscan,
+ * but it could still have required parameterization due to LATERAL
+ * refs in its tlist.
+ */
+ required_outer = baserel->lateral_relids;
+
+ cpath->path.pathtype = T_CustomScan;
+ cpath->path.parent = baserel;
+ cpath->path.param_info = get_baserel_parampathinfo(root, baserel,
+ required_outer);
+ if (cs_estimate_costs(root, baserel, rel, cpath, &attrs_used))
+ {
+ cpath->custom_name = pstrdup("cache scan");
+ cpath->custom_flags = 0;
+ cpath->custom_private
+ = list_make1(makeString(bms_to_string(attrs_used)));
+
+ add_path(baserel, &cpath->path);
+ }
+ }
+ heap_close(rel, NoLock);
+}
+
+static void
+cs_init_custom_scan_plan(PlannerInfo *root,
+ CustomScan *cscan_plan,
+ CustomPath *cscan_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ List *quals = NIL;
+ ListCell *lc;
+
+ /* should be a base relation */
+ Assert(cscan_path->path.parent->relid > 0);
+ Assert(cscan_path->path.parent->rtekind == RTE_RELATION);
+
+ /* extract the supplied RestrictInfo */
+ foreach (lc, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst(lc);
+ quals = lappend(quals, rinfo->clause);
+ }
+
+ /* do nothing something special pushing-down */
+ cscan_plan->scan.plan.targetlist = tlist;
+ cscan_plan->scan.plan.qual = quals;
+ cscan_plan->custom_private = cscan_path->custom_private;
+}
+
+typedef struct
+{
+ ccache_head *ccache;
+ ItemPointerData curr_ctid;
+ bool normal_seqscan;
+ bool with_construction;
+} cs_state;
+
+static void
+cs_begin_custom_scan(CustomScanState *node, int eflags)
+{
+ CustomScan *cscan = (CustomScan *)node->ss.ps.plan;
+ Relation rel = node->ss.ss_currentRelation;
+ EState *estate = node->ss.ps.state;
+ HeapScanDesc scandesc = NULL;
+ cs_state *csstate;
+ Bitmapset *attrs_used;
+ ccache_head *ccache;
+
+ csstate = palloc0(sizeof(cs_state));
+
+ attrs_used = bms_from_string(strVal(linitial(cscan->custom_private)));
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), attrs_used, true);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ if (ccache->status != CCACHE_STATUS_CONSTRUCTED)
+ {
+ LWLockRelease(ccache->lock);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status == CCACHE_STATUS_INITIALIZED)
+ {
+ ccache->status = CCACHE_STATUS_IN_PROGRESS;
+ csstate->with_construction = true;
+ scandesc = heap_beginscan(rel, SnapshotAny, 0, NULL);
+ }
+ else if (ccache->status == CCACHE_STATUS_IN_PROGRESS)
+ {
+ csstate->normal_seqscan = true;
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ }
+ }
+ LWLockRelease(ccache->lock);
+ csstate->ccache = ccache;
+
+ /* seek to the first position */
+ if (estate->es_direction == ForwardScanDirection)
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, 0);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, 0);
+ }
+ else
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, MaxBlockNumber);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, MaxOffsetNumber);
+ }
+ }
+ else
+ {
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ csstate->normal_seqscan = true;
+ }
+ node->ss.ss_currentScanDesc = scandesc;
+
+ node->custom_state = csstate;
+}
+
+/*
+ * cache_scan_needs_next
+ *
+ * We may fetch a tuple to be invisible because columner cache stores
+ * all the living tuples, including ones updated / deleted by concurrent
+ * sessions. So, it is a job of the caller to check MVCC visibility.
+ * It decides whether we need to move the next tuple due to the visibility
+ * condition, or not. If given tuple was NULL, it is obviously a time to
+ * break searching because it means no more tuples on the cache.
+ */
+static bool
+cache_scan_needs_next(HeapTuple tuple, Snapshot snapshot, Buffer buffer)
+{
+ bool visibility;
+
+ /* end of the scan */
+ if (!HeapTupleIsValid(tuple))
+ return false;
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ visibility = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ return !visibility ? true : false;
+}
+
+static TupleTableSlot *
+cache_scan_next(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+ Relation rel = node->ss.ss_currentRelation;
+ HeapScanDesc scan = node->ss.ss_currentScanDesc;
+ TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
+ EState *estate = node->ss.ps.state;
+ Snapshot snapshot = estate->es_snapshot;
+ HeapTuple tuple;
+ Buffer buffer;
+
+ /* in case of fallback path, we don't need to something special. */
+ if (csstate->normal_seqscan)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, scan->rs_cbuf, false);
+ else
+ ExecClearTuple(slot);
+ return slot;
+ }
+ Assert(csstate->ccache != NULL);
+
+ /* elsewhere, we either run or construct the columner cache */
+ do {
+ ccache_head *ccache = csstate->ccache;
+
+ /*
+ * "with_construction" means the columner cache is under construction,
+ * so we need to fetch a tuple from heap of the target relation and
+ * insert it into the cache. Note that we use SnapshotAny to fetch
+ * all the tuples both of visible and invisible ones, so it is our
+ * responsibility to check tuple visibility according to snapshot or
+ * the current estate.
+ * It is same even when we fetch tuples from the cache, without
+ * referencing heap buffer.
+ */
+ if (csstate->with_construction)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (HeapTupleIsValid(tuple))
+ {
+ if (ccache_insert_tuple(ccache, rel, tuple))
+ LWLockRelease(ccache->lock);
+ else
+ {
+ /*
+ * If ccache_insert_tuple got failed, it usually means
+ * lack of shared memory and unable to continue
+ * construction of the columner cacher.
+ * So, we put is twice to reset its reference counter
+ * to zero and release shared memory blocks.
+ */
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ cs_put_ccache(ccache);
+ csstate->ccache = NULL;
+ }
+ }
+ else
+ {
+ /*
+ * If we reached end of the relation, it means the columner-
+ * cache become constructed.
+ */
+ ccache->status = CCACHE_STATUS_CONSTRUCTED;
+ LWLockRelease(ccache->lock);
+ }
+ buffer = scan->rs_cbuf;
+ }
+ else
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ tuple = ccache_find_tuple(ccache->root_chunk,
+ &csstate->curr_ctid,
+ estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ {
+ ItemPointerCopy(&tuple->t_self, &csstate->curr_ctid);
+ tuple = heap_copytuple(tuple);
+ }
+ LWLockRelease(ccache->lock);
+ buffer = InvalidBuffer;
+ }
+ } while (cache_scan_needs_next(tuple, snapshot, buffer));
+
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, buffer, buffer == InvalidBuffer);
+ else
+ ExecClearTuple(slot);
+
+ return slot;
+}
+
+static bool
+cache_scan_recheck(CustomScanState *node, TupleTableSlot *slot)
+{
+ return true;
+}
+
+static TupleTableSlot *
+cs_exec_custom_scan(CustomScanState *node)
+{
+ return ExecScan((ScanState *) node,
+ (ExecScanAccessMtd) cache_scan_next,
+ (ExecScanRecheckMtd) cache_scan_recheck);
+}
+
+static void
+cs_end_custom_scan(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+
+ if (csstate->ccache)
+ {
+ ccache_head *ccache = csstate->ccache;
+ bool needs_remove = false;
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status == CCACHE_STATUS_IN_PROGRESS)
+ needs_remove = true;
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ if (needs_remove)
+ cs_put_ccache(ccache);
+ }
+ if (node->ss.ss_currentScanDesc)
+ heap_endscan(node->ss.ss_currentScanDesc);
+}
+
+static void
+cs_rescan_custom_scan(CustomScanState *node)
+{
+ elog(ERROR, "not implemented yet");
+}
+
+/*
+ * cache_scan_synchronizer
+ *
+ * trigger function to synchronize the columner-cache with heap contents.
+ */
+Datum
+cache_scan_synchronizer(PG_FUNCTION_ARGS)
+{
+ TriggerData *trigdata = (TriggerData *) fcinfo->context;
+ Relation rel = trigdata->tg_relation;
+ HeapTuple tuple = trigdata->tg_trigtuple;
+ HeapTuple newtup = trigdata->tg_newtuple;
+ HeapTuple result = NULL;
+ const char *tg_name = trigdata->tg_trigger->tgname;
+ ccache_head *ccache;
+
+ if (!CALLED_AS_TRIGGER(fcinfo))
+ elog(ERROR, "%s: not fired by trigger manager", tg_name);
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), NULL, false);
+ if (!ccache)
+ return PointerGetDatum(newtup);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ PG_TRY();
+ {
+ TriggerEvent tg_event = trigdata->tg_event;
+
+ if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_INSERT(tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, tuple);
+ result = tuple;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_UPDATE(tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, newtup);
+ ccache_delete_tuple(ccache, tuple);
+ result = newtup;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_DELETE(tg_event))
+ {
+ ccache_delete_tuple(ccache, tuple);
+ result = tuple;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_STATEMENT(tg_event) &&
+ TRIGGER_FIRED_BY_TRUNCATE(tg_event))
+ {
+ if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
+ }
+ else
+ elog(ERROR, "%s: fired by unexpected context (%08x)",
+ tg_name, tg_event);
+ }
+ PG_CATCH();
+ {
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+
+ PG_RETURN_POINTER(result);
+}
+PG_FUNCTION_INFO_V1(cache_scan_synchronizer);
+
+/*
+ * ccache_on_object_access
+ *
+ * It dropps an existing columner-cache if the cached table was altered or
+ * dropped.
+ */
+static void
+ccache_on_object_access(ObjectAccessType access,
+ Oid classId,
+ Oid objectId,
+ int subId,
+ void *arg)
+{
+ ccache_head *ccache;
+
+ /* ALTER TABLE and DROP TABLE needs cache invalidation */
+ if (access != OAT_DROP && access != OAT_POST_ALTER)
+ return;
+ if (classId != RelationRelationId)
+ return;
+
+ ccache = cs_get_ccache(objectId, NULL, false);
+ if (!ccache)
+ return;
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+}
+
+/*
+ * ccache_on_page_prune
+ *
+ * It is a callback function when a particular heap block got vacuumed.
+ * On vacuuming, its dead space, being allocated by dead tuples, got
+ * reclaimed and tuple's location was ought to be moved.
+ * This routine also reclaims the space by dead tuples on the columner
+ * cache according to layout changes on the heap.
+ */
+static void
+ccache_on_page_prune(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid)
+{
+ ccache_head *ccache;
+
+ /* call the secondary hook */
+ if (heap_page_prune_next)
+ (*heap_page_prune_next)(relation, buffer, ndeleted,
+ OldestXmin, latestRemovedXid);
+
+ ccache = cs_get_ccache(RelationGetRelid(relation), NULL, false);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ ccache_vacuum_page(ccache, buffer);
+
+ LWLockRelease(ccache->lock);
+
+ cs_put_ccache(ccache);
+ }
+}
+
+void
+_PG_init(void)
+{
+ CustomProvider provider;
+
+ if (IsUnderPostmaster)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cache_scan must be loaded via shared_preload_libraries")));
+
+ DefineCustomBoolVariable("cache_scan.disabled",
+ "turn on/off cache_scan feature on run-time",
+ NULL,
+ &cache_scan_disabled,
+ false,
+ PGC_USERSET,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* initialization of cache subsystem */
+ ccache_init();
+
+ /* callbacks for cache invalidation */
+ object_access_next = object_access_hook;
+ object_access_hook = ccache_on_object_access;
+
+ heap_page_prune_next = heap_page_prune_hook;
+ heap_page_prune_hook = ccache_on_page_prune;
+
+ /* registration of custom scan provider */
+ add_scan_path_next = add_scan_path_hook;
+ add_scan_path_hook = cs_add_scan_path;
+
+ memset(&provider, 0, sizeof(provider));
+ strncpy(provider.name, "cache scan", sizeof(provider.name));
+ provider.InitCustomScanPlan = cs_init_custom_scan_plan;
+ provider.BeginCustomScan = cs_begin_custom_scan;
+ provider.ExecCustomScan = cs_exec_custom_scan;
+ provider.EndCustomScan = cs_end_custom_scan;
+ provider.ReScanCustomScan = cs_rescan_custom_scan;
+
+ register_custom_provider(&provider);
+}
diff --git a/doc/src/sgml/cache-scan.sgml b/doc/src/sgml/cache-scan.sgml
new file mode 100644
index 0000000..c4cc165
--- /dev/null
+++ b/doc/src/sgml/cache-scan.sgml
@@ -0,0 +1,224 @@
+<!-- doc/src/sgml/cache-scan.sgml -->
+
+<sect1 id="cache-scan" xreflabel="cache-scan">
+ <title>cache-scan</title>
+
+ <indexterm zone="cache-scan">
+ <primary>cache-scan</primary>
+ </indexterm>
+
+ <sect2>
+ <title>Overview</title>
+ <para>
+ The <filename>cache-scan</> module provides an alternative way to scan
+ relations using on-memory columner cache, instead of usual heap scan,
+ in case when previous scan already holds contents of the table on the
+ cache.
+ Unlike buffer cache, it holds contents of the limited number of columns,
+ but not whole of the record, thus it allows to hold larger number of records
+ per same amount of RAM. Probably, this characteristic makes sense to run
+ analytic queries on a table with many columns and records.
+ </para>
+ <para>
+ Once this module gets loaded, it registers itself as a custom-scan provider.
+ It allows to provide an additional scan path on regular relations using
+ on-memory columner cache, instead of regular heap scan.
+ It also performs as a proof-of-concept implementation that works on
+ the custom-scan API that enables to extend the core executor system.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Installation</title>
+ <para>
+ This module has to be loaded using
+ <xref linkend="guc-shared-preload-libraries"> parameter to acquired
+ a particular amount of shared memory on startup time.
+ In addition, the relation to be cached has special triggers, called
+ synchronizer, are implemented with <literal>cache_scan_synchronizer</>
+ function that synchronizes the cache contents according to the latest
+ heap on <command>INSERT</>, <command>UPDATE</>, <command>DELETE</> or
+ <command>TRUNCATE</>.
+ </para>
+ <para>
+ You can run this extension according to the following steps.
+ </para>
+ <procedure>
+ <step>
+ <para>
+ Adjust <xref linkend="guc-shared-preload-libraries"> parameter to
+ load <filename>cache_scan</> binary on startup time, then restart
+ the postmaster.
+ </para>
+ </step>
+ <step>
+ <para>
+ Run <xref linkend="sql-createextension"> to create synchronizer
+ function of <filename>cache_scan</>.
+<programlisting>
+CREATE EXTENSION cache_scan;
+</programlisting>
+ </para>
+ </step>
+ <step>
+ <para>
+ Create triggers of synchronizer on the target relation.
+<programlisting>
+CREATE TRIGGER t1_cache_row_sync
+ AFTER INSERT OR UPDATE OR DELETE ON t1 FOR ROW
+ EXECUTE PROCEDURE cache_scan_synchronizer();
+CREATE TRIGGER t1_cache_stmt_sync
+ AFTER TRUNCATE ON t1 FOR STATEMENT
+ EXECUTE PROCEDURE cache_scan_synchronizer();
+</programlisting>
+ </para>
+ </step>
+ </procedure>
+ </sect2>
+
+ <sect2>
+ <title>How does it works</title>
+ <para>
+ This module performs according to the usual fashion of
+ <xref linkend="custom-scan">.
+ It offers an alternative way to scan a relation if relation has synchronizer
+ triggers and width of referenced columns are less than 30% of average
+ record width.
+ Then, query optimizer will pick up the cheapest path. If the path chosen
+ is a custom-scan path managed by <filename>cache_scan</>, it runs on the
+ target relation using columner cache.
+ On the first time running, it tries to construct relation's cache along
+ with regular sequential scan. Next time or later, it can run on
+ the columner cache without referencing the heap.
+ </para>
+ <para>
+ You can check whether the query plan uses <filename>cache_scan</> using
+ <xref linkend="sql-explain"> command, as follows:
+<programlisting>
+postgres=# EXPLAIN (costs off) SELECT a,b FROM t1 WHERE b < pi();
+ QUERY PLAN
+----------------------------------------------------
+ Custom Scan (cache scan) on t1
+ Filter: (b < 3.14159265358979::double precision)
+(2 rows)
+</programlisting>
+ </para>
+ <para>
+ A columner cache, associated with a particular relation, has one or more chunks
+ that performs as node or leaf of t-tree structure.
+ The <literal>cache_scan_debuginfo()</> function can dump useful informationl;
+ properties of all the active chunks as follows.
+<programlisting>
+postgres=# SELECT * FROM cache_scan_debuginfo();
+ tableoid | status | chunk | upper | l_depth | l_chunk | r_depth | r_chunk | ntuples | usage | min_ctid | max_ct
+id
+----------+-------------+----------------+----------------+---------+----------------+---------+----------------+---------+---------+-----------+-----------
+ 16400 | constructed | 0x7f2b8ad84740 | 0x7f2b8af84740 | 0 | (nil) | 0 | (nil) | 29126 | 233088 | (0,1) | (677,15)
+ 16400 | constructed | 0x7f2b8af84740 | (nil) | 1 | 0x7f2b8ad84740 | 2 | 0x7f2b8b384740 | 29126 | 233088 | (677,16) | (1354,30)
+ 16400 | constructed | 0x7f2b8b184740 | 0x7f2b8b384740 | 0 | (nil) | 0 | (nil) | 29126 | 233088 | (1354,31) | (2032,2)
+ 16400 | constructed | 0x7f2b8b384740 | 0x7f2b8af84740 | 1 | 0x7f2b8b184740 | 1 | 0x7f2b8b584740 | 29126 | 233088 | (2032,3) | (2709,33)
+ 16400 | constructed | 0x7f2b8b584740 | 0x7f2b8b384740 | 0 | (nil) | 0 | (nil) | 3478 | 1874560 | (2709,34) | (2790,28)
+(5 rows)
+</programlisting>
+ </para>
+ <para>
+ All the cached tuples are indexed with <literal>ctid</> order, and each chunk has
+ an array of partial tuples with min- and max- values. Its left node is linked to
+ the chunks that have tuples with smaller <literal>ctid</>, and its right node is
+ linked to the chunks that have larger ones.
+ It enables to find out tuples in timely fashion when it needs to be invalidated
+ according to heap updates by DDL, DML or vacuuming.
+ </para>
+ <para>
+ The columner cache are not owned by a particular session, so it retains the cache
+ unless it does not dropped or postmaster does not restart.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>GUC Parameters</title>
+ <variablelist>
+ <varlistentry id="guc-cache-scan-block_size" xreflabel="cache_scan.block_size">
+ <term><varname>cache_scan.block_size</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.block_size</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls length of the block on shared memory segment
+ for the columner-cache. It needs to restart postmaster for validation.
+ </para>
+ <para>
+ <filename>cache_scan</> module acquires <literal>cache_scan.num_blocks</>
+ x <literal>cache_scan.block_size</> bytes of shared memory segment on
+ the startup time, then allocates them for columner cache on demand.
+ Too large block size damages flexibility of memory assignment, and
+ too small block size consumes much management are for each block.
+ So, we recommend to keep is as the default value; that is 2MB per block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-num_blocks" xreflabel="cache_scan.num_blocks">
+ <term><varname>cache_scan.num_blocks</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.num_blocks</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls number of the block on shared memory segment
+ for the columner-cache. It needs to restart postmaster for validation.
+ </para>
+ <para>
+ <filename>cache_scan</> module acquires <literal>cache_scan.num_blocks</>
+ x <literal>cache_scan.block_size</> bytes of shared memory segment on
+ the startup time, then allocates them for columner cache on demand.
+ Too small number of blocks damages flexibility of memory assignment
+ and may cause undesired cache dropping.
+ So, we recommend to set enough number of blocks to keep contents of
+ the target relations on memory.
+ Its default is <literal>64</literal>; probably too small for most of
+ real use cases.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-hash_size" xreflabel="cache_scan.hash_size">
+ <term><varname>cache_scan.hash_size</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.hash_size</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls width of the internal hash table slots; that
+ link every columnar cache distributed by table's oid.
+ Its default is <literal>128</>; no need to adjust it usually.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-max_cached_attnum" xreflabel="cache_scan.max_cached_attnum">
+ <term><varname>cache_scan.max_cached_attnum</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.max_cached_attnum</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the maximum attribute number we can cache on
+ the columner cache. Because of internal data representation, a bitmap set
+ to track attributes being cached has to be fixed-length.
+ Thus, the largest attribute number needs to be fixed preliminary.
+ Its default is <literal>128</>; although most tables likely have less than
+ 100 columns.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+ <sect2>
+ <title>Author</title>
+ <para>
+ KaiGai Kohei <email>kaigai@kaigai.gr.jp</email>
+ </para>
+ </sect2>
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 2002f60..3d8fd05 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -107,6 +107,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
&auto-explain;
&btree-gin;
&btree-gist;
+ &cache-scan;
&chkpass;
&citext;
&ctidscan;
diff --git a/doc/src/sgml/custom-scan.sgml b/doc/src/sgml/custom-scan.sgml
index f53902d..218a5fd 100644
--- a/doc/src/sgml/custom-scan.sgml
+++ b/doc/src/sgml/custom-scan.sgml
@@ -55,6 +55,20 @@
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><xref linkend="cache-scan"></term>
+ <listitem>
+ <para>
+ This custom scan in this module enables a scan refering the on-memory
+ columner cache instead of the heap, if the target relation already has
+ this cache being constructed already.
+ Unlike buffer cache, it holds limited number of columns that have been
+ referenced before, but not all the columns in the table definition.
+ Thus, it allows to cache much larger number of records on-memory than
+ buffer cache.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
<para>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index aa2be4b..10c7666 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -103,6 +103,7 @@
<!ENTITY auto-explain SYSTEM "auto-explain.sgml">
<!ENTITY btree-gin SYSTEM "btree-gin.sgml">
<!ENTITY btree-gist SYSTEM "btree-gist.sgml">
+<!ENTITY cache-scan SYSTEM "cache-scan.sgml">
<!ENTITY chkpass SYSTEM "chkpass.sgml">
<!ENTITY citext SYSTEM "citext.sgml">
<!ENTITY ctidscan SYSTEM "ctidscan.sgml">
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 27cbac8..1fb5f4a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -42,6 +42,9 @@ typedef struct
bool marked[MaxHeapTuplesPerPage + 1];
} PruneState;
+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
/* Local functions */
static int heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
* and update FSM with the remaining space.
*/
+ /*
+ * This callback allows extensions to synchronize their own status with
+ * heap image on the disk, when this buffer page is vacuumed.
+ */
+ if (heap_page_prune_hook)
+ (*heap_page_prune_hook)(relation,
+ buffer,
+ ndeleted,
+ OldestXmin,
+ prstate.latestRemovedXid);
return ndeleted;
}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
+ *
+ * In case when the supplied HeapTuple is not associated with a particular
+ * buffer, it just returns without any jobs. It may happen when an extension
+ * caches tuple with their own way.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
+ if (BufferIsInvalid(buffer))
+ return;
+
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bfdadc3..9775aad 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -164,6 +164,13 @@ extern void heap_restrpos(HeapScanDesc scan);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
+typedef void (*heap_page_prune_hook_type)(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid);
+extern heap_page_prune_hook_type heap_page_prune_hook;
+
extern void heap_page_prune_opt(Relation relation, Buffer buffer,
TransactionId OldestXmin);
extern int heap_page_prune(Relation relation, Buffer buffer,
Hello,
Because of time pressure in the commit-fest:Jan, I tried to simplifies the patch
for cache-only scan into three portions; (1) add a hook on heap_page_prune
for cache invalidation on vacuuming a particular page. (2) add a check to accept
InvalidBuffer on SetHintBits (3) a proof-of-concept module of cache-only scan.
(1) pgsql-v9.4-heap_page_prune_hook.v1.patch
Once on-memory columnar cache is constructed, then it needs to be invalidated
if heap page on behalf of the cache is modified. In usual DML cases, extension
can get control using row-level trigger functions for invalidation,
however, we right
now have no way to get control on a page is vacuumed, usually handled by
autovacuum process.
This patch adds a callback on heap_page_prune(), to allow extensions to prune
dead entries on its cache, not only heap pages.
I'd also like to see any other scenario we need to invalidate columnar cache
entries, if exist. It seems to me object_access_hook makes sense to conver
DDL and VACUUM FULL scenario...
(2) pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patch
In case when we want to check visibility of the tuples on cache entries (thus
no particular shared buffer is associated) using HeapTupleSatisfiesVisibility,
it internally tries to update hint bits of tuples. However, it does
not make sense
onto the tuples being not associated with a particular shared buffer.
Due to its definition, tuple entries being on cache does not connected with
a particular shared buffer. If we need to load whole of the buffer page to set
hint bits, it is totally nonsense because the purpose of on-memory cache is
to reduce disk accesses.
This patch adds an exceptional condition on SetHintBits() to skip anything
if the given buffer is InvalidBuffer. It allows to check tuple
visibility using regular
visibility check functions, without re-invention of the wheel by themselves.
(3) pgsql-v9.4-contrib-cache-scan.v1.patch
Unlike (1) and (2), this patch is just a proof of the concept to
implement cache-
only scan on top of the custom-scan interface.
It tries to offer an alternative scan path on the table with row-level
triggers for
cache invalidation if total width of referenced columns are less than 30% of the
total width of table definition. Thus, it can keep larger number of records with
meaningful portion on the main memory.
This cache shall be invalidated according to the main heap update. One is
row-level trigger, second is object_access_hook on DDL, and the third is
heap_page_prune hook. Once a columns reduced tuple gets cached, it is
copied to the cache memory from the shared buffer, so it needs a feature
to ignore InvalidBuffer for visibility check functions.
Please volunteer to reviewing the patches, especially (1) and (2) that are
very small portion.
Thanks,
2014-01-21 KaiGai Kohei <kaigai@ak.jp.nec.com>:
Hello,
I revisited the patch for contrib/cache_scan extension.
The previous one had a problem when T-tree node shall be rebalanced
then crashed on merging the node.Even though contrib/cache_scan portion has more than 2KL code,
things I'd like to have a discussion first is a portion of the
core enhancements to run MVCCsnapshot on the cached tuple, and
to get callback on vacuumed pages for cache synchronization.Any comments please.
Thanks,
(2014/01/15 0:06), Kohei KaiGai wrote:
Hello,
The attached patch is what we discussed just before the commit-fest:Nov.
It implements an alternative way to scan a particular table using
on-memory
cache instead of the usual heap access method. Unlike buffer cache, this
mechanism caches a limited number of columns on the memory, so memory
consumption per tuple is much smaller than the regular heap access method,
thus it allows much larger number of tuples on the memory.I'd like to extend this idea to implement a feature to cache data
according to
column-oriented data structure to utilize parallel calculation processors
like
CPU's SIMD operations or simple GPU cores. (Probably, it makes sense to
evaluate multiple records with a single vector instruction if contents of
a particular column is put as a large array.)
However, this patch still keeps all the tuples in row-oriented data
format,
because row <=> column translation makes this patch bigger than the
current form (about 2KL), and GPU integration needs to link proprietary
library (cuda or opencl) thus I thought it is not preferable for the
upstream
code.Also note that this patch needs part-1 ~ part-3 patches of CustomScan
APIs as prerequisites because it is implemented on top of the APIs.One thing I have to apologize is, lack of documentation and source code
comments around the contrib/ code. Please give me a couple of days to
clean-up the code.
Aside from the extension code, I put two enhancement on the core code
as follows. I'd like to have a discussion about adequacy of these
enhancement.The first enhancement is a hook on heap_page_prune() to synchronize
internal state of extension with changes of heap image on the disk.
It is not avoidable to hold garbage, increasing time by time, on the
cache,
thus needs to clean up as vacuum process doing. The best timing to do
is when dead tuples are reclaimed because it is certain nobody will
reference the tuples any more.diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index f626755..023f78e 100644 --- a/src/backend/utils/time/tqual.c bool marked[MaxHeapTuplesPerPage + 1]; } PruneState;+/* Callback for each page pruning */ +heap_page_prune_hook_type heap_page_prune_hook = NULL; + /* Local functions */ static int heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, @@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, Transacti onId OldestXmin, * and update FSM with the remaining space. */+ /* + * This callback allows extensions to synchronize their own status with + * heap image on the disk, when this buffer page is vacuumed. + */ + if (heap_page_prune_hook) + (*heap_page_prune_hook)(relation, + buffer, + ndeleted, + OldestXmin, + prstate.latestRemovedXid); return ndeleted; }The second enhancement makes SetHintBits() accepts InvalidBuffer to
ignore all the jobs. We need to check visibility of cached tuples when
custom-scan node scans cached table instead of the heap.
Even though we can use MVCC snapshot to check tuple's visibility,
it may internally set hint bit of tuples thus we always needs to give
a valid buffer pointer to HeapTupleSatisfiesVisibility(). Unfortunately,
it kills all the benefit of table cache if it takes to load the heap
buffer
being associated with the cached tuple.
So, I'd like to have a special case handling on the SetHintBits() for
dry-run when InvalidBuffer is given.diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index f626755..023f78e 100644 --- a/src/backend/utils/time/tqual.c +++ b/src/backend/utils/time/tqual.c @@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot); * * The caller should pass xid as the XID of the transaction to check, or * InvalidTransactionId if no check is needed. + * + * In case when the supplied HeapTuple is not associated with a particular + * buffer, it just returns without any jobs. It may happen when an extension + * caches tuple with their own way. */ static inline void SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { + if (BufferIsInvalid(buffer)) + return; + if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here! */Thanks,
2013/11/13 Kohei KaiGai <kaigai@kaigai.gr.jp>:
2013/11/12 Tom Lane <tgl@sss.pgh.pa.us>:
Kohei KaiGai <kaigai@kaigai.gr.jp> writes:
So, are you thinking it is a feasible approach to focus on custom-scan
APIs during the upcoming CF3, then table-caching feature as use-case
of this APIs on CF4?Sure. If you work on this extension after CF3, and it reveals that the
custom scan stuff needs some adjustments, there would be time to do that
in CF4. The policy about what can be submitted in CF4 is that we don't
want new major features that no one has seen before, not that you can't
make fixes to previously submitted stuff. Something like a new hook
in vacuum wouldn't be a "major feature", anyway.Thanks for this clarification.
3 days are too short to write a patch, however, 2 month may be sufficient
to develop a feature on top of the scheme being discussed in the previous
comitfest.Best regards,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
Attachments:
pgsql-v9.4-contrib-cache-scan.v1.patchapplication/octet-stream; name=pgsql-v9.4-contrib-cache-scan.v1.patchDownload
contrib/cache_scan/Makefile | 19 +
contrib/cache_scan/cache_scan--1.0.sql | 26 +
contrib/cache_scan/cache_scan--unpackaged--1.0.sql | 3 +
contrib/cache_scan/cache_scan.control | 5 +
contrib/cache_scan/cache_scan.h | 68 +
contrib/cache_scan/ccache.c | 1410 ++++++++++++++++++++
contrib/cache_scan/cscan.c | 761 +++++++++++
doc/src/sgml/cache-scan.sgml | 224 ++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/custom-scan.sgml | 14 +
doc/src/sgml/filelist.sgml | 1 +
11 files changed, 2532 insertions(+)
diff --git a/contrib/cache_scan/Makefile b/contrib/cache_scan/Makefile
new file mode 100644
index 0000000..4e68b68
--- /dev/null
+++ b/contrib/cache_scan/Makefile
@@ -0,0 +1,19 @@
+# contrib/dbcache/Makefile
+
+MODULE_big = cache_scan
+OBJS = cscan.o ccache.o
+
+EXTENSION = cache_scan
+DATA = cache_scan--1.0.sql cache_scan--unpackaged--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/cache_scan
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
diff --git a/contrib/cache_scan/cache_scan--1.0.sql b/contrib/cache_scan/cache_scan--1.0.sql
new file mode 100644
index 0000000..4bd04d1
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--1.0.sql
@@ -0,0 +1,26 @@
+CREATE FUNCTION public.cache_scan_synchronizer()
+RETURNS trigger
+AS 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE STRICT;
+
+CREATE TYPE public.__cache_scan_debuginfo AS
+(
+ tableoid oid,
+ status text,
+ chunk text,
+ upper text,
+ l_depth int4,
+ l_chunk text,
+ r_depth int4,
+ r_chunk text,
+ ntuples int4,
+ usage int4,
+ min_ctid tid,
+ max_ctid tid
+);
+CREATE FUNCTION public.cache_scan_debuginfo()
+ RETURNS SETOF public.__cache_scan_debuginfo
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT;
+
+
diff --git a/contrib/cache_scan/cache_scan--unpackaged--1.0.sql b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
new file mode 100644
index 0000000..718a2de
--- /dev/null
+++ b/contrib/cache_scan/cache_scan--unpackaged--1.0.sql
@@ -0,0 +1,3 @@
+DROP FUNCTION public.cache_scan_synchronizer() CASCADE;
+DROP FUNCTION public.cache_scan_debuginfo() CASCADE;
+DROP TYPE public.__cache_scan_debuginfo;
diff --git a/contrib/cache_scan/cache_scan.control b/contrib/cache_scan/cache_scan.control
new file mode 100644
index 0000000..77946da
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.control
@@ -0,0 +1,5 @@
+# cache_scan extension
+comment = 'custom scan provider for cache-only scan'
+default_version = '1.0'
+module_pathname = '$libdir/cache_scan'
+relocatable = false
diff --git a/contrib/cache_scan/cache_scan.h b/contrib/cache_scan/cache_scan.h
new file mode 100644
index 0000000..d06156e
--- /dev/null
+++ b/contrib/cache_scan/cache_scan.h
@@ -0,0 +1,68 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cache_scan.h
+ *
+ * Definitions for the cache_scan extension
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#ifndef CACHE_SCAN_H
+#define CACHE_SCAN_H
+#include "access/htup_details.h"
+#include "lib/ilist.h"
+#include "nodes/bitmapset.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+
+typedef struct ccache_chunk {
+ struct ccache_chunk *upper; /* link to the upper node */
+ struct ccache_chunk *right; /* link to the greaternode, if exist */
+ struct ccache_chunk *left; /* link to the less node, if exist */
+ int r_depth; /* max depth in right branch */
+ int l_depth; /* max depth in left branch */
+ uint32 ntups; /* number of tuples being cached */
+ uint32 usage; /* usage counter of this chunk */
+ HeapTuple tuples[FLEXIBLE_ARRAY_MEMBER];
+} ccache_chunk;
+
+#define CCACHE_STATUS_INITIALIZED 1
+#define CCACHE_STATUS_IN_PROGRESS 2
+#define CCACHE_STATUS_CONSTRUCTED 3
+
+typedef struct {
+ LWLockId lock; /* used to protect ttree links */
+ volatile int refcnt;
+ int status;
+
+ dlist_node hash_chain; /* linked to ccache_hash->slots[] */
+ dlist_node lru_chain; /* linked to ccache_hash->lru_list */
+
+ Oid tableoid;
+ ccache_chunk *root_chunk;
+ Bitmapset attrs_used; /* !Bitmapset is variable length! */
+} ccache_head;
+
+extern int ccache_max_attribute_number(void);
+extern ccache_head *cs_get_ccache(Oid tableoid, Bitmapset *attrs_used,
+ bool create_on_demand);
+extern void cs_put_ccache(ccache_head *ccache);
+
+extern bool ccache_insert_tuple(ccache_head *ccache,
+ Relation rel, HeapTuple tuple);
+extern bool ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup);
+
+extern void ccache_vacuum_page(ccache_head *ccache, Buffer buffer);
+
+extern HeapTuple ccache_find_tuple(ccache_chunk *cchunk,
+ ItemPointer ctid,
+ ScanDirection direction);
+extern void ccache_init(void);
+
+extern Datum cache_scan_synchronizer(PG_FUNCTION_ARGS);
+extern Datum cache_scan_debuginfo(PG_FUNCTION_ARGS);
+
+extern void _PG_init(void);
+
+#endif /* CACHE_SCAN_H */
diff --git a/contrib/cache_scan/ccache.c b/contrib/cache_scan/ccache.c
new file mode 100644
index 0000000..0bb9ff4
--- /dev/null
+++ b/contrib/cache_scan/ccache.c
@@ -0,0 +1,1410 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/ccache.c
+ *
+ * Routines for columns-culled cache implementation
+ *
+ * Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/heapam.h"
+#include "access/sysattr.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "storage/ipc.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+#include "cache_scan.h"
+
+/*
+ * Hash table to manage all the ccache_head
+ */
+typedef struct {
+ slock_t lock; /* lock of the hash table */
+ dlist_head lru_list; /* list of recently used cache */
+ dlist_head free_list; /* list of free ccache_head */
+ volatile int lwlocks_usage;
+ LWLockId *lwlocks;
+ dlist_head *slots;
+} ccache_hash;
+
+/*
+ * Data structure to manage blocks on the shared memory segment.
+ * This extension acquires (shmseg_blocksize) x (shmseg_num_blocks) bytes of
+ * shared memory, then it shall be split into the fixed-length memory blocks.
+ * All the memory allocation and relase are done by block, to avoid memory
+ * fragmentation that eventually makes implementation complicated.
+ *
+ * The shmseg_head has a spinlock and global free_list to link free blocks.
+ * Its blocks[] array contains shmseg_block structures that points a particular
+ * address of the associated memory block.
+ * The shmseg_block being chained in the free_list of shmseg_head are available
+ * to allocate. Elsewhere, this block is already allocated on somewhere.
+ */
+typedef struct {
+ dlist_node chain;
+ Size address;
+} shmseg_block;
+
+typedef struct {
+ slock_t lock;
+ dlist_head free_list;
+ Size base_address;
+ shmseg_block blocks[FLEXIBLE_ARRAY_MEMBER];
+} shmseg_head;
+
+/*
+ * ccache_entry is used to track ccache_head being acquired by this backend.
+ */
+typedef struct {
+ dlist_node chain;
+ ResourceOwner owner;
+ ccache_head *ccache;
+} ccache_entry;
+
+static dlist_head ccache_local_list;
+static dlist_head ccache_free_list;
+
+/* Static variables */
+static shmem_startup_hook_type shmem_startup_next = NULL;
+
+static ccache_hash *cs_ccache_hash = NULL;
+static shmseg_head *cs_shmseg_head = NULL;
+
+/* GUC variables */
+static int ccache_hash_size;
+static int shmseg_blocksize;
+static int shmseg_num_blocks;
+static int max_cached_attnum;
+
+/* Static functions */
+static void *cs_alloc_shmblock(void);
+static void cs_free_shmblock(void *address);
+
+int
+ccache_max_attribute_number(void)
+{
+ return (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+}
+
+/*
+ * ccache_on_resource_release
+ *
+ * It is a callback to put ccache_head being acquired locally, to keep
+ * consistency of reference counter.
+ */
+static void
+ccache_on_resource_release(ResourceReleasePhase phase,
+ bool isCommit,
+ bool isTopLevel,
+ void *arg)
+{
+ dlist_mutable_iter iter;
+
+ if (phase != RESOURCE_RELEASE_AFTER_LOCKS)
+ return;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+
+ if (isCommit)
+ elog(WARNING, "cache reference leak (tableoid=%u, refcnt=%d)",
+ entry->ccache->tableoid, entry->ccache->refcnt);
+ cs_put_ccache(entry->ccache);
+
+ entry->ccache = NULL;
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+}
+
+static ccache_chunk *
+ccache_alloc_chunk(ccache_head *ccache, ccache_chunk *upper)
+{
+ ccache_chunk *cchunk = cs_alloc_shmblock();
+
+ if (cchunk)
+ {
+ cchunk->upper = upper;
+ cchunk->right = NULL;
+ cchunk->left = NULL;
+ cchunk->r_depth = 0;
+ cchunk->l_depth = 0;
+ cchunk->ntups = 0;
+ cchunk->usage = shmseg_blocksize;
+ }
+ return cchunk;
+}
+
+/*
+ * ccache_rebalance_tree
+ *
+ * It keeps the balance of ccache tree if the supplied chunk has
+ * unbalanced subtrees.
+ */
+#define AssertIfNotShmem(addr) \
+ Assert((addr) == NULL || \
+ (((Size)(addr)) >= cs_shmseg_head->base_address && \
+ ((Size)(addr)) < (cs_shmseg_head->base_address + \
+ shmseg_num_blocks * shmseg_blocksize)))
+
+#define TTREE_DEPTH(chunk) \
+ ((chunk) == 0 ? 0 : Max((chunk)->l_depth, (chunk)->r_depth) + 1)
+
+static void
+ccache_rebalance_tree(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ Assert(cchunk->upper != NULL
+ ? (cchunk->upper->left == cchunk || cchunk->upper->right == cchunk)
+ : (ccache->root_chunk == cchunk));
+
+ if (cchunk->l_depth + 1 < cchunk->r_depth)
+ {
+ /* anticlockwise rotation */
+ ccache_chunk *rchunk = cchunk->right;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->right = rchunk->left;
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+ cchunk->upper = rchunk;
+
+ rchunk->left = cchunk;
+ rchunk->l_depth = TTREE_DEPTH(rchunk->left);
+ rchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = rchunk;
+ else if (upper->left == cchunk)
+ {
+ upper->left = rchunk;
+ upper->l_depth = TTREE_DEPTH(rchunk);
+ }
+ else
+ {
+ upper->right = rchunk;
+ upper->r_depth = TTREE_DEPTH(rchunk);
+ }
+ AssertIfNotShmem(cchunk->right);
+ AssertIfNotShmem(cchunk->left);
+ AssertIfNotShmem(cchunk->upper);
+ AssertIfNotShmem(rchunk->left);
+ AssertIfNotShmem(rchunk->right);
+ AssertIfNotShmem(rchunk->upper);
+ }
+ else if (cchunk->l_depth > cchunk->r_depth + 1)
+ {
+ /* clockwise rotation */
+ ccache_chunk *lchunk = cchunk->left;
+ ccache_chunk *upper = cchunk->upper;
+
+ cchunk->left = lchunk->right;
+ cchunk->l_depth = TTREE_DEPTH(cchunk->left);
+ cchunk->upper = lchunk;
+
+ lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);
+ lchunk->upper = upper;
+
+ if (!upper)
+ ccache->root_chunk = lchunk;
+ else if (upper->right == cchunk)
+ {
+ upper->right = lchunk;
+ upper->r_depth = TTREE_DEPTH(lchunk) + 1;
+ }
+ else
+ {
+ upper->left = lchunk;
+ upper->l_depth = TTREE_DEPTH(lchunk) + 1;
+ }
+ AssertIfNotShmem(cchunk->right);
+ AssertIfNotShmem(cchunk->left);
+ AssertIfNotShmem(cchunk->upper);
+ AssertIfNotShmem(lchunk->left);
+ AssertIfNotShmem(lchunk->right);
+ AssertIfNotShmem(lchunk->upper);
+ }
+}
+
+/*
+ * ccache_insert_tuple
+ *
+ * It inserts the supplied tuple, but uncached columns are dropped off,
+ * onto the ccache_head. If no space is left, it expands the t-tree
+ * structure with a chunk newly allocated. If no shared memory space was
+ * left, it returns false.
+ */
+#define cchunk_freespace(cchunk) \
+ ((cchunk)->usage - offsetof(ccache_chunk, tuples[(cchunk)->ntups + 1]))
+
+static void
+do_insert_tuple(ccache_head *ccache, ccache_chunk *cchunk, HeapTuple tuple)
+{
+ HeapTuple newtup;
+ ItemPointer ctid = &tuple->t_self;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+ int i, required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
+
+ Assert(required <= cchunk_freespace(cchunk));
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+
+ if (i_min < cchunk->ntups)
+ {
+ HeapTuple movtup = cchunk->tuples[i_min];
+ Size movlen = HEAPTUPLESIZE + MAXALIGN(movtup->t_len);
+ char *destaddr = (char *)movtup + movlen - required;
+
+ Assert(ItemPointerCompare(&tuple->t_self, &movtup->t_self) < 0);
+
+ memmove((char *)cchunk + cchunk->usage - required,
+ (char *)cchunk + cchunk->usage,
+ ((Size)movtup + movlen) - ((Size)cchunk + cchunk->usage));
+ for (i=cchunk->ntups; i > i_min; i--)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[i-1] - required);
+ cchunk->tuples[i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp->t_data - required);
+ }
+ cchunk->tuples[i_min] = newtup = (HeapTuple)destaddr;
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+ cchunk->usage -= required;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+ else
+ {
+ cchunk->usage -= required;
+ newtup = (HeapTuple)(((char *)cchunk) + cchunk->usage);
+ memcpy(newtup, tuple, HEAPTUPLESIZE);
+ newtup->t_data = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, tuple->t_data, tuple->t_len);
+
+ cchunk->tuples[i_min] = newtup;
+ cchunk->ntups++;
+
+ Assert(cchunk->usage >= offsetof(ccache_chunk, tuples[cchunk->ntups]));
+ }
+}
+
+static void
+copy_tuple_properties(HeapTuple newtup, HeapTuple oldtup)
+{
+ ItemPointerCopy(&oldtup->t_self, &newtup->t_self);
+ newtup->t_tableOid = oldtup->t_tableOid;
+ memcpy(&newtup->t_data->t_choice.t_heap,
+ &oldtup->t_data->t_choice.t_heap,
+ sizeof(HeapTupleFields));
+ ItemPointerCopy(&oldtup->t_data->t_ctid,
+ &newtup->t_data->t_ctid);
+ newtup->t_data->t_infomask
+ = ((newtup->t_data->t_infomask & ~HEAP_XACT_MASK) |
+ (oldtup->t_data->t_infomask & HEAP_XACT_MASK));
+ newtup->t_data->t_infomask2
+ = ((newtup->t_data->t_infomask2 & ~HEAP2_XACT_MASK) |
+ (oldtup->t_data->t_infomask2 & HEAP2_XACT_MASK));
+}
+
+static bool
+ccache_insert_tuple_internal(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ HeapTuple newtup)
+{
+ ItemPointer ctid = &newtup->t_self;
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int required = MAXALIGN(HEAPTUPLESIZE + newtup->t_len);
+
+ if (cchunk->ntups == 0)
+ {
+ HeapTuple tup;
+
+ cchunk->usage -= required;
+ cchunk->tuples[0] = tup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(tup, newtup, HEAPTUPLESIZE);
+ tup->t_data = (HeapTupleHeader)((char *)tup + HEAPTUPLESIZE);
+ memcpy(tup->t_data, newtup->t_data, newtup->t_len);
+ cchunk->ntups++;
+
+ return true;
+ }
+
+retry:
+ min_ctid = &cchunk->tuples[0]->t_self;
+ max_ctid = &cchunk->tuples[cchunk->ntups - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (!cchunk->left && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->left)
+ {
+ cchunk->left = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->left)
+ return false;
+ cchunk->l_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->left, newtup))
+ return false;
+ cchunk->l_depth = TTREE_DEPTH(cchunk->left);
+ }
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (!cchunk->right && required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, newtup))
+ return false;
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+ }
+ }
+ else
+ {
+ if (required <= cchunk_freespace(cchunk))
+ do_insert_tuple(ccache, cchunk, newtup);
+ else
+ {
+ HeapTuple movtup;
+
+ /* push out largest ctid until we get enough space */
+ if (!cchunk->right)
+ {
+ cchunk->right = ccache_alloc_chunk(ccache, cchunk);
+ if (!cchunk->right)
+ return false;
+ cchunk->r_depth = 1;
+ }
+ movtup = cchunk->tuples[cchunk->ntups - 1];
+
+ if (!ccache_insert_tuple_internal(ccache, cchunk->right, movtup))
+ return false;
+
+ cchunk->ntups--;
+ cchunk->usage += MAXALIGN(HEAPTUPLESIZE + movtup->t_len);
+ cchunk->r_depth = TTREE_DEPTH(cchunk->right);
+
+ goto retry;
+ }
+ }
+ /* Rebalance the tree, if needed */
+ ccache_rebalance_tree(ccache, cchunk);
+
+ return true;
+}
+
+bool
+ccache_insert_tuple(ccache_head *ccache, Relation rel, HeapTuple tuple)
+{
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ HeapTuple newtup;
+ Datum *cs_values = alloca(sizeof(Datum) * tupdesc->natts);
+ bool *cs_isnull = alloca(sizeof(bool) * tupdesc->natts);
+ int i, j;
+
+ /* remove unreferenced columns */
+ heap_deform_tuple(tuple, tupdesc, cs_values, cs_isnull);
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ j = i + 1 - FirstLowInvalidHeapAttributeNumber;
+
+ if (!bms_is_member(j, &ccache->attrs_used))
+ cs_isnull[i] = true;
+ }
+ newtup = heap_form_tuple(tupdesc, cs_values, cs_isnull);
+ copy_tuple_properties(newtup, tuple);
+
+ return ccache_insert_tuple_internal(ccache, ccache->root_chunk, newtup);
+}
+
+/*
+ * ccache_find_tuple
+ *
+ * It find a tuple that satisfies the supplied ItemPointer according to
+ * the ScanDirection. If NoMovementScanDirection, it returns a tuple that
+ * has strictly same ItemPointer. On the other hand, it returns a tuple
+ * that has the least ItemPointer greater than the supplied one if
+ * ForwardScanDirection, and also returns a tuple with the greatest
+ * ItemPointer smaller than the supplied one if BackwardScanDirection.
+ */
+HeapTuple
+ccache_find_tuple(ccache_chunk *cchunk, ItemPointer ctid,
+ ScanDirection direction)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ HeapTuple tuple = NULL;
+ int i_min = 0;
+ int i_max = cchunk->ntups - 1;
+ int rc;
+
+ if (cchunk->ntups == 0)
+ return false;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max]->t_self;
+
+ if ((rc = ItemPointerCompare(ctid, min_ctid)) <= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == ForwardScanDirection))
+ {
+ if (cchunk->ntups > direction)
+ return cchunk->tuples[direction];
+ }
+ else
+ {
+ if (cchunk->left)
+ tuple = ccache_find_tuple(cchunk->left, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == ForwardScanDirection)
+ return cchunk->tuples[0];
+ return tuple;
+ }
+ }
+
+ if ((rc = ItemPointerCompare(ctid, max_ctid)) >= 0)
+ {
+ if (rc == 0 && (direction == NoMovementScanDirection ||
+ direction == BackwardScanDirection))
+ {
+ if (i_max + direction >= 0)
+ return cchunk->tuples[i_max + direction];
+ }
+ else
+ {
+ if (cchunk->right)
+ tuple = ccache_find_tuple(cchunk->right, ctid, direction);
+ if (!HeapTupleIsValid(tuple) && direction == BackwardScanDirection)
+ return cchunk->tuples[i_max];
+ return tuple;
+ }
+ }
+
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == NoMovementScanDirection)
+ return cchunk->tuples[i_min];
+ else if (direction == ForwardScanDirection)
+ {
+ Assert(i_min + 1 < cchunk->ntups);
+ return cchunk->tuples[i_min + 1];
+ }
+ }
+ else
+ {
+ if (direction == BackwardScanDirection && i_min > 0)
+ return cchunk->tuples[i_min - 1];
+ else if (direction == ForwardScanDirection)
+ return cchunk->tuples[i_min];
+ }
+ return NULL;
+}
+
+/*
+ * ccache_delete_tuple
+ *
+ * It synchronizes the properties of tuple being already cached, usually
+ * for deletion.
+ */
+bool
+ccache_delete_tuple(ccache_head *ccache, HeapTuple oldtup)
+{
+ HeapTuple tuple;
+
+ tuple = ccache_find_tuple(ccache->root_chunk, &oldtup->t_self,
+ NoMovementScanDirection);
+ if (!tuple)
+ return false;
+
+ copy_tuple_properties(tuple, oldtup);
+
+ return true;
+}
+
+/*
+ * ccache_merge_chunk
+ *
+ * It merges two chunks if these have enough free space to consolidate
+ * its contents into one.
+ */
+static void
+ccache_merge_chunk(ccache_head *ccache, ccache_chunk *cchunk)
+{
+ ccache_chunk *curr;
+ ccache_chunk **upper;
+ int *p_depth;
+ int i;
+ bool needs_rebalance = false;
+
+ /* find the least right node that has no left node */
+ upper = &cchunk->right;
+ p_depth = &cchunk->r_depth;
+ curr = cchunk->right;
+ while (curr != NULL)
+ {
+ if (!curr->left)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ long total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if ((long)offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *rchunk = curr->right;
+
+ /* merge contents */
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ cchunk->usage -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + cchunk->usage);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+
+ cchunk->tuples[cchunk->ntups++] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->right;
+ *p_depth = curr->r_depth;
+ if (rchunk)
+ rchunk->upper = curr->upper;
+
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->left;
+ p_depth = &curr->l_depth;
+ curr = curr->left;
+ }
+
+ /* find the greatest left node that has no right node */
+ upper = &cchunk->left;
+ p_depth = &cchunk->l_depth;
+ curr = cchunk->left;
+
+ while (curr != NULL)
+ {
+ if (!curr->right)
+ {
+ Size shift = shmseg_blocksize - curr->usage;
+ long total_usage = cchunk->usage - shift;
+ int total_ntups = cchunk->ntups + curr->ntups;
+
+ if ((long)offsetof(ccache_chunk, tuples[total_ntups]) < total_usage)
+ {
+ ccache_chunk *lchunk = curr->left;
+ Size offset;
+
+ /* merge contents */
+ memmove((char *)cchunk + cchunk->usage - shift,
+ (char *)cchunk + cchunk->usage,
+ shmseg_blocksize - cchunk->usage);
+ for (i=cchunk->ntups - 1; i >= 0; i--)
+ {
+ HeapTuple temp
+ = (HeapTuple)((char *)cchunk->tuples[i] - shift);
+
+ cchunk->tuples[curr->ntups + i] = temp;
+ temp->t_data = (HeapTupleHeader)((char *)temp +
+ HEAPTUPLESIZE);
+ }
+ cchunk->usage -= shift;
+ cchunk->ntups += curr->ntups;
+
+ /* merge contents */
+ offset = shmseg_blocksize;
+ for (i=0; i < curr->ntups; i++)
+ {
+ HeapTuple oldtup = curr->tuples[i];
+ HeapTuple newtup;
+
+ offset -= HEAPTUPLESIZE + MAXALIGN(oldtup->t_len);
+ newtup = (HeapTuple)((char *)cchunk + offset);
+ memcpy(newtup, oldtup, HEAPTUPLESIZE);
+ newtup->t_data
+ = (HeapTupleHeader)((char *)newtup + HEAPTUPLESIZE);
+ memcpy(newtup->t_data, oldtup->t_data,
+ MAXALIGN(oldtup->t_len));
+ cchunk->tuples[i] = newtup;
+ }
+
+ /* detach the current chunk */
+ *upper = curr->left;
+ *p_depth = curr->l_depth;
+ if (lchunk)
+ lchunk->upper = curr->upper;
+ /* release it */
+ cs_free_shmblock(curr);
+ needs_rebalance = true;
+ }
+ break;
+ }
+ upper = &curr->right;
+ p_depth = &curr->r_depth;
+ curr = curr->right;
+ }
+ /* Rebalance the tree, if needed */
+ if (needs_rebalance)
+ ccache_rebalance_tree(ccache, cchunk);
+}
+
+/*
+ * ccache_vacuum_page
+ *
+ * It reclaims the tuples being already vacuumed. It shall be kicked on
+ * the callback function of heap_page_prune_hook to synchronize contents
+ * of the cache with on-disk image.
+ */
+static void
+ccache_vacuum_tuple(ccache_head *ccache,
+ ccache_chunk *cchunk,
+ ItemPointer ctid)
+{
+ ItemPointer min_ctid;
+ ItemPointer max_ctid;
+ int i_min = 0;
+ int i_max = cchunk->ntups;
+
+ if (cchunk->ntups == 0)
+ return;
+
+ min_ctid = &cchunk->tuples[i_min]->t_self;
+ max_ctid = &cchunk->tuples[i_max - 1]->t_self;
+
+ if (ItemPointerCompare(ctid, min_ctid) < 0)
+ {
+ if (cchunk->left)
+ ccache_vacuum_tuple(ccache, cchunk->left, ctid);
+ }
+ else if (ItemPointerCompare(ctid, max_ctid) > 0)
+ {
+ if (cchunk->right)
+ ccache_vacuum_tuple(ccache, cchunk->right, ctid);
+ }
+ else
+ {
+ while (i_min < i_max)
+ {
+ int i_mid = (i_min + i_max) / 2;
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_mid]->t_self) <= 0)
+ i_max = i_mid;
+ else
+ i_min = i_mid + 1;
+ }
+ Assert(i_min == i_max);
+
+ if (ItemPointerCompare(ctid, &cchunk->tuples[i_min]->t_self) == 0)
+ {
+ HeapTuple tuple = cchunk->tuples[i_min];
+ int length = MAXALIGN(HEAPTUPLESIZE + tuple->t_len);
+
+ if (i_min < cchunk->ntups - 1)
+ {
+ int j;
+
+ memmove((char *)cchunk + cchunk->usage + length,
+ (char *)cchunk + cchunk->usage,
+ (Size)tuple - ((Size)cchunk + cchunk->usage));
+ for (j=i_min + 1; j < cchunk->ntups; j++)
+ {
+ HeapTuple temp;
+
+ temp = (HeapTuple)((char *)cchunk->tuples[j] + length);
+ cchunk->tuples[j-1] = temp;
+ temp->t_data
+ = (HeapTupleHeader)((char *)temp->t_data + length);
+ }
+ }
+ cchunk->usage += length;
+ cchunk->ntups--;
+ }
+ }
+ /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);
+}
+
+void
+ccache_vacuum_page(ccache_head *ccache, Buffer buffer)
+{
+ /* XXX it needs buffer is valid and pinned */
+ BlockNumber blknum = BufferGetBlockNumber(buffer);
+ Page page = BufferGetPage(buffer);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber offnum;
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemPointerData ctid;
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsNormal(itemid))
+ continue;
+
+ ItemPointerSetBlockNumber(&ctid, blknum);
+ ItemPointerSetOffsetNumber(&ctid, offnum);
+
+ ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
+ }
+}
+
+static void
+ccache_release_all_chunks(ccache_chunk *cchunk)
+{
+ if (cchunk->left)
+ ccache_release_all_chunks(cchunk->left);
+ if (cchunk->right)
+ ccache_release_all_chunks(cchunk->right);
+ cs_free_shmblock(cchunk);
+}
+
+static void
+track_ccache_locally(ccache_head *ccache)
+{
+ ccache_entry *entry;
+ dlist_node *dnode;
+
+ if (dlist_is_empty(&ccache_free_list))
+ {
+ int i;
+
+ PG_TRY();
+ {
+ for (i=0; i < 20; i++)
+ {
+ entry = MemoryContextAlloc(TopMemoryContext,
+ sizeof(ccache_entry));
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ }
+ }
+ PG_CATCH();
+ {
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ }
+ dnode = dlist_pop_head_node(&ccache_free_list);
+ entry = dlist_container(ccache_entry, chain, dnode);
+ entry->owner = CurrentResourceOwner;
+ entry->ccache = ccache;
+ dlist_push_tail(&ccache_local_list, &entry->chain);
+}
+
+static void
+untrack_ccache_locally(ccache_head *ccache)
+{
+ dlist_mutable_iter iter;
+
+ dlist_foreach_modify(iter, &ccache_local_list)
+ {
+ ccache_entry *entry
+ = dlist_container(ccache_entry, chain, iter.cur);
+
+ if (entry->ccache == ccache &&
+ entry->owner == CurrentResourceOwner)
+ {
+ dlist_delete(&entry->chain);
+ dlist_push_tail(&ccache_free_list, &entry->chain);
+ return;
+ }
+ }
+}
+
+static void
+cs_put_ccache_nolock(ccache_head *ccache)
+{
+ Assert(ccache->refcnt > 0);
+ if (--ccache->refcnt == 0)
+ {
+ ccache_release_all_chunks(ccache->root_chunk);
+ dlist_delete(&ccache->hash_chain);
+ dlist_delete(&ccache->lru_chain);
+ dlist_push_head(&cs_ccache_hash->free_list, &ccache->hash_chain);
+ }
+ untrack_ccache_locally(ccache);
+}
+
+void
+cs_put_ccache(ccache_head *cache)
+{
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(cache);
+ SpinLockRelease(&cs_ccache_hash->lock);
+}
+
+static ccache_head *
+cs_create_ccache(Oid tableoid, Bitmapset *attrs_used)
+{
+ ccache_head *temp;
+ ccache_head *new_cache;
+ dlist_node *dnode;
+ int i;
+
+ /*
+ * Here is no columnar cache of this relation or cache attributes are
+ * not enough to run the required query. So, it tries to create a new
+ * ccache_head for the upcoming cache-scan.
+ * Also allocate ones, if we have no free ccache_head any more.
+ */
+ if (dlist_is_empty(&cs_ccache_hash->free_list))
+ {
+ char *buffer;
+ int offset;
+ int nwords, size;
+
+ buffer = cs_alloc_shmblock();
+ if (!buffer)
+ return NULL;
+
+ nwords = (max_cached_attnum - FirstLowInvalidHeapAttributeNumber +
+ BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+ size = MAXALIGN(offsetof(ccache_head,
+ attrs_used.words[nwords + 1]));
+ for (offset = 0; offset <= shmseg_blocksize - size; offset += size)
+ {
+ temp = (ccache_head *)(buffer + offset);
+
+ dlist_push_tail(&cs_ccache_hash->free_list, &temp->hash_chain);
+ }
+ }
+ dnode = dlist_pop_head_node(&cs_ccache_hash->free_list);
+ new_cache = dlist_container(ccache_head, hash_chain, dnode);
+
+ i = cs_ccache_hash->lwlocks_usage++ % ccache_hash_size;
+ new_cache->lock = cs_ccache_hash->lwlocks[i];
+ new_cache->refcnt = 2;
+ new_cache->status = CCACHE_STATUS_INITIALIZED;
+
+ new_cache->tableoid = tableoid;
+ new_cache->root_chunk = ccache_alloc_chunk(new_cache, NULL);
+ if (!new_cache->root_chunk)
+ {
+ dlist_push_head(&cs_ccache_hash->free_list, &new_cache->hash_chain);
+ return NULL;
+ }
+
+ if (attrs_used)
+ memcpy(&new_cache->attrs_used, attrs_used,
+ offsetof(Bitmapset, words[attrs_used->nwords]));
+ else
+ {
+ new_cache->attrs_used.nwords = 1;
+ new_cache->attrs_used.words[0] = 0;
+ }
+ return new_cache;
+}
+
+ccache_head *
+cs_get_ccache(Oid tableoid, Bitmapset *attrs_used, bool create_on_demand)
+{
+ Datum hash = hash_any((unsigned char *)&tableoid, sizeof(Oid));
+ Index i = hash % ccache_hash_size;
+ dlist_iter iter;
+ ccache_head *old_cache = NULL;
+ ccache_head *new_cache = NULL;
+ ccache_head *temp;
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ PG_TRY();
+ {
+ /*
+ * Try to find out existing ccache that has all the columns being
+ * referenced in this query.
+ */
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ temp = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ if (tableoid != temp->tableoid)
+ continue;
+
+ if (bms_is_subset(attrs_used, &temp->attrs_used))
+ {
+ temp->refcnt++;
+ if (create_on_demand)
+ dlist_move_head(&cs_ccache_hash->lru_list,
+ &temp->lru_chain);
+ new_cache = temp;
+ goto out_unlock;
+ }
+ old_cache = temp;
+ break;
+ }
+
+ if (create_on_demand)
+ {
+ if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
+
+ new_cache = cs_create_ccache(tableoid, attrs_used);
+ if (!new_cache)
+ goto out_unlock;
+
+ dlist_push_head(&cs_ccache_hash->slots[i], &new_cache->hash_chain);
+ dlist_push_head(&cs_ccache_hash->lru_list, &new_cache->lru_chain);
+ if (old_cache)
+ cs_put_ccache_nolock(old_cache);
+ }
+ }
+ PG_CATCH();
+ {
+ SpinLockRelease(&cs_ccache_hash->lock);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+out_unlock:
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ if (new_cache)
+ track_ccache_locally(new_cache);
+
+ return new_cache;
+}
+
+typedef struct {
+ Oid tableoid;
+ int status;
+ ccache_chunk *cchunk;
+ ccache_chunk *upper;
+ ccache_chunk *right;
+ ccache_chunk *left;
+ int r_depth;
+ int l_depth;
+ uint32 ntups;
+ uint32 usage;
+ ItemPointerData min_ctid;
+ ItemPointerData max_ctid;
+} ccache_status;
+
+static List *
+cache_scan_debuginfo_internal(ccache_head *ccache,
+ ccache_chunk *cchunk, List *result)
+{
+ ccache_status *cstatus = palloc0(sizeof(ccache_status));
+ List *temp;
+
+ if (cchunk->left)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->left, NIL);
+ result = list_concat(result, temp);
+ }
+ cstatus->tableoid = ccache->tableoid;
+ cstatus->status = ccache->status;
+ cstatus->cchunk = cchunk;
+ cstatus->upper = cchunk->upper;
+ cstatus->right = cchunk->right;
+ cstatus->left = cchunk->left;
+ cstatus->r_depth = cchunk->r_depth;
+ cstatus->l_depth = cchunk->l_depth;
+ cstatus->ntups = cchunk->ntups;
+ cstatus->usage = cchunk->usage;
+ if (cchunk->ntups > 0)
+ {
+ ItemPointerCopy(&cchunk->tuples[0]->t_self,
+ &cstatus->min_ctid);
+ ItemPointerCopy(&cchunk->tuples[cchunk->ntups - 1]->t_self,
+ &cstatus->max_ctid);
+ }
+ else
+ {
+ ItemPointerSet(&cstatus->min_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ ItemPointerSet(&cstatus->max_ctid,
+ InvalidBlockNumber,
+ InvalidOffsetNumber);
+ }
+ result = lappend(result, cstatus);
+
+ if (cchunk->right)
+ {
+ temp = cache_scan_debuginfo_internal(ccache, cchunk->right, NIL);
+ result = list_concat(result, temp);
+ }
+ return result;
+}
+
+/*
+ * cache_scan_debuginfo
+ *
+ * It shows the current status of ccache_chunks being allocated.
+ */
+Datum
+cache_scan_debuginfo(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *fncxt;
+ List *cstatus_list;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ TupleDesc tupdesc;
+ MemoryContext oldcxt;
+ int i;
+ dlist_iter iter;
+ List *result = NIL;
+
+ fncxt = SRF_FIRSTCALL_INIT();
+ oldcxt = MemoryContextSwitchTo(fncxt->multi_call_memory_ctx);
+
+ /* make definition of tuple-descriptor */
+ tupdesc = CreateTemplateTupleDesc(12, false);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "tableoid",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 2, "status",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 3, "chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 4, "upper",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 5, "l_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 6, "l_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 7, "r_depth",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 8, "r_chunk",
+ TEXTOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 9, "ntuples",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)10, "usage",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)11, "min_ctid",
+ TIDOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber)12, "max_ctid",
+ TIDOID, -1, 0);
+ fncxt->tuple_desc = BlessTupleDesc(tupdesc);
+
+ /* make a snapshot of the current table cache */
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ for (i=0; i < ccache_hash_size; i++)
+ {
+ dlist_foreach(iter, &cs_ccache_hash->slots[i])
+ {
+ ccache_head *ccache
+ = dlist_container(ccache_head, hash_chain, iter.cur);
+
+ ccache->refcnt++;
+ SpinLockRelease(&cs_ccache_hash->lock);
+ track_ccache_locally(ccache);
+
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ result = cache_scan_debuginfo_internal(ccache,
+ ccache->root_chunk,
+ result);
+ LWLockRelease(ccache->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ cs_put_ccache_nolock(ccache);
+ }
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ fncxt->user_fctx = result;
+ MemoryContextSwitchTo(oldcxt);
+ }
+ fncxt = SRF_PERCALL_SETUP();
+
+ cstatus_list = (List *)fncxt->user_fctx;
+ if (cstatus_list != NIL &&
+ fncxt->call_cntr < cstatus_list->length)
+ {
+ ccache_status *cstatus = list_nth(cstatus_list, fncxt->call_cntr);
+ Datum values[12];
+ bool isnull[12];
+ HeapTuple tuple;
+
+ memset(isnull, false, sizeof(isnull));
+ values[0] = ObjectIdGetDatum(cstatus->tableoid);
+ if (cstatus->status == CCACHE_STATUS_INITIALIZED)
+ values[1] = CStringGetTextDatum("initialized");
+ else if (cstatus->status == CCACHE_STATUS_IN_PROGRESS)
+ values[1] = CStringGetTextDatum("in-progress");
+ else if (cstatus->status == CCACHE_STATUS_CONSTRUCTED)
+ values[1] = CStringGetTextDatum("constructed");
+ else
+ values[1] = CStringGetTextDatum("unknown");
+ values[2] = CStringGetTextDatum(psprintf("%p", cstatus->cchunk));
+ values[3] = CStringGetTextDatum(psprintf("%p", cstatus->upper));
+ values[4] = Int32GetDatum(cstatus->l_depth);
+ values[5] = CStringGetTextDatum(psprintf("%p", cstatus->left));
+ values[6] = Int32GetDatum(cstatus->r_depth);
+ values[7] = CStringGetTextDatum(psprintf("%p", cstatus->right));
+ values[8] = Int32GetDatum(cstatus->ntups);
+ values[9] = Int32GetDatum(cstatus->usage);
+
+ if (ItemPointerIsValid(&cstatus->min_ctid))
+ values[10] = PointerGetDatum(&cstatus->min_ctid);
+ else
+ isnull[10] = true;
+ if (ItemPointerIsValid(&cstatus->max_ctid))
+ values[11] = PointerGetDatum(&cstatus->max_ctid);
+ else
+ isnull[11] = true;
+
+ tuple = heap_form_tuple(fncxt->tuple_desc, values, isnull);
+
+ SRF_RETURN_NEXT(fncxt, HeapTupleGetDatum(tuple));
+ }
+ SRF_RETURN_DONE(fncxt);
+}
+PG_FUNCTION_INFO_V1(cache_scan_debuginfo);
+
+/*
+ * cs_alloc_shmblock
+ *
+ * It allocates a fixed-length block. The reason why this routine does not
+ * support variable length allocation is to simplify the logic for its purpose.
+ */
+static void *
+cs_alloc_shmblock(void)
+{
+ ccache_head *ccache;
+ dlist_node *dnode;
+ shmseg_block *block;
+ void *address = NULL;
+ int retry = 2;
+
+do_retry:
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ if (dlist_is_empty(&cs_shmseg_head->free_list) && retry-- > 0)
+ {
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ SpinLockAcquire(&cs_ccache_hash->lock);
+ if (!dlist_is_empty(&cs_ccache_hash->lru_list))
+ {
+ dnode = dlist_tail_node(&cs_ccache_hash->lru_list);
+ ccache = dlist_container(ccache_head, lru_chain, dnode);
+
+ cs_put_ccache_nolock(ccache);
+ }
+ SpinLockRelease(&cs_ccache_hash->lock);
+
+ goto do_retry;
+ }
+
+ if (!dlist_is_empty(&cs_shmseg_head->free_list))
+ {
+ dnode = dlist_pop_head_node(&cs_shmseg_head->free_list);
+ block = dlist_container(shmseg_block, chain, dnode);
+
+ memset(&block->chain, 0, sizeof(dlist_node));
+
+ address = (void *) block->address;
+ }
+ SpinLockRelease(&cs_shmseg_head->lock);
+
+ return address;
+}
+
+/*
+ * cs_free_shmblock
+ *
+ * It release a block being allocated by cs_alloc_shmblock
+ */
+static void
+cs_free_shmblock(void *address)
+{
+ Size curr = (Size) address;
+ Size base = cs_shmseg_head->base_address;
+ ulong index;
+ shmseg_block *block;
+
+ Assert((curr - base) % shmseg_blocksize == 0);
+ Assert(curr >= base && curr < base + shmseg_num_blocks * shmseg_blocksize);
+ index = (curr - base) / shmseg_blocksize;
+
+ SpinLockAcquire(&cs_shmseg_head->lock);
+ block = &cs_shmseg_head->blocks[index];
+
+ dlist_push_head(&cs_shmseg_head->free_list, &block->chain);
+
+ SpinLockRelease(&cs_shmseg_head->lock);
+}
+
+static void
+ccache_setup(void)
+{
+ Size curr_address;
+ ulong i;
+ bool found;
+
+ /* allocation of a shared memory segment for table's hash */
+ cs_ccache_hash = ShmemInitStruct("cache_scan: hash of columnar cache",
+ MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(LWLockId) *
+ ccache_hash_size) +
+ MAXALIGN(sizeof(dlist_node) *
+ ccache_hash_size),
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_ccache_hash->lock);
+ dlist_init(&cs_ccache_hash->lru_list);
+ dlist_init(&cs_ccache_hash->free_list);
+ cs_ccache_hash->lwlocks = (void *)(&cs_ccache_hash[1]);
+ cs_ccache_hash->slots
+ = (void *)(&cs_ccache_hash->lwlocks[ccache_hash_size]);
+
+ for (i=0; i < ccache_hash_size; i++)
+ cs_ccache_hash->lwlocks[i] = LWLockAssign();
+ for (i=0; i < ccache_hash_size; i++)
+ dlist_init(&cs_ccache_hash->slots[i]);
+
+ /* allocation of a shared memory segment for columnar cache */
+ cs_shmseg_head = ShmemInitStruct("cache_scan: columnar cache",
+ offsetof(shmseg_head,
+ blocks[shmseg_num_blocks]) +
+ shmseg_num_blocks * shmseg_blocksize,
+ &found);
+ Assert(!found);
+
+ SpinLockInit(&cs_shmseg_head->lock);
+ dlist_init(&cs_shmseg_head->free_list);
+
+ curr_address = MAXALIGN(&cs_shmseg_head->blocks[shmseg_num_blocks]);
+
+ cs_shmseg_head->base_address = curr_address;
+ for (i=0; i < shmseg_num_blocks; i++)
+ {
+ shmseg_block *block = &cs_shmseg_head->blocks[i];
+
+ block->address = curr_address;
+ dlist_push_tail(&cs_shmseg_head->free_list, &block->chain);
+
+ curr_address += shmseg_blocksize;
+ }
+}
+
+void
+ccache_init(void)
+{
+ /* setup GUC variables */
+ DefineCustomIntVariable("cache_scan.block_size",
+ "block size of in-memory columnar cache",
+ NULL,
+ &shmseg_blocksize,
+ 2048 * 1024, /* 2MB */
+ 1024 * 1024, /* 1MB */
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+ if ((shmseg_blocksize & (shmseg_blocksize - 1)) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cache_scan.block_size must be power of 2")));
+
+ DefineCustomIntVariable("cache_scan.num_blocks",
+ "number of in-memory columnar cache blocks",
+ NULL,
+ &shmseg_num_blocks,
+ 64,
+ 64,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.hash_size",
+ "number of hash slots for columnar cache",
+ NULL,
+ &ccache_hash_size,
+ 128,
+ 128,
+ INT_MAX,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ DefineCustomIntVariable("cache_scan.max_cached_attnum",
+ "max attribute number we can cache",
+ NULL,
+ &max_cached_attnum,
+ 256,
+ sizeof(bitmapword) * BITS_PER_BYTE,
+ 2048,
+ PGC_SIGHUP,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* request shared memory segment for table's cache */
+ RequestAddinShmemSpace(MAXALIGN(sizeof(ccache_hash)) +
+ MAXALIGN(sizeof(dlist_head) * ccache_hash_size) +
+ MAXALIGN(sizeof(LWLockId) * ccache_hash_size) +
+ MAXALIGN(offsetof(shmseg_head,
+ blocks[shmseg_num_blocks])) +
+ shmseg_num_blocks * shmseg_blocksize);
+ RequestAddinLWLocks(ccache_hash_size);
+
+ shmem_startup_next = shmem_startup_hook;
+ shmem_startup_hook = ccache_setup;
+
+ /* register resource-release callback */
+ dlist_init(&ccache_local_list);
+ dlist_init(&ccache_free_list);
+ RegisterResourceReleaseCallback(ccache_on_resource_release, NULL);
+}
diff --git a/contrib/cache_scan/cscan.c b/contrib/cache_scan/cscan.c
new file mode 100644
index 0000000..0a63c2e
--- /dev/null
+++ b/contrib/cache_scan/cscan.c
@@ -0,0 +1,761 @@
+/* -------------------------------------------------------------------------
+ *
+ * contrib/cache_scan/cscan.c
+ *
+ * An extension that offers an alternative way to scan a table utilizing column
+ * oriented database cache.
+ *
+ * Copyright (c) 2010-2013, PostgreSQL Global Development Group
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "access/heapam.h"
+#include "access/relscan.h"
+#include "access/sysattr.h"
+#include "catalog/objectaccess.h"
+#include "catalog/pg_language.h"
+#include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
+#include "commands/trigger.h"
+#include "executor/nodeCustom.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/var.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/guc.h"
+#include "utils/spccache.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+#include "cache_scan.h"
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+/* Static variables */
+static add_scan_path_hook_type add_scan_path_next = NULL;
+static object_access_hook_type object_access_next = NULL;
+static heap_page_prune_hook_type heap_page_prune_next = NULL;
+
+static bool cache_scan_disabled;
+
+static bool
+cs_estimate_costs(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Relation rel,
+ CustomPath *cpath,
+ Bitmapset **attrs_used)
+{
+ ListCell *lc;
+ ccache_head *ccache;
+ Oid tableoid = RelationGetRelid(rel);
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ int total_width = 0;
+ int tuple_width = 0;
+ double hit_ratio;
+ Cost run_cost = 0.0;
+ Cost startup_cost = 0.0;
+ double tablespace_page_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ int i;
+
+ /* Mark the path with the correct row estimate */
+ if (cpath->path.param_info)
+ cpath->path.rows = cpath->path.param_info->ppi_rows;
+ else
+ cpath->path.rows = baserel->rows;
+
+ /* List up all the columns being in-use */
+ pull_varattnos((Node *) baserel->reltargetlist,
+ baserel->relid,
+ attrs_used);
+ foreach(lc, baserel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+ pull_varattnos((Node *) rinfo->clause,
+ baserel->relid,
+ attrs_used);
+ }
+
+ for (i=FirstLowInvalidHeapAttributeNumber + 1; i <= 0; i++)
+ {
+ int attidx = i - FirstLowInvalidHeapAttributeNumber;
+
+ if (bms_is_member(attidx, *attrs_used))
+ {
+ /* oid and whole-row reference is not supported */
+ if (i == ObjectIdAttributeNumber || i == InvalidAttrNumber)
+ return false;
+
+ /* clear system attributes from the bitmap */
+ *attrs_used = bms_del_member(*attrs_used, attidx);
+ }
+ }
+
+ /*
+ * Because of layout on the shared memory segment, we have to restrict
+ * the largest attribute number in use to prevent overrun by growth of
+ * Bitmapset.
+ */
+ if (*attrs_used &&
+ (*attrs_used)->nwords > ccache_max_attribute_number())
+ return false;
+
+ /*
+ * Estimation of average width of cached tuples - it does not make
+ * sense to construct a new cache if its average width is more than
+ * 30% of the raw data.
+ */
+ for (i=0; i < tupdesc->natts; i++)
+ {
+ Form_pg_attribute attr = tupdesc->attrs[i];
+ int attidx = i + 1 - FirstLowInvalidHeapAttributeNumber;
+ int width;
+
+ if (attr->attlen > 0)
+ width = attr->attlen;
+ else
+ width = get_attavgwidth(tableoid, attr->attnum);
+
+ total_width += width;
+ if (bms_is_member(attidx, *attrs_used))
+ tuple_width += width;
+ }
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), *attrs_used, false);
+ if (!ccache)
+ {
+ if ((double)tuple_width / (double)total_width > 0.3)
+ return false;
+ hit_ratio = 0.05;
+ }
+ else
+ {
+ hit_ratio = 0.95;
+ cs_put_ccache(ccache);
+ }
+
+ get_tablespace_page_costs(baserel->reltablespace,
+ NULL,
+ &tablespace_page_cost);
+ /* Disk costs */
+ run_cost += (1.0 - hit_ratio) * tablespace_page_cost * baserel->pages;
+
+ /* CPU costs */
+ get_restriction_qual_cost(root, baserel,
+ cpath->path.param_info,
+ &qpqual_cost);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+ run_cost += cpu_per_tuple * baserel->tuples;
+
+ cpath->path.startup_cost = startup_cost;
+ cpath->path.total_cost = startup_cost + run_cost;
+
+ return true;
+}
+
+/*
+ * cs_relation_has_synchronizer
+ *
+ * A table that can have columner-cache also needs to have trigger for
+ * synchronization, to ensure the on-memory cache keeps the latest contents
+ * of the heap. It returns TRUE, if supplied relation has triggers that
+ * invokes cache_scan_synchronizer on appropriate context. Elsewhere, FALSE
+ * shall be returned.
+ */
+static bool
+cs_relation_has_synchronizer(Relation rel)
+{
+ int i, numtriggers;
+ bool has_on_insert_synchronizer = false;
+ bool has_on_update_synchronizer = false;
+ bool has_on_delete_synchronizer = false;
+ bool has_on_truncate_synchronizer = false;
+
+ if (!rel->trigdesc)
+ return false;
+
+ numtriggers = rel->trigdesc->numtriggers;
+ for (i=0; i < numtriggers; i++)
+ {
+ Trigger *trig = rel->trigdesc->triggers + i;
+ HeapTuple tup;
+
+ if (!trig->tgenabled)
+ continue;
+
+ tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(trig->tgfoid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for function %u", trig->tgfoid);
+
+ if (((Form_pg_proc) GETSTRUCT(tup))->prolang == ClanguageId)
+ {
+ Datum value;
+ bool isnull;
+ char *prosrc;
+ char *probin;
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_prosrc, &isnull);
+ if (isnull)
+ elog(ERROR, "null prosrc for C function %u", trig->tgoid);
+ prosrc = TextDatumGetCString(value);
+
+ value = SysCacheGetAttr(PROCOID, tup,
+ Anum_pg_proc_probin, &isnull);
+ if (isnull)
+ elog(ERROR, "null probin for C function %u", trig->tgoid);
+ probin = TextDatumGetCString(value);
+
+ if (strcmp(prosrc, "cache_scan_synchronizer") == 0 &&
+ strcmp(probin, "$libdir/cache_scan") == 0)
+ {
+ int16 tgtype = trig->tgtype;
+
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_INSERT))
+ has_on_insert_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_UPDATE))
+ has_on_update_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_ROW,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_DELETE))
+ has_on_delete_synchronizer = true;
+ if (TRIGGER_TYPE_MATCHES(tgtype,
+ TRIGGER_TYPE_STATEMENT,
+ TRIGGER_TYPE_AFTER,
+ TRIGGER_TYPE_TRUNCATE))
+ has_on_truncate_synchronizer = true;
+ }
+ pfree(prosrc);
+ pfree(probin);
+ }
+ ReleaseSysCache(tup);
+ }
+
+ if (has_on_insert_synchronizer &&
+ has_on_update_synchronizer &&
+ has_on_delete_synchronizer &&
+ has_on_truncate_synchronizer)
+ return true;
+ return false;
+}
+
+
+static void
+cs_add_scan_path(PlannerInfo *root,
+ RelOptInfo *baserel,
+ RangeTblEntry *rte)
+{
+ Relation rel;
+
+ /* call the secondary hook if exist */
+ if (add_scan_path_next)
+ (*add_scan_path_next)(root, baserel, rte);
+
+ /* Is this feature available now? */
+ if (cache_scan_disabled)
+ return;
+
+ /* Only regular tables can be cached */
+ if (baserel->reloptkind != RELOPT_BASEREL ||
+ rte->rtekind != RTE_RELATION)
+ return;
+
+ /* Core code should already acquire an appropriate lock */
+ rel = heap_open(rte->relid, NoLock);
+
+ if (cs_relation_has_synchronizer(rel))
+ {
+ CustomPath *cpath = makeNode(CustomPath);
+ Relids required_outer;
+ Bitmapset *attrs_used = NULL;
+
+ /*
+ * We don't support pushing join clauses into the quals of a ctidscan,
+ * but it could still have required parameterization due to LATERAL
+ * refs in its tlist.
+ */
+ required_outer = baserel->lateral_relids;
+
+ cpath->path.pathtype = T_CustomScan;
+ cpath->path.parent = baserel;
+ cpath->path.param_info = get_baserel_parampathinfo(root, baserel,
+ required_outer);
+ if (cs_estimate_costs(root, baserel, rel, cpath, &attrs_used))
+ {
+ cpath->custom_name = pstrdup("cache scan");
+ cpath->custom_flags = 0;
+ cpath->custom_private
+ = list_make1(makeString(bms_to_string(attrs_used)));
+
+ add_path(baserel, &cpath->path);
+ }
+ }
+ heap_close(rel, NoLock);
+}
+
+static void
+cs_init_custom_scan_plan(PlannerInfo *root,
+ CustomScan *cscan_plan,
+ CustomPath *cscan_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ List *quals = NIL;
+ ListCell *lc;
+
+ /* should be a base relation */
+ Assert(cscan_path->path.parent->relid > 0);
+ Assert(cscan_path->path.parent->rtekind == RTE_RELATION);
+
+ /* extract the supplied RestrictInfo */
+ foreach (lc, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst(lc);
+ quals = lappend(quals, rinfo->clause);
+ }
+
+ /* do nothing something special pushing-down */
+ cscan_plan->scan.plan.targetlist = tlist;
+ cscan_plan->scan.plan.qual = quals;
+ cscan_plan->custom_private = cscan_path->custom_private;
+}
+
+typedef struct
+{
+ ccache_head *ccache;
+ ItemPointerData curr_ctid;
+ bool normal_seqscan;
+ bool with_construction;
+} cs_state;
+
+static void
+cs_begin_custom_scan(CustomScanState *node, int eflags)
+{
+ CustomScan *cscan = (CustomScan *)node->ss.ps.plan;
+ Relation rel = node->ss.ss_currentRelation;
+ EState *estate = node->ss.ps.state;
+ HeapScanDesc scandesc = NULL;
+ cs_state *csstate;
+ Bitmapset *attrs_used;
+ ccache_head *ccache;
+
+ csstate = palloc0(sizeof(cs_state));
+
+ attrs_used = bms_from_string(strVal(linitial(cscan->custom_private)));
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), attrs_used, true);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ if (ccache->status != CCACHE_STATUS_CONSTRUCTED)
+ {
+ LWLockRelease(ccache->lock);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status == CCACHE_STATUS_INITIALIZED)
+ {
+ ccache->status = CCACHE_STATUS_IN_PROGRESS;
+ csstate->with_construction = true;
+ scandesc = heap_beginscan(rel, SnapshotAny, 0, NULL);
+ }
+ else if (ccache->status == CCACHE_STATUS_IN_PROGRESS)
+ {
+ csstate->normal_seqscan = true;
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ }
+ }
+ LWLockRelease(ccache->lock);
+ csstate->ccache = ccache;
+
+ /* seek to the first position */
+ if (estate->es_direction == ForwardScanDirection)
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, 0);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, 0);
+ }
+ else
+ {
+ ItemPointerSetBlockNumber(&csstate->curr_ctid, MaxBlockNumber);
+ ItemPointerSetOffsetNumber(&csstate->curr_ctid, MaxOffsetNumber);
+ }
+ }
+ else
+ {
+ scandesc = heap_beginscan(rel, estate->es_snapshot, 0, NULL);
+ csstate->normal_seqscan = true;
+ }
+ node->ss.ss_currentScanDesc = scandesc;
+
+ node->custom_state = csstate;
+}
+
+/*
+ * cache_scan_needs_next
+ *
+ * We may fetch a tuple to be invisible because columner cache stores
+ * all the living tuples, including ones updated / deleted by concurrent
+ * sessions. So, it is a job of the caller to check MVCC visibility.
+ * It decides whether we need to move the next tuple due to the visibility
+ * condition, or not. If given tuple was NULL, it is obviously a time to
+ * break searching because it means no more tuples on the cache.
+ */
+static bool
+cache_scan_needs_next(HeapTuple tuple, Snapshot snapshot, Buffer buffer)
+{
+ bool visibility;
+
+ /* end of the scan */
+ if (!HeapTupleIsValid(tuple))
+ return false;
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ visibility = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
+
+ if (buffer != InvalidBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ return !visibility ? true : false;
+}
+
+static TupleTableSlot *
+cache_scan_next(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+ Relation rel = node->ss.ss_currentRelation;
+ HeapScanDesc scan = node->ss.ss_currentScanDesc;
+ TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
+ EState *estate = node->ss.ps.state;
+ Snapshot snapshot = estate->es_snapshot;
+ HeapTuple tuple;
+ Buffer buffer;
+
+ /* in case of fallback path, we don't need to something special. */
+ if (csstate->normal_seqscan)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, scan->rs_cbuf, false);
+ else
+ ExecClearTuple(slot);
+ return slot;
+ }
+ Assert(csstate->ccache != NULL);
+
+ /* elsewhere, we either run or construct the columner cache */
+ do {
+ ccache_head *ccache = csstate->ccache;
+
+ /*
+ * "with_construction" means the columner cache is under construction,
+ * so we need to fetch a tuple from heap of the target relation and
+ * insert it into the cache. Note that we use SnapshotAny to fetch
+ * all the tuples both of visible and invisible ones, so it is our
+ * responsibility to check tuple visibility according to snapshot or
+ * the current estate.
+ * It is same even when we fetch tuples from the cache, without
+ * referencing heap buffer.
+ */
+ if (csstate->with_construction)
+ {
+ tuple = heap_getnext(scan, estate->es_direction);
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (HeapTupleIsValid(tuple))
+ {
+ if (ccache_insert_tuple(ccache, rel, tuple))
+ LWLockRelease(ccache->lock);
+ else
+ {
+ /*
+ * If ccache_insert_tuple got failed, it usually means
+ * lack of shared memory and unable to continue
+ * construction of the columner cacher.
+ * So, we put is twice to reset its reference counter
+ * to zero and release shared memory blocks.
+ */
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ cs_put_ccache(ccache);
+ csstate->ccache = NULL;
+ }
+ }
+ else
+ {
+ /*
+ * If we reached end of the relation, it means the columner-
+ * cache become constructed.
+ */
+ ccache->status = CCACHE_STATUS_CONSTRUCTED;
+ LWLockRelease(ccache->lock);
+ }
+ buffer = scan->rs_cbuf;
+ }
+ else
+ {
+ LWLockAcquire(ccache->lock, LW_SHARED);
+ tuple = ccache_find_tuple(ccache->root_chunk,
+ &csstate->curr_ctid,
+ estate->es_direction);
+ if (HeapTupleIsValid(tuple))
+ {
+ ItemPointerCopy(&tuple->t_self, &csstate->curr_ctid);
+ tuple = heap_copytuple(tuple);
+ }
+ LWLockRelease(ccache->lock);
+ buffer = InvalidBuffer;
+ }
+ } while (cache_scan_needs_next(tuple, snapshot, buffer));
+
+ if (HeapTupleIsValid(tuple))
+ ExecStoreTuple(tuple, slot, buffer, buffer == InvalidBuffer);
+ else
+ ExecClearTuple(slot);
+
+ return slot;
+}
+
+static bool
+cache_scan_recheck(CustomScanState *node, TupleTableSlot *slot)
+{
+ return true;
+}
+
+static TupleTableSlot *
+cs_exec_custom_scan(CustomScanState *node)
+{
+ return ExecScan((ScanState *) node,
+ (ExecScanAccessMtd) cache_scan_next,
+ (ExecScanRecheckMtd) cache_scan_recheck);
+}
+
+static void
+cs_end_custom_scan(CustomScanState *node)
+{
+ cs_state *csstate = node->custom_state;
+
+ if (csstate->ccache)
+ {
+ ccache_head *ccache = csstate->ccache;
+ bool needs_remove = false;
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status == CCACHE_STATUS_IN_PROGRESS)
+ needs_remove = true;
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ if (needs_remove)
+ cs_put_ccache(ccache);
+ }
+ if (node->ss.ss_currentScanDesc)
+ heap_endscan(node->ss.ss_currentScanDesc);
+}
+
+static void
+cs_rescan_custom_scan(CustomScanState *node)
+{
+ elog(ERROR, "not implemented yet");
+}
+
+/*
+ * cache_scan_synchronizer
+ *
+ * trigger function to synchronize the columner-cache with heap contents.
+ */
+Datum
+cache_scan_synchronizer(PG_FUNCTION_ARGS)
+{
+ TriggerData *trigdata = (TriggerData *) fcinfo->context;
+ Relation rel = trigdata->tg_relation;
+ HeapTuple tuple = trigdata->tg_trigtuple;
+ HeapTuple newtup = trigdata->tg_newtuple;
+ HeapTuple result = NULL;
+ const char *tg_name = trigdata->tg_trigger->tgname;
+ ccache_head *ccache;
+
+ if (!CALLED_AS_TRIGGER(fcinfo))
+ elog(ERROR, "%s: not fired by trigger manager", tg_name);
+
+ ccache = cs_get_ccache(RelationGetRelid(rel), NULL, false);
+ if (!ccache)
+ return PointerGetDatum(newtup);
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ PG_TRY();
+ {
+ TriggerEvent tg_event = trigdata->tg_event;
+
+ if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_INSERT(tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, tuple);
+ result = tuple;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_UPDATE(tg_event))
+ {
+ ccache_insert_tuple(ccache, rel, newtup);
+ ccache_delete_tuple(ccache, tuple);
+ result = newtup;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_ROW(tg_event) &&
+ TRIGGER_FIRED_BY_DELETE(tg_event))
+ {
+ ccache_delete_tuple(ccache, tuple);
+ result = tuple;
+ }
+ else if (TRIGGER_FIRED_AFTER(tg_event) &&
+ TRIGGER_FIRED_FOR_STATEMENT(tg_event) &&
+ TRIGGER_FIRED_BY_TRUNCATE(tg_event))
+ {
+ if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
+ }
+ else
+ elog(ERROR, "%s: fired by unexpected context (%08x)",
+ tg_name, tg_event);
+ }
+ PG_CATCH();
+ {
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+
+ PG_RETURN_POINTER(result);
+}
+PG_FUNCTION_INFO_V1(cache_scan_synchronizer);
+
+/*
+ * ccache_on_object_access
+ *
+ * It dropps an existing columner-cache if the cached table was altered or
+ * dropped.
+ */
+static void
+ccache_on_object_access(ObjectAccessType access,
+ Oid classId,
+ Oid objectId,
+ int subId,
+ void *arg)
+{
+ ccache_head *ccache;
+
+ /* ALTER TABLE and DROP TABLE needs cache invalidation */
+ if (access != OAT_DROP && access != OAT_POST_ALTER)
+ return;
+ if (classId != RelationRelationId)
+ return;
+
+ ccache = cs_get_ccache(objectId, NULL, false);
+ if (!ccache)
+ return;
+
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+ if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
+ LWLockRelease(ccache->lock);
+ cs_put_ccache(ccache);
+}
+
+/*
+ * ccache_on_page_prune
+ *
+ * It is a callback function when a particular heap block got vacuumed.
+ * On vacuuming, its dead space, being allocated by dead tuples, got
+ * reclaimed and tuple's location was ought to be moved.
+ * This routine also reclaims the space by dead tuples on the columner
+ * cache according to layout changes on the heap.
+ */
+static void
+ccache_on_page_prune(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid)
+{
+ ccache_head *ccache;
+
+ /* call the secondary hook */
+ if (heap_page_prune_next)
+ (*heap_page_prune_next)(relation, buffer, ndeleted,
+ OldestXmin, latestRemovedXid);
+
+ ccache = cs_get_ccache(RelationGetRelid(relation), NULL, false);
+ if (ccache)
+ {
+ LWLockAcquire(ccache->lock, LW_EXCLUSIVE);
+
+ ccache_vacuum_page(ccache, buffer);
+
+ LWLockRelease(ccache->lock);
+
+ cs_put_ccache(ccache);
+ }
+}
+
+void
+_PG_init(void)
+{
+ CustomProvider provider;
+
+ if (IsUnderPostmaster)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cache_scan must be loaded via shared_preload_libraries")));
+
+ DefineCustomBoolVariable("cache_scan.disabled",
+ "turn on/off cache_scan feature on run-time",
+ NULL,
+ &cache_scan_disabled,
+ false,
+ PGC_USERSET,
+ GUC_NOT_IN_SAMPLE,
+ NULL, NULL, NULL);
+
+ /* initialization of cache subsystem */
+ ccache_init();
+
+ /* callbacks for cache invalidation */
+ object_access_next = object_access_hook;
+ object_access_hook = ccache_on_object_access;
+
+ heap_page_prune_next = heap_page_prune_hook;
+ heap_page_prune_hook = ccache_on_page_prune;
+
+ /* registration of custom scan provider */
+ add_scan_path_next = add_scan_path_hook;
+ add_scan_path_hook = cs_add_scan_path;
+
+ memset(&provider, 0, sizeof(provider));
+ strncpy(provider.name, "cache scan", sizeof(provider.name));
+ provider.InitCustomScanPlan = cs_init_custom_scan_plan;
+ provider.BeginCustomScan = cs_begin_custom_scan;
+ provider.ExecCustomScan = cs_exec_custom_scan;
+ provider.EndCustomScan = cs_end_custom_scan;
+ provider.ReScanCustomScan = cs_rescan_custom_scan;
+
+ register_custom_provider(&provider);
+}
diff --git a/doc/src/sgml/cache-scan.sgml b/doc/src/sgml/cache-scan.sgml
new file mode 100644
index 0000000..c4cc165
--- /dev/null
+++ b/doc/src/sgml/cache-scan.sgml
@@ -0,0 +1,224 @@
+<!-- doc/src/sgml/cache-scan.sgml -->
+
+<sect1 id="cache-scan" xreflabel="cache-scan">
+ <title>cache-scan</title>
+
+ <indexterm zone="cache-scan">
+ <primary>cache-scan</primary>
+ </indexterm>
+
+ <sect2>
+ <title>Overview</title>
+ <para>
+ The <filename>cache-scan</> module provides an alternative way to scan
+ relations using on-memory columner cache, instead of usual heap scan,
+ in case when previous scan already holds contents of the table on the
+ cache.
+ Unlike buffer cache, it holds contents of the limited number of columns,
+ but not whole of the record, thus it allows to hold larger number of records
+ per same amount of RAM. Probably, this characteristic makes sense to run
+ analytic queries on a table with many columns and records.
+ </para>
+ <para>
+ Once this module gets loaded, it registers itself as a custom-scan provider.
+ It allows to provide an additional scan path on regular relations using
+ on-memory columner cache, instead of regular heap scan.
+ It also performs as a proof-of-concept implementation that works on
+ the custom-scan API that enables to extend the core executor system.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Installation</title>
+ <para>
+ This module has to be loaded using
+ <xref linkend="guc-shared-preload-libraries"> parameter to acquired
+ a particular amount of shared memory on startup time.
+ In addition, the relation to be cached has special triggers, called
+ synchronizer, are implemented with <literal>cache_scan_synchronizer</>
+ function that synchronizes the cache contents according to the latest
+ heap on <command>INSERT</>, <command>UPDATE</>, <command>DELETE</> or
+ <command>TRUNCATE</>.
+ </para>
+ <para>
+ You can run this extension according to the following steps.
+ </para>
+ <procedure>
+ <step>
+ <para>
+ Adjust <xref linkend="guc-shared-preload-libraries"> parameter to
+ load <filename>cache_scan</> binary on startup time, then restart
+ the postmaster.
+ </para>
+ </step>
+ <step>
+ <para>
+ Run <xref linkend="sql-createextension"> to create synchronizer
+ function of <filename>cache_scan</>.
+<programlisting>
+CREATE EXTENSION cache_scan;
+</programlisting>
+ </para>
+ </step>
+ <step>
+ <para>
+ Create triggers of synchronizer on the target relation.
+<programlisting>
+CREATE TRIGGER t1_cache_row_sync
+ AFTER INSERT OR UPDATE OR DELETE ON t1 FOR ROW
+ EXECUTE PROCEDURE cache_scan_synchronizer();
+CREATE TRIGGER t1_cache_stmt_sync
+ AFTER TRUNCATE ON t1 FOR STATEMENT
+ EXECUTE PROCEDURE cache_scan_synchronizer();
+</programlisting>
+ </para>
+ </step>
+ </procedure>
+ </sect2>
+
+ <sect2>
+ <title>How does it works</title>
+ <para>
+ This module performs according to the usual fashion of
+ <xref linkend="custom-scan">.
+ It offers an alternative way to scan a relation if relation has synchronizer
+ triggers and width of referenced columns are less than 30% of average
+ record width.
+ Then, query optimizer will pick up the cheapest path. If the path chosen
+ is a custom-scan path managed by <filename>cache_scan</>, it runs on the
+ target relation using columner cache.
+ On the first time running, it tries to construct relation's cache along
+ with regular sequential scan. Next time or later, it can run on
+ the columner cache without referencing the heap.
+ </para>
+ <para>
+ You can check whether the query plan uses <filename>cache_scan</> using
+ <xref linkend="sql-explain"> command, as follows:
+<programlisting>
+postgres=# EXPLAIN (costs off) SELECT a,b FROM t1 WHERE b < pi();
+ QUERY PLAN
+----------------------------------------------------
+ Custom Scan (cache scan) on t1
+ Filter: (b < 3.14159265358979::double precision)
+(2 rows)
+</programlisting>
+ </para>
+ <para>
+ A columner cache, associated with a particular relation, has one or more chunks
+ that performs as node or leaf of t-tree structure.
+ The <literal>cache_scan_debuginfo()</> function can dump useful informationl;
+ properties of all the active chunks as follows.
+<programlisting>
+postgres=# SELECT * FROM cache_scan_debuginfo();
+ tableoid | status | chunk | upper | l_depth | l_chunk | r_depth | r_chunk | ntuples | usage | min_ctid | max_ct
+id
+----------+-------------+----------------+----------------+---------+----------------+---------+----------------+---------+---------+-----------+-----------
+ 16400 | constructed | 0x7f2b8ad84740 | 0x7f2b8af84740 | 0 | (nil) | 0 | (nil) | 29126 | 233088 | (0,1) | (677,15)
+ 16400 | constructed | 0x7f2b8af84740 | (nil) | 1 | 0x7f2b8ad84740 | 2 | 0x7f2b8b384740 | 29126 | 233088 | (677,16) | (1354,30)
+ 16400 | constructed | 0x7f2b8b184740 | 0x7f2b8b384740 | 0 | (nil) | 0 | (nil) | 29126 | 233088 | (1354,31) | (2032,2)
+ 16400 | constructed | 0x7f2b8b384740 | 0x7f2b8af84740 | 1 | 0x7f2b8b184740 | 1 | 0x7f2b8b584740 | 29126 | 233088 | (2032,3) | (2709,33)
+ 16400 | constructed | 0x7f2b8b584740 | 0x7f2b8b384740 | 0 | (nil) | 0 | (nil) | 3478 | 1874560 | (2709,34) | (2790,28)
+(5 rows)
+</programlisting>
+ </para>
+ <para>
+ All the cached tuples are indexed with <literal>ctid</> order, and each chunk has
+ an array of partial tuples with min- and max- values. Its left node is linked to
+ the chunks that have tuples with smaller <literal>ctid</>, and its right node is
+ linked to the chunks that have larger ones.
+ It enables to find out tuples in timely fashion when it needs to be invalidated
+ according to heap updates by DDL, DML or vacuuming.
+ </para>
+ <para>
+ The columner cache are not owned by a particular session, so it retains the cache
+ unless it does not dropped or postmaster does not restart.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>GUC Parameters</title>
+ <variablelist>
+ <varlistentry id="guc-cache-scan-block_size" xreflabel="cache_scan.block_size">
+ <term><varname>cache_scan.block_size</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.block_size</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls length of the block on shared memory segment
+ for the columner-cache. It needs to restart postmaster for validation.
+ </para>
+ <para>
+ <filename>cache_scan</> module acquires <literal>cache_scan.num_blocks</>
+ x <literal>cache_scan.block_size</> bytes of shared memory segment on
+ the startup time, then allocates them for columner cache on demand.
+ Too large block size damages flexibility of memory assignment, and
+ too small block size consumes much management are for each block.
+ So, we recommend to keep is as the default value; that is 2MB per block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-num_blocks" xreflabel="cache_scan.num_blocks">
+ <term><varname>cache_scan.num_blocks</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.num_blocks</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls number of the block on shared memory segment
+ for the columner-cache. It needs to restart postmaster for validation.
+ </para>
+ <para>
+ <filename>cache_scan</> module acquires <literal>cache_scan.num_blocks</>
+ x <literal>cache_scan.block_size</> bytes of shared memory segment on
+ the startup time, then allocates them for columner cache on demand.
+ Too small number of blocks damages flexibility of memory assignment
+ and may cause undesired cache dropping.
+ So, we recommend to set enough number of blocks to keep contents of
+ the target relations on memory.
+ Its default is <literal>64</literal>; probably too small for most of
+ real use cases.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-hash_size" xreflabel="cache_scan.hash_size">
+ <term><varname>cache_scan.hash_size</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.hash_size</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls width of the internal hash table slots; that
+ link every columnar cache distributed by table's oid.
+ Its default is <literal>128</>; no need to adjust it usually.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-cache-scan-max_cached_attnum" xreflabel="cache_scan.max_cached_attnum">
+ <term><varname>cache_scan.max_cached_attnum</> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>cache_scan.max_cached_attnum</> configuration parameter</>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the maximum attribute number we can cache on
+ the columner cache. Because of internal data representation, a bitmap set
+ to track attributes being cached has to be fixed-length.
+ Thus, the largest attribute number needs to be fixed preliminary.
+ Its default is <literal>128</>; although most tables likely have less than
+ 100 columns.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+ <sect2>
+ <title>Author</title>
+ <para>
+ KaiGai Kohei <email>kaigai@kaigai.gr.jp</email>
+ </para>
+ </sect2>
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 2002f60..3d8fd05 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -107,6 +107,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
&auto-explain;
&btree-gin;
&btree-gist;
+ &cache-scan;
&chkpass;
&citext;
&ctidscan;
diff --git a/doc/src/sgml/custom-scan.sgml b/doc/src/sgml/custom-scan.sgml
index f53902d..218a5fd 100644
--- a/doc/src/sgml/custom-scan.sgml
+++ b/doc/src/sgml/custom-scan.sgml
@@ -55,6 +55,20 @@
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><xref linkend="cache-scan"></term>
+ <listitem>
+ <para>
+ This custom scan in this module enables a scan refering the on-memory
+ columner cache instead of the heap, if the target relation already has
+ this cache being constructed already.
+ Unlike buffer cache, it holds limited number of columns that have been
+ referenced before, but not all the columns in the table definition.
+ Thus, it allows to cache much larger number of records on-memory than
+ buffer cache.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
<para>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index aa2be4b..10c7666 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -103,6 +103,7 @@
<!ENTITY auto-explain SYSTEM "auto-explain.sgml">
<!ENTITY btree-gin SYSTEM "btree-gin.sgml">
<!ENTITY btree-gist SYSTEM "btree-gist.sgml">
+<!ENTITY cache-scan SYSTEM "cache-scan.sgml">
<!ENTITY chkpass SYSTEM "chkpass.sgml">
<!ENTITY citext SYSTEM "citext.sgml">
<!ENTITY ctidscan SYSTEM "ctidscan.sgml">
pgsql-v9.4-heap_page_prune_hook.v1.patchapplication/octet-stream; name=pgsql-v9.4-heap_page_prune_hook.v1.patchDownload
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 27cbac8..1fb5f4a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -42,6 +42,9 @@ typedef struct
bool marked[MaxHeapTuplesPerPage + 1];
} PruneState;
+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
/* Local functions */
static int heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
* and update FSM with the remaining space.
*/
+ /*
+ * This callback allows extensions to synchronize their own status with
+ * heap image on the disk, when this buffer page is vacuumed.
+ */
+ if (heap_page_prune_hook)
+ (*heap_page_prune_hook)(relation,
+ buffer,
+ ndeleted,
+ OldestXmin,
+ prstate.latestRemovedXid);
return ndeleted;
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bfdadc3..9775aad 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -164,6 +164,13 @@ extern void heap_restrpos(HeapScanDesc scan);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
+typedef void (*heap_page_prune_hook_type)(Relation relation,
+ Buffer buffer,
+ int ndeleted,
+ TransactionId OldestXmin,
+ TransactionId latestRemovedXid);
+extern heap_page_prune_hook_type heap_page_prune_hook;
+
extern void heap_page_prune_opt(Relation relation, Buffer buffer,
TransactionId OldestXmin);
extern int heap_page_prune(Relation relation, Buffer buffer,
pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patchapplication/octet-stream; name=pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patchDownload
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
+ *
+ * In case when the supplied HeapTuple is not associated with a particular
+ * buffer, it just returns without any jobs. It may happen when an extension
+ * caches tuple with their own way.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
+ if (BufferIsInvalid(buffer))
+ return;
+
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
On Sat, Feb 8, 2014 at 1:09 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
Hello,
Because of time pressure in the commit-fest:Jan, I tried to simplifies the
patch
for cache-only scan into three portions; (1) add a hook on heap_page_prune
for cache invalidation on vacuuming a particular page. (2) add a check to
accept
InvalidBuffer on SetHintBits (3) a proof-of-concept module of cache-only
scan.(1) pgsql-v9.4-heap_page_prune_hook.v1.patch
Once on-memory columnar cache is constructed, then it needs to be
invalidated
if heap page on behalf of the cache is modified. In usual DML cases,
extension
can get control using row-level trigger functions for invalidation,
however, we right
now have no way to get control on a page is vacuumed, usually handled by
autovacuum process.
This patch adds a callback on heap_page_prune(), to allow extensions to
prune
dead entries on its cache, not only heap pages.
I'd also like to see any other scenario we need to invalidate columnar
cache
entries, if exist. It seems to me object_access_hook makes sense to conver
DDL and VACUUM FULL scenario...(2) pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patch
In case when we want to check visibility of the tuples on cache entries
(thus
no particular shared buffer is associated) using
HeapTupleSatisfiesVisibility,
it internally tries to update hint bits of tuples. However, it does
not make sense
onto the tuples being not associated with a particular shared buffer.
Due to its definition, tuple entries being on cache does not connected with
a particular shared buffer. If we need to load whole of the buffer page to
set
hint bits, it is totally nonsense because the purpose of on-memory cache is
to reduce disk accesses.
This patch adds an exceptional condition on SetHintBits() to skip anything
if the given buffer is InvalidBuffer. It allows to check tuple
visibility using regular
visibility check functions, without re-invention of the wheel by
themselves.(3) pgsql-v9.4-contrib-cache-scan.v1.patch
Unlike (1) and (2), this patch is just a proof of the concept to
implement cache-
only scan on top of the custom-scan interface.
It tries to offer an alternative scan path on the table with row-level
triggers for
cache invalidation if total width of referenced columns are less than 30%
of the
total width of table definition. Thus, it can keep larger number of
records with
meaningful portion on the main memory.
This cache shall be invalidated according to the main heap update. One is
row-level trigger, second is object_access_hook on DDL, and the third is
heap_page_prune hook. Once a columns reduced tuple gets cached, it is
copied to the cache memory from the shared buffer, so it needs a feature
to ignore InvalidBuffer for visibility check functions.
I reviewed all the three patches. The first 1 and 2 core PostgreSQL patches
are fine.
And I have comments in the third patch related to cache scan.
1. +# contrib/dbcache/Makefile
Makefile header comment is not matched with file name location.
2.+ /*
+ * Estimation of average width of cached tuples - it does not make
+ * sense to construct a new cache if its average width is more than
+ * 30% of the raw data.
+ */
Move the estimation of average width calculation of cached tuples into
the case where the cache is not found,
otherwise it is an overhead for cache hit scenario.
3. + if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
can't we need the check to see the average width is more than 30%?
During estimation it doesn't
include the existing other attributes.
4. + lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);
I think it should be lchunk->r_depth needs to be set in a clock wise
rotation.
5. can you add some comments in the code with how the block is used?
6. In do_insert_tuple function I felt moving the tuples and rearranging
their addresses is little bit costly. How about the following way?
Always insert the tuple from the bottom of the block where the empty
space is started and store their corresponding reference pointers
in the starting of the block in an array. As and when the new tuple
inserts this array increases from block start and tuples from block end.
Just need to sort this array based on item pointers, no need to update
their reference pointers.
In this case the movement is required only when the tuple is moved from
one block to another block and also whenever if the continuous
free space is not available to insert the new tuple. you can decide
based on how frequent the sorting will happen in general.
7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups);
can go wrong when only one tuple present in the block
with the equal item pointer what we are searching in the forward scan
direction.
8. I am not able to find a protection mechanism in insert/delete and etc of
a tuple in Ttree. As this is a shared memory it can cause problems.
9. + /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);
calling the merge chunks for every call back of heap page prune is a
overhead for vacuum. After the merge which may again leads
to node splits because of new data.
10. "columner" is present in some places of the patch. correct it.
11. In cache_scan_next function, incase of cache insert fails because of
shared memory the tuple pointer is not reset and cache is NULL.
Because of this during next record fetch it leads to assert as cache !=
NULL.
12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);
The cache is created with refcnt as 2 and in some times two times put
cache is called to eliminate it and in some times with a different approach.
It is little bit confusing, can you explain in with comments with why 2
is required and how it maintains?
13. A performance report is required to see how much impact it can cause on
insert/delete/update operations because of cache synchronizer.
14. The Guc variable "cache_scan_disabled" is missed in docs description.
please let me know if you need any support.
Regards,
Hari Babu
Fujitsu Australia
2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
I reviewed all the three patches. The first 1 and 2 core PostgreSQL patches
are fine.
And I have comments in the third patch related to cache scan.
Thanks for your volunteering.
1. +# contrib/dbcache/Makefile
Makefile header comment is not matched with file name location.
Ahh, it's an old name when I started to implement.
2.+ /* + * Estimation of average width of cached tuples - it does not make + * sense to construct a new cache if its average width is more than + * 30% of the raw data. + */Move the estimation of average width calculation of cached tuples into
the case where the cache is not found,
otherwise it is an overhead for cache hit scenario.
You are right. If and when existing cache is found and match, its width is
obviously less than 30% of total width.
3. + if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);can't we need the check to see the average width is more than 30%? During
estimation it doesn't
include the existing other attributes.
Indeed. It should drop some attributes on the existing cache if total average
width grows more than the threshold. Probably, we need to have a statistical
variable to track how many times or how recently referenced.
4. + lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);I think it should be lchunk->r_depth needs to be set in a clock wise
rotation.
Oops, nice cache.
5. can you add some comments in the code with how the block is used?
Sorry, I'll add it. A block is consumed from the head to store pointers of
tuples, and from the tail to store contents of the tuples. A block can hold
multiple tuples unless usage of tuple pointers from the head does not cross
the area for tuple contents. Anyway, I'll put it on the source code.
6. In do_insert_tuple function I felt moving the tuples and rearranging
their addresses is little bit costly. How about the following way?Always insert the tuple from the bottom of the block where the empty
space is started and store their corresponding reference pointers
in the starting of the block in an array. As and when the new tuple
inserts this array increases from block start and tuples from block end.
Just need to sort this array based on item pointers, no need to update
their reference pointers.In this case the movement is required only when the tuple is moved from
one block to another block and also whenever if the continuous
free space is not available to insert the new tuple. you can decide based
on how frequent the sorting will happen in general.
It seems to me a reasonable suggestion.
Probably, an easier implementation is replacing an old block with dead-
spaces by a new block that contains only valid tuples, if and when dead-
space grows threshold of block-usage.
7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups); can
go wrong when only one tuple present in the block
with the equal item pointer what we are searching in the forward scan
direction.
It shouldn't happen, because the first or second ItemPointerCompare will
handle the condition. Please assume the cchunk->ntups == 1. In this case,
any given ctid shall match either of them, because any ctid is less, equal or
larger to the tuple being only cached, thus, it moves to the right or left node
according to the scan direction.
8. I am not able to find a protection mechanism in insert/delete and etc of
a tuple in Ttree. As this is a shared memory it can cause problems.
For design simplification, I put a giant lock per columnar-cache.
So, routines in cscan.c acquires exclusive lwlock prior to invocation of
ccache_insert_tuple / ccache_delete_tuple.
9. + /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);calling the merge chunks for every call back of heap page prune is a
overhead for vacuum. After the merge which may again leads
to node splits because of new data.
OK, I'll check the condition to merge the chunks, to prevent too frequent
merge / split.
10. "columner" is present in some places of the patch. correct it.
Ahh, it should be "columnar".
11. In cache_scan_next function, incase of cache insert fails because of
shared memory the tuple pointer is not reset and cache is NULL.
Because of this during next record fetch it leads to assert as cache !=
NULL.
You are right. I had to modify the state of scan as if normal-seqscan path,
not just setting NULL on csstate->ccache.
I left an older manner during try & error during implementation.
12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);The cache is created with refcnt as 2 and in some times two times put
cache is called to eliminate it and in some times with a different approach.
It is little bit confusing, can you explain in with comments with why 2
is required and how it maintains?
I thought, 2 is same as create + get, so putting the cache at end of the scan
does not release the cache. However, it might be confusing as you pointed out.
The process who create the cache knows it is the creator process. So, all it
needs to do is exiting the scan without putting the cache, if it successfully
create the cache and wants to leave the cache for later scan.
13. A performance report is required to see how much impact it can cause on
insert/delete/update operations because of cache synchronizer.
OK, I'll try to measure the difference between them on next patch submission.
14. The Guc variable "cache_scan_disabled" is missed in docs description.
OK,
Thanks, I'll submit a revised one within a couple of days.
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 13, 2014 at 2:42 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups);
can
go wrong when only one tuple present in the block
with the equal item pointer what we are searching in the forward scan
direction.It shouldn't happen, because the first or second ItemPointerCompare will
handle the condition. Please assume the cchunk->ntups == 1. In this case,
any given ctid shall match either of them, because any ctid is less, equal
or
larger to the tuple being only cached, thus, it moves to the right or left
node
according to the scan direction.
yes you are correct. sorry for the noise.
8. I am not able to find a protection mechanism in insert/delete and
etc of
a tuple in Ttree. As this is a shared memory it can cause problems.
For design simplification, I put a giant lock per columnar-cache.
So, routines in cscan.c acquires exclusive lwlock prior to invocation of
ccache_insert_tuple / ccache_delete_tuple.
Correct. But this lock can be a bottleneck for the concurrency. Better to
analyze the same once we have the performance report.
Regards,
Hari Babu
Fujitsu Australia
8. I am not able to find a protection mechanism in insert/delete
and etc of
a tuple in Ttree. As this is a shared memory it can cause problems.
For design simplification, I put a giant lock per columnar-cache.
So, routines in cscan.c acquires exclusive lwlock prior to
invocation of
ccache_insert_tuple / ccache_delete_tuple.Correct. But this lock can be a bottleneck for the concurrency. Better to
analyze the same once we have the performance report.
Well, concurrent updates towards a particular table may cause lock contention
due to a giant lock.
On the other hands, one my headache is how to avoid dead-locking if we try to
implement it using finer granularity locking. Please assume per-chunk locking.
It also needs to take a lock on the neighbor nodes when a record is moved out.
Concurrently, some other process may try to move another record with inverse
order. That is a ticket for dead-locking.
Is there idea or reference to implement concurrent tree structure updating?
Anyway, it is a good idea to measure the impact of concurrent updates on
cached tables, to find out the significance of lock splitting.
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
-----Original Message-----
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
Sent: Thursday, February 13, 2014 8:31 AM
To: Kohei KaiGai
Cc: Kaigai, Kouhei(海外, 浩平); Tom Lane; PgHacker; Robert Haas
Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
table scan?)On Thu, Feb 13, 2014 at 2:42 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
2014-02-12 14:59 GMT+09:00 Haribabu Kommi
<kommi.haribabu@gmail.com>:7. In ccache_find_tuple function this Assert(i_min + 1 <
cchunk->ntups); can
go wrong when only one tuple present in the block
with the equal item pointer what we are searching in theforward scan
direction.
It shouldn't happen, because the first or second ItemPointerCompare
will
handle the condition. Please assume the cchunk->ntups == 1. In this
case,
any given ctid shall match either of them, because any ctid is less,
equal or
larger to the tuple being only cached, thus, it moves to the right
or left node
according to the scan direction.yes you are correct. sorry for the noise.
8. I am not able to find a protection mechanism in insert/delete
and etc of
a tuple in Ttree. As this is a shared memory it can cause problems.
For design simplification, I put a giant lock per columnar-cache.
So, routines in cscan.c acquires exclusive lwlock prior to
invocation of
ccache_insert_tuple / ccache_delete_tuple.Correct. But this lock can be a bottleneck for the concurrency. Better to
analyze the same once we have the performance report.Regards,
Hari BabuFujitsu Australia
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 13, 2014 at 3:27 PM, Kouhei Kaigai wrote:
8. I am not able to find a protection mechanism in insert/delete
and etc of
a tuple in Ttree. As this is a shared memory it can cause
problems.
For design simplification, I put a giant lock per columnar-cache.
So, routines in cscan.c acquires exclusive lwlock prior to
invocation of
ccache_insert_tuple / ccache_delete_tuple.Correct. But this lock can be a bottleneck for the concurrency. Better to
analyze the same once we have the performance report.Well, concurrent updates towards a particular table may cause lock
contention
due to a giant lock.
On the other hands, one my headache is how to avoid dead-locking if we try
to
implement it using finer granularity locking. Please assume per-chunk
locking.
It also needs to take a lock on the neighbor nodes when a record is moved
out.
Concurrently, some other process may try to move another record with
inverse
order. That is a ticket for dead-locking.Is there idea or reference to implement concurrent tree structure updating?
Anyway, it is a good idea to measure the impact of concurrent updates on
cached tables, to find out the significance of lock splitting.
we can do some of the following things,
1. Let only insert can take the exclusive lock.
2. Always follow the locking order from root to the children.
3. For delete take exclusive lock only on the exact node where the delete
is happening.
and etc, we will identify some more based on the performance data.
And one more interesting document i found in the net while searching for
the concurrency in Ttree,
which says that B-tree can outperform Ttree as an in-memory index also with
an little bit expensive of more memory
usage. The document is attached in the mail.
Regards,
Hari Babu
Fujitsu Australia
Attachments:
T-Tree_or_B-Tree_Main_Memory_Database_Index_Structure_Revisited.pdfapplication/pdf; name=T-Tree_or_B-Tree_Main_Memory_Database_Index_Structure_Revisited.pdfDownload
%PDF-1.3
%����
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x��}[��u����+��i�����>�(�%��p(lY� ��p!;��;��2���TVu7��bc�����<��������*�����W�~�e�{�������8�^M������|��E������G��?���}���o��Q�C��o������g���CU5m�{|�&v�>t����#3�����0���rlc���zt������|8W�������
���������x���m=���<j�U��m�� ����U�N?����n�8t;���nSi���{c�������V�������`�.E��M3������Cv����f���������������6}�<�,�[��t#V���
��
��� W ��W�����x�n�z��-��q8����}��!����m�5 ���
=��H4���U3
V�������W��_������[r�����j*�����y L����[}8��[z��D�����Zu������i�������3xG��g���-|8�)���o2D����'}O���K*d_�~����|I�q�jS��~�
{����SJ@�9�T��BO>�Us������]���@]C���ur�����5�V/��d��NJ�����f�9��Q��v<�i0�{�&���������d\s��}w�[���2�_���~��N����������;�l���{��������m7L3�C�O8���E���G��?��BV�����y^���w�������Z�[������c����}�%<_��_���;8�}��< �n�����!����O�����u�s�Mc�C����uw�85��3��e�I3��O�W�'[U�������z�h��w];:J;j=��;P�0�����!��9u����v��$��3�(��{;��\��~5TM����}��~���|�(�4F� �~.~y3�;$`���77�I� {@BN+if���.h�@���{T��{AA�� 5��������Q��G�4�E�� ����F[G���:@<�g�'^�����t��u5��O<w �9�2�'��m~O�����jW�
�!�e� ������lwh���^aR�����kU<<�b����i�n�
�`:��)��Z��=��uW���J���r�i����\���\i��+H� ��2��������[����ss�� ���D���&DSG�<X�-`��S�����S�)�����Fv;�a����i�.�zIc�g���A j[5����{���Uae�y��M�����7��P�;j
��5=6����~������+_���������1��B*"�o���:z�vp��kw�Gh�>�y���?&�a�4�*=�@n ���5V|�u dF?���rNH�z�����f����_A��CTK�'M~n����=4��(wk���n��W����B��h����5��m�������Cc5� �����SCv��?\9/�=�jZ����������
`����\$�z�le]�[l��P���#������{�,�OS�?���M|Q_�+������)�q�hZ����(cQ>���0��j�������MO��m�ERwc�3,� ��]�����g�����lePnj��=�x�v 8�1N�[�hhV��N7y������������*�tV���W�&7�0�[x<�7����:�Ei���K�w���'{�"����Z;���zO����v��]�'����iTR�1{+;�`��7\R���J�F��+!oH�����<{�|��=B�g� ���/�j6<����Z����2w�|9u�8�:��z?���{g{������}������k�8�w���r���������*�PN�����=��Q��o{����'6�d�(���g�9��������M���W���A�?� ��*������O��W��qg�x�������GMSwQU~����p6��9�}�����N�6�k��U�}����������� ��?�����kC.�KZ�'�{H��[A"LB�=B�tx��{w=�=L��H���^;t B��L��W�0D1at��b8�F���9���.����YG����Cd�N
�u����6������+x�CXA}�i�������'�h
��&f}���&��9�$��o���(����a��H_/v������X��}"E����X�P��4V��7dq�o6�x�=�d�L���� � )%h�Nv7�]����I+��� ��pg���M '1�E �����z�2a�3�c��7�{>e�BD����3N��vG
��!�f3-�#�����d���h��T`u����tw��:������d,�3rB������!�Vwi���"$7���vx-=���qY�����{p���y8��Kzs�V"��)n���:���]��F�������1�[�q��e�C��DK9�stL@b����W��� ��'���#�7�����ID_�9�$��O<���E:o����i-����}���|�vOV6��9=%2��?����m�D���_�7�2L��O���\p���{N��Yo��S��)������Kl�CY�5�-B)���$Y��J����U�R.����,���3t
g�I������QH��_`O��a���K�Q-�0Pg��Y���&*��UP@OK6�5Q�H��+�y,"��{p.����r �m��Nq��n?�X=�V����8$����[|����-����0�"�)���@�;������s"I7��l���\
�����Ew"�m�|��
.�eU�^i:��q�������u���A�v�d�-S�%���.i�'Q�r�F0����8l�y�u;|L��K���k|Q��,Rn��Y������D��
��)�~2�9z�����0hX�1������N�Q�
D�j��4tf��4��
����
�,t�"HB�������%c�vy�D>� ��f�SC�[���K��o)Iu�8JFk��Cf���sL?n�1�)��x�y*"z]����
R ��&��4��'��R?�a�T�=b�`�S���X��������W�2q^���.!i*w9��F0�C�bx��v���^���r����[lU���,~xRM�>��:��m��
�B�f�
��<����G`�n��`�/0�����v����MK+� ���<��:�~AW7���_�7��S�
��N/1����(���w����7t>J38�W���'�E;�l�r�hS�q��1�T��� ����"����J���
>���d^�Y�1cZ}�t9A�)��[Q��������������\�3Og-D
�
�@!�x��]����
�
�OZdK}O��j�[B��zZ�����Y���T�� ����%��h��1�����I���������P���X�dA��������'I�Gw^u������h5U�j��y�VczE��/k�P �?��������/_���8���
/Tt�b�m��_���W�|C^h��#������CAd�p�V 8����q��:�P9�A������L�Yh8gF{�eoHT\�^��n��
)�8�n��"�� 5���zpM���_�S������+��X
��w�g�C ��HBU���A���AC�nI���,(A\��d�l
�����|��&z�5T ����iL
t��'������j*������k�@d�=~����������jE.�$D���(�(dy}2�~=��+���2���U��b�k���������%P
����nF�u��A���+�"��0cp42d�U
B��%�c��C���!�CR��m2�`,�6�\;�go��$^��M�]���������Jw�L�a����`l9�Wt �?o���0H�"��Mm��Y;[�Q����jb���������S�}{L��\+.�4-�����S��Zp_�"����X��:4 ���C�G��
���������h��'p��%ZX��X��H���"u"3��fR4Me���
�?Fj�d�*n�%��z�h'���.��(��i�x!E11u ����K��K��~u7d��?���O��_�����/��YI/m��]~��O;�\6���
wg��HTP����_��k�RK*������stRXT�$��[@��qdE��P�{M,&��� Y�~O&%sX:�kmO�W��>�NT��u���V{m�
���;�����WO'0,_A�[�;/d��ZW�KX��j����):���m�Rs$�����7�o�> 8�i���Th��� L�2�_�eP A }M��>�Bp5�����a��������W^��D%��$+�
����{:�H������w����= ��������B ���{��j���������}��&��wIT&�Wdb��y��9�x�5c�����S|fI���
�P�#����g�
UX+}+����T�����k�G�U���$.a��z����@��h��2� fQH�� ��@Z��M��CD �R����T��Y�QC ��N�
G����,p����
�#|����kB;"��b�h[��og�@��t��]�^��x�����f��L-����w�"G�s�mQ��*T^��x��rt��s�� dlC!I�&�1��(�����9�?����� ��)����]�vT��������a�&XX�~�'?=����[~E� O_(�W������C d|7�',������-�r�Z������y�*oW�
d$f!k�"���:��a-�{v��f��%Y1aw��t�����c,t��oU�B;��� �K~g���t�tzs�F�J�i\�V�3X����gv��p8�8��$�=�A�w�ah� ()TO$6�;RO|m?V���Yo�k��j=G�������,������ @����0��H[���<V:���,�Hr��@D"@*�I���!��C�h�c� �7�S�2��2�u+�j
��-����c&g�F�C��z}�
%m��Nf�d�m0&��-��+E��"AFb����sp}`���Z�d�Og��)�Q�3b�E�_�cm����ll�����X�C���W����xXm�jZ�*~/H���z��1>qA�����b@���='Ra���� �b{�
&�������Z���J�`�!���p;�X ,k�����R*"!��S
�t���;��C��%��^<��0�����5���]8wz�8Gy���yz" #=�.?�}#����)
m��0��P(\����{I�|���f�k_�>F�<��c�E!�HJ:��]*�'0� ����1����t�s�|-��8��sD���1��ddI'Ek�Dgf� ���]*�t���k�i���5�[�,���k��$��="Ro��o��S��b��9T����F{2���"TP +e�#%A#T/���51�{[�H�O0U6(�}�JIC^����lM�������Yx���Kf�[����<��t
�L��_�:l��R9'y�� "��];c�Mr�]W0�����wS*���7�F���9��w�l2�Ung}';�q�S�h��m`W�7����I��eM���;�A��sY�eU��Q�t���T����9�\��M_�Z �l���o��!]: �R[,E,*��6KF��Xf\�<���r�,+�\i� xx ��-e�D�H�`M��j�5�n�_���e
�&dMc[��2t�Z5w�����d��� ���78����gU�26�n�=fV��"m���R2F3nT�C{��Cl&l��(�2��By�,�W�B�B��Rb9(��#�����G�)'+��FN���LI�["6��3�E������=a H����9��`�^`[pB�SR]>���Q��
�Gli�<}A|�$ 7<tL�&��9��P�B�u��&�>�%���u(��%�E �o%��b��E�8���y�){�)8^��W9e<�{����Z�vR���v�V?��R��55���s��gE3�J�35���0����pty�:(�5�-�@%cP�|+!�<n��-�>�0<�[�>���]X����5���a�������s5e���>� I�aS���];��u�o���C���H���d��Di/��K�<���F�A�hr������p��7h�9�����rP���d�b�'�;#�K��-p���m,T����6��q�!�fB�}������0Y������4�v����QTp��_rabY��U��\����h��y�hx4O�Py���L.u�L9� Y-`�9p���
���k
���\���v,N��3���������q��D�����-�a\�����4��������)%H�����=� ��������D���Up��D�s��8�O�!/�eem�&��R|+Vc.���7����j��tG�L�7I�pT��#�����@ft���r���3IH�TO��3)T�/�����B��# �����j\-/�8.��/����������&�f�:�I��j�W�W�c�q���5@-�� ����k�3��y%nSO��
�:�Wx��,����Z(������^3�f���UOl��?G]���` �@�{s$��tu��K]��W�2c�\��d1�"�0R�b�"�ba��3WuBe�� !�PL��K����,[IQ#�+��������>���B�.bc�X��L@(�hd ����a7��L�Y����>5�0�ot�
�ux�
|#���T%�O��aX�\�HZ���*V�v��;�x�sjV��O�y��)��c$������K�������J!0}�7/��ty@���c(�r}�{O��z��y�<����0dK����b3��T�U��4V�������cI��=t >aj����Z�i��j�nR��a ��������������n0��Gq���n*>l�����A��\p�0����}<��� 8�4�\����0����2�d������0�5d
�_C�����WB�q,���1"�x��������>��K-OMeFNP����ogN�O�p�����"
��b�7L����N�:�"������U\������ k�ya��������$��E���VxB��m��R��6�|jG*��`�w"�|��,�_S�3����n��v�����L���HES��� 2����~�_�vdLdFl��|���cwB'�^x��~G���p)=y�A�� rR���>X���S�17\'��yR��5�X��!O���bh�j^����_NJz����w�$���E�}�nOL�IFSE_�G%G�'���I��->��V�������D4��%.@�0�������.�s�l��Wtv�l�
���v1�XP��r�����T���g�X��:��_]�����B�-�!��������w�{+�V�c���Q�=f�u���zg+���A9^ s�<�Z���y/��R�?�����~���m
;D�
��,�Hx���j#�'I�L9p�� �[�J��9B�������w����tIi�5K��_�`��^�g�=P��}v��C�,�!�km�x
�@���e��P�)|l���&k�`���mLf;z {Y2�H�`d�0b�q]%b|�t��KpN�)�`��S(E#"{|���{{�-��[�AE����7FD|�� OYV��j�sF�=S[�PL��;P����f�z��m��sey#"e� �_R�jC��/��0hb���$i��V���������N3��
���p4����1��> �kk��������F�ze�)\�\/%A����qZ�r|���'��8�\��]���{�����W`f`�je�|`"K���E�h����9D[P�� :�;LX�J�l������P�ZWc�T�,6-��{59���������}8q^�D�����6�K��hSHZ��P���`[�'��d=Q'U�{2$p6d$�S�VQB���Ozb���&�l�_�iN�!�9� %��k���
�q��+#YW��A�4��t�J�J02�OfBU�E��9�]=4O}*���t���eD����D\��K�c�:�B�e��B��M�\L���f�]P?=3d�������o-��g{Qv�3��}�R��#���t���POe@"Bel���f��T;Lp����o���
\%��=G�f"�jA�l���Q�w�k4��/�LLhp���'�x��0�\���
~�gc1irn���~ �x�B�1��B����OeBt`|�����6]~����<�S���,�32y��@cc@�P�f5p"�F�~����^,9�`mJ#�&%sT�'6���� �p�i�)�(��Xo9�?S���d���lJe�M�"d�u��2���Fs{d���Ij�/!l�'�^=}Sf�y=�l�����
�PmQ��T���7�F^S c��������2pv���?C
_� ������wlARRm�v����&�[+�*�x,
��Ab��Bo�~��������$����:*��NdD*���� ^M��
3�2�W T[���a!�C�te��BS2��I��c��T')�jAF����������
Ad.���
�o���]t��%���3��!�^*Z��$�8���=���
�n9��:P�
~�o����L��� �`b����^34�T�$B04�N34���\=�3��Zg�h�`X������q��H�IY��nAY��!����L�����1{��^���d*�c��5��-sG����r��ZX�Q&���2i���$1��w��������k�T��Y��#�B�'O��� >�:�� �S�2�&�rh�s��Cg���!.����O�?�'���lY�a(�'��>�)(�L��2�SW�
1P�g�9ae_������L��S��I!c6��u���rQ48�O���~���*�����z\��?�u�b���w)!$n�'W��)����O���s��!R+}X�]�W��>�O)gO/i����%��C���� ��'9�X���,a�$+���8���i���?.-��s���Cj7�����=��f�����zZ@yRX��%�����K����ZP�
K
u���2��t�0�����%�g�������D�iRo������e��#�bh%�����a��`x����f3qN������o��`\/������s&� x��p��J�Bl}�/�yRy/W�Z@��e�VQ�����/�`#O��Aa���/��pEZ��m�Xl��3��u9���)������v��P��#�Pyo�$�Lk��b_�����/��E����z��r;�r���2�m���'�B��������]SuS�n>_�]�1P���~)�6K����V�zM��6T���/����pI���_)T�����Y;1���8&_m�r��� ��<lX�`G�;��r�����.�8��0�W�n�8�wY��'7@$F�[zrI&�Z�M�&F+������qG������N��T��qk�9U6�
���l�������i���%}o��pj�~�+t����3���W�Z�)��~TG�|78���~&8� 8��|��]�ynZ4�i�}��O�&cY���+b�k����>������
n�h�%$��a��:�����[w��NR�g*`3��U�Ax�����d)�r�V��^�1�
�N��T�������R�X!��`@����O�����Gl�Q��O���L
W��T@����3G�#9�����e�u�s� r�e|
K_m�@M��`z-���+U5<�C�0x)��@��^m[���-��4���|>�p��D<WmC�a�]�vc{�oW��2�����8���
��lH��L*��2&T<a�"�`X���S�������4N^��fn��-nn��`�U�#��p{[(G��83g���H��xX������m)�2:�L���A���\���az��<�W�eXXN1Fq}��i���IM�h~�MQ���8�Li�z�� *1�U"c�*�1�]�!���7i��d[��_��V���9���&uH*��OP�&��6�B���,�p��,�����`UQ`��|@�x�[9����@�T��������S6M��}E�� _���5FQj���Qj@k����W��TG�Y�}RT�7\��t`�t��AD�!W$�-��p�c�Ke_$��0se�!s0��gz<�ton�����*��n����.�|i&%�3�����BW�O�5��nG29`�^��[Q�|�8U��Y�^�3����\�y9'����FT�����������������nY�������NR����VVT� 'yy�Lx�-��F
�����Gs��9:&0T��&D���� �T��-��.��F�����$?�B�������r��#guH��J����/�V���\sk���1O�� !�q��R<N��cf��2��bqs���(�������p �L��J
��O��OH��AB�8�x���M��^��+)M����}������/�1��_570�ur��~��
["���vfHL/��^����mT�!���qP�� K�k(�5
[5.��e��i��U}�T�*u2r��TA��,����.w��`��� ���W���[m��JK�H-Cb� ������Fw*\���Z��j��������fOe����x>8��K�k#PY��?V��h���*��h�[V$�)xE��W<�+9���9z�e��,BT���F��g�����yA�J�����:� Q��{ �I=�������
����Y^�-��Uc����yIQ��m�`��'�����;u��l����B�m������3����bV�����;�&��L�z��W-jZ}J6�jPs!�L/?����CU'��V��&����&�$ /��/��wM,j-����^�����*X��It��%Q�^�j�����GnB�3[f��^�����#����Z��\��{>�,�LX�l���I�H�qe��-��~���
�U���0�r��
m�e�6d���)T�k�����������IR�{T%u!����%�\C�P81w��"T��4�O�C8N�R�.(b���J�55�Y�&AZi���F�j��Sz,��o�( 1��,H*JX�24XH&���+J�f\�Z��^j����hA��Y0�3�Sz����xP �`\sv�Q�#��]�#���f��$������6Qj<>U���h��_��9��-D��6�aC��#4����<��r�d��Q����)(6�}���y�bZ"�k��Tf�]SXU�l�F#GL(�_T �m���
���y,A%���"��p��_��4��[�@P�>�"�����H���%�*2c��dV��A�����F�8L9�O&S
�f������3G�7���Ksf��Y�#g��E�K���l��"JRM��FW��������n!�(z�YS'* � j�f��!��Q��{F���e��A�7�����<�����U���C5���ow��2mn�I'�w�=sU �dd54�(��EDB��x"]~�D�S+3G��N�g�e{�&sL#G����hx��U���2E���A�3�����W����t
�n����-� u��6���d�*ePiY�|} �n�m��r�����&z�g�A��XV�'�L}MQ������)Kv�
<��b�I�1�����
���W�l���������r����[I�P�w�}��O�X��G��7{���0J�������{@! �|�m�����Y�xd�=Z/����3v��]��`����Q0
_���)�
Z����ga��/i��V�Q���+T�9&