[RFC][PATCH] wal decoding, attempt #2

Started by Andres Freund over 13 years ago · 111 messages
#1 Andres Freund
andres@2ndquadrant.com

Hi,

It took me far longer than I planned, it's not finished, but time is running
out. I would like some feedback that I am not going astray at this point...
*I* think the general approach is sound and a good way forward that provides
the basic infrastructure for many (all?) of the scenarios we talked about
before.

Anyway, here is my next attempt at $TOPIC.

Let's start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;

SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;

INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);

commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;

INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
(somenum::int4);

INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');

WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
decode_xlog
-------------
t
(1 row)

As you can see, the patchset can decode several changes made to a table even
though we used DDL on it. Not everything is handled yet, but it's a prototype
after all ;)

The way this works is:

A new component called SnapshotBuilder analyzes the xlog and builds a special
kind of Snapshot. This works in a somewhat similar way to the
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls an 'apply_change' callback
for a single change (INSERT|UPDATE|DELETE), it locally overrides the normal
SnapshotNow semantics used for catalog access with one of the previously
built snapshots. Those should behave just the same as a normal SnapshotNow
would have behaved when the tuple change was written to the xlog.
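
To make that concrete, a consumer's callback could be shaped roughly like
this. This is a hypothetical sketch only - the type and field names here are
mine, not the patchset's actual API:

#include "postgres.h"
#include "access/htup.h"

/* hypothetical stand-ins for the patchset's real change representation */
typedef enum { CHANGE_INSERT, CHANGE_UPDATE, CHANGE_DELETE } ChangeAction;

typedef struct Change
{
	ChangeAction action;
	HeapTuple    newtuple;		/* new row version, for INSERT/UPDATE */
	HeapTuple    oldtuple;		/* old row version, for UPDATE/DELETE */
} Change;

static void
apply_change(Change *change)
{
	/*
	 * Any catalog lookup done in here - e.g. fetching the relation's tuple
	 * descriptor to print column names and types like the demo above does -
	 * sees the snapshot built for the LSN at which this change was logged,
	 * not the current state of the catalogs.
	 */
}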

This patch doesn't provide anything that uses the new infrastructure for
anything real, but I think that's good. Let's get this into something
committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread
[01]: Add embedded list interface (header only)

A new piece of infrastructure (for k-way mergesort), pretty much untested,
good idea in general I think, not very interesting:
[02]: Add minimal binary heap implementation

Boring, old:
[03]: Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04]: add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems
to earlier, not interesting at this point.
[05]: Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

Unchanged:
[06]: Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07]: Make InvalidateSystemCaches public

The major piece:
[08]: has loads of deficiencies. To cite the commit message:

The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distribution
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to
keep in the long run. I introduced it so we can concentrate on the topic at
hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to
avoid cluttering this one more than necessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal

I have some work on top of this that handles ComboCids and CommandIds
correctly (and thus mixed DDL/DML transactions), but it's simply not finished
enough. I am pretty sure by now that it works even with those additional
complexities.

So, I am unfortunately too tired to write more than this... It will have to
suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the
implementation? Questions?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#2 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 1/8] Add embedded list interface (header only)

Adds a singly and a doubly linked list which can easily be embedded into other
data structures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall
back to externally defined functions if that is not available, but that hardly
seems sensible in this day and age. Besides, the speed hit would be noticeable,
and it's only used in new code, which could be disabled on machines - given
they still exist - without proper support for inline functions.
---
src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 253 insertions(+)
create mode 100644 src/include/utils/ilist.h
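
For illustration, embedding and walking such a list looks roughly like this
(a minimal sketch against the header in this patch; MyItem and sum_items are
made-up names):

#include "postgres.h"
#include "utils/ilist.h"

typedef struct MyItem
{
	int          value;
	ilist_d_node node;		/* embedded - no separate allocation needed */
} MyItem;

static int
sum_items(ilist_d_head *head)
{
	ilist_d_node *cur;
	int          sum = 0;

	/* walk the embedded nodes, mapping each back to its containing struct */
	ilist_d_foreach(cur, head)
		sum += ilist_container(MyItem, node, cur)->value;

	return sum;
}

Insertion is just ilist_d_init(&head) once, followed by
ilist_d_push_back(&head, &item->node) per item.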

Attachments:

0001-Add-embedded-list-interface-header-only.patch (text/x-patch)
diff --git a/src/include/utils/ilist.h b/src/include/utils/ilist.h
new file mode 100644
index 0000000..03dae63
--- /dev/null
+++ b/src/include/utils/ilist.h
@@ -0,0 +1,253 @@
+#ifndef ILIST_H
+#define ILIST_H
+
+#ifdef __GNUC__
+#define unused_attr __attribute__((unused))
+#else
+#define unused_attr
+#endif
+
+#ifndef USE_INLINE
+#error "a compiler supporting static inlines is required"
+#endif
+
+#include <assert.h>
+
+typedef struct ilist_d_node ilist_d_node;
+
+struct ilist_d_node
+{
+	ilist_d_node* prev;
+	ilist_d_node* next;
+};
+
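+/*
+ * Head of a doubly linked list. ilist_d_init() points both links of the
+ * embedded sentinel node back at itself, making the list circular, so
+ * insertion and removal never need to special-case NULL pointers.
+ */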
+typedef struct
+{
+	ilist_d_node head;
+} ilist_d_head;
+
+typedef struct ilist_s_node ilist_s_node;
+
+struct ilist_s_node
+{
+	ilist_s_node* next;
+};
+
+typedef struct
+{
+	ilist_s_node head;
+} ilist_s_head;
+
+#ifdef ILIST_DEBUG
+void ilist_d_check(ilist_d_head* head);
+#else
+static inline void ilist_d_check(ilist_d_head* head)
+{
+}
+#endif
+
+static inline void ilist_d_init(ilist_d_head *head)
+{
+	head->head.next = head->head.prev = &head->head;
+	ilist_d_check(head);
+}
+
+/*
+ * adds a node at the beginning of the list
+ */
+static inline void ilist_d_push_front(ilist_d_head *head, ilist_d_node *node)
+{
+	node->next = head->head.next;
+	node->prev = &head->head;
+	node->next->prev = node;
+	head->head.next = node;
+	ilist_d_check(head);
+}
+
+
+/*
+ * adds a node at the end of the list
+ */
+static inline void ilist_d_push_back(ilist_d_head *head, ilist_d_node *node)
+{
+	node->next = &head->head;
+	node->prev = head->head.prev;
+	node->prev->next = node;
+	head->head.prev = node;
+	ilist_d_check(head);
+}
+
+
+/*
+ * adds a node after another *in the same list*
+ */
+static inline void ilist_d_add_after(unused_attr ilist_d_head *head, ilist_d_node *after, ilist_d_node *node)
+{
+	node->prev = after;
+	node->next = after->next;
+	after->next = node;
+	node->next->prev = node;
+	ilist_d_check(head);
+}
+
+/*
+ * adds a node before another *in the same list*
+ */
+static inline void ilist_d_add_before(unused_attr ilist_d_head *head, ilist_d_node *before, ilist_d_node *node)
+{
+	node->prev = before->prev;
+	node->next = before;
+	before->prev = node;
+	node->prev->next = node;
+	ilist_d_check(head);
+}
+
+
+/*
+ * removes a node from a list
+ */
+static inline void ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node)
+{
+	ilist_d_check(head);
+	node->prev->next = node->next;
+	node->next->prev = node->prev;
+	ilist_d_check(head);
+}
+
+/*
+ * removes the first node from a list or returns NULL
+ */
+static inline ilist_d_node* ilist_d_pop_front(ilist_d_head *head)
+{
+	ilist_d_node* ret;
+
+	if (&head->head == head->head.next)
+		return NULL;
+
+	ret = head->head.next;
+	ilist_d_remove(head, head->head.next);
+	return ret;
+}
+
+
+static inline bool ilist_d_has_next(ilist_d_head *head, ilist_d_node *node)
+{
+	return node->next != &head->head;
+}
+
+static inline bool ilist_d_has_prev(ilist_d_head *head, ilist_d_node *node)
+{
+	return node->prev != &head->head;
+}
+
+static inline bool ilist_d_is_empty(ilist_d_head *head)
+{
+	return head->head.next == &head->head;
+}
+
+#define ilist_d_front(type, membername, ptr) ((&((ptr)->head) == (ptr)->head.next) ? \
+	NULL : ilist_container(type, membername, (ptr)->head.next))
+
+#define ilist_d_front_unchecked(type, membername, ptr) ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_d_back(type, membername, ptr)  ((&((ptr)->head) == (ptr)->head.prev) ? \
+	NULL : ilist_container(type, membername, (ptr)->head.prev))
+
+#define ilist_container(type, membername, ptr) ((type*)((char*)(ptr) - offsetof(type, membername)))
+
+#define ilist_d_foreach(name, ptr) for(name = (ptr)->head.next;	\
+                                     name != &(ptr)->head;	\
+                                     name = name->next)
+
+#define ilist_d_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next,    \
+	                                                   nxt = name->next;       \
+                                                   name != &(ptr)->head        \
+	                                                   ;                       \
+                                                   name = nxt, nxt = name->next)
+
+static inline void ilist_s_init(ilist_s_head *head)
+{
+	head->head.next = NULL;
+}
+
+static inline void ilist_s_push_front(ilist_s_head *head, ilist_s_node *node)
+{
+	node->next = head->head.next;
+	head->head.next = node;
+}
+
+/*
+ * fails if the list is empty
+ */
+static inline ilist_s_node* ilist_s_pop_front(ilist_s_head *head)
+{
+	ilist_s_node* front = head->head.next;
+	head->head.next = head->head.next->next;
+	return front;
+}
+
+/*
+ * removes a node from a list
+ * Attention: O(n)
+ */
+static inline void ilist_s_remove(ilist_s_head *head,
+                                  ilist_s_node *node)
+{
+	ilist_s_node *last = &head->head;
+	ilist_s_node *cur;
+#ifndef NDEBUG
+	bool found = false;
+#endif
+	while ((cur = last->next))
+	{
+		if (cur == node)
+		{
+			last->next = cur->next;
+#ifndef NDEBUG
+			found = true;
+#endif
+			break;
+		}
+		last = cur;
+	}
+	assert(found);
+}
+
+
+static inline void ilist_s_add_after(unused_attr ilist_s_head *head,
+                                     ilist_s_node *after, ilist_s_node *node)
+{
+	node->next = after->next;
+	after->next = node;
+}
+
+
+static inline bool ilist_s_is_empty(ilist_s_head *head)
+{
+	return head->head.next == NULL;
+}
+
+static inline bool ilist_s_has_next(unused_attr ilist_s_head* head,
+                                    ilist_s_node *node)
+{
+	return node->next != NULL;
+}
+
+
+#define ilist_s_front(type, membername, ptr) (ilist_s_is_empty(ptr) ? \
+	NULL : ilist_container(type, membername, (ptr)->head.next))
+
+#define ilist_s_front_unchecked(type, membername, ptr) \
+	ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_s_foreach(name, ptr) for(name = (ptr)->head.next;         \
+                                       name != NULL;                    \
+                                       name = name->next)
+
+#define ilist_s_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next, \
+	                                                   nxt = name ? name->next : NULL; \
+                                                   name != NULL;            \
+                                                   name = nxt, nxt = name ? name->next : NULL)
+
+
+#endif
#3 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 2/8] Add minimal binary heap implementation

This is basically untested.
---
src/backend/lib/Makefile | 2 +-
src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
src/include/lib/simpleheap.h | 91 +++++++++++++++
3 files changed, 347 insertions(+), 1 deletion(-)
create mode 100644 src/backend/lib/simpleheap.c
create mode 100644 src/include/lib/simpleheap.h
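
A minimal usage sketch (compare_int_keys and example are made-up names; the
interface is the one in this patch):

#include "postgres.h"
#include "lib/simpleheap.h"

/* order keys pointing at ints; must return -1/0/+1 per the contract */
static int
compare_int_keys(simpleheap_kv *a, simpleheap_kv *b)
{
	int ia = *(int *) a->key;
	int ib = *(int *) b->key;

	return (ia < ib) ? -1 : (ia > ib) ? 1 : 0;
}

static void
example(void)
{
	static int keys[] = {3, 1, 2};
	simpleheap *heap = simpleheap_allocate(4);
	simpleheap_kv *kv;

	heap->compare = compare_int_keys;
	simpleheap_add_unordered(heap, &keys[0], NULL);
	simpleheap_add_unordered(heap, &keys[1], NULL);
	simpleheap_add_unordered(heap, &keys[2], NULL);
	simpleheap_build(heap);			/* O(n) heapify */

	kv = simpleheap_remove_first(heap);	/* smallest key, here 1 */
	elog(LOG, "min key is %d", *(int *) kv->key);
	simpleheap_free(heap);
}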

Attachments:

0002-Add-minimal-binary-heap-implementation.patch (text/x-patch)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 2e1061e..1e1bd5c 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dllist.o stringinfo.o
+OBJS = dllist.o simpleheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/simpleheap.c b/src/backend/lib/simpleheap.c
new file mode 100644
index 0000000..825d0a8
--- /dev/null
+++ b/src/backend/lib/simpleheap.c
@@ -0,0 +1,255 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleheap.c
+ *	  A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/simpleheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <math.h>
+
+#include "lib/simpleheap.h"
+
+static inline int
+simpleheap_left_off(size_t i)
+{
+	return 2 * i + 1;
+}
+
+static inline int
+simpleheap_right_off(size_t i)
+{
+	return 2 * i + 2;
+}
+
+static inline int
+simpleheap_parent_off(size_t i)
+{
+	return floor((i - 1) / 2);
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off);
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off);
+
+static inline void
+simpleheap_swap(simpleheap *heap, size_t a, size_t b)
+{
+	simpleheap_kv swap;
+	swap.value = heap->values[a].value;
+	swap.key = heap->values[a].key;
+
+	heap->values[a].value = heap->values[b].value;
+	heap->values[a].key = heap->values[b].key;
+
+	heap->values[b].key = swap.key;
+	heap->values[b].value = swap.value;
+}
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion */
+	while (true)
+	{
+		size_t left_off = simpleheap_left_off(node_off);
+		size_t right_off = simpleheap_right_off(node_off);
+		size_t swap_off = 0;
+
+		/* only one child can violate the heap property after a change */
+
+		/* check left child */
+		if (left_off < heap->size &&
+		    heap->compare(&heap->values[left_off],
+		                  &heap->values[node_off]) < 0)
+		{
+			/* heap condition violated */
+			swap_off = left_off;
+		}
+
+		/* check right child */
+		if (right_off < heap->size &&
+		    heap->compare(&heap->values[right_off],
+		                  &heap->values[node_off]) < 0)
+		{
+			/* heap condition violated */
+
+			/* swap with the smaller child */
+			if (!swap_off ||
+			    heap->compare(&heap->values[right_off],
+			                  &heap->values[left_off]) < 0)
+			{
+				swap_off = right_off;
+			}
+		}
+
+		if (!swap_off)
+		{
+			/* heap condition fulfilled, abort */
+			break;
+		}
+
+		/* swap node with the child violating the property */
+		simpleheap_swap(heap, swap_off, node_off);
+
+		/* recurse, check child subtree */
+		node_off = swap_off;
+	}
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off)
+{
+	/* manually unrolled tail recursion; stop once we reach the root */
+	while (node_off != 0)
+	{
+		size_t parent_off = simpleheap_parent_off(node_off);
+
+		if (heap->compare(&heap->values[parent_off],
+		                  &heap->values[node_off]) < 0)
+		{
+			/* heap property violated */
+			simpleheap_swap(heap, node_off, parent_off);
+
+			/* recurse */
+			node_off = parent_off;
+		}
+		else
+			break;
+	}
+}
+
+simpleheap*
+simpleheap_allocate(size_t allocate)
+{
+	simpleheap* heap = palloc(sizeof(simpleheap));
+	heap->values = palloc(sizeof(simpleheap_kv) * allocate);
+	heap->size = 0;
+	heap->space = allocate;
+	return heap;
+}
+
+void
+simpleheap_free(simpleheap* heap)
+{
+	pfree(heap->values);
+	pfree(heap);
+}
+
+/* initial building of a heap */
+void
+simpleheap_build(simpleheap *heap)
+{
+	int i;
+
+	/*
+	 * empty and single-element heaps are trivially valid; bailing out also
+	 * avoids the size_t underflow of heap->size - 1 below
+	 */
+	if (heap->size < 2)
+		return;
+
+	for (i = simpleheap_parent_off(heap->size - 1); i >= 0; i--)
+	{
+		simpleheap_sift_down(heap, i);
+	}
+}
+
+/*
+ * Change the key of the first (smallest) element and move it to its
+ * correct position in the heap.
+ */
+void
+simpleheap_change_key(simpleheap *heap, void* key)
+{
+	size_t next_off = 0;
+	int ret;
+	simpleheap_kv* kv;
+
+	heap->values[0].key = key;
+
+	/* no need to do anything if there is only one element */
+	if (heap->size == 1)
+	{
+		return;
+	}
+	else if (heap->size == 2)
+	{
+		next_off = 1;
+	}
+	else
+	{
+		ret = heap->compare(
+			&heap->values[simpleheap_left_off(0)],
+			&heap->values[simpleheap_right_off(0)]);
+
+		if (ret == -1)
+			next_off = simpleheap_left_off(0);
+		else
+			next_off = simpleheap_right_off(0);
+	}
+
+	/*
+	 * compare with the next key. If were still smaller we can skip
+	 * restructuring heap
+	 */
+	ret = heap->compare(
+		&heap->values[0],
+		&heap->values[next_off]);
+
+	if (ret == -1)
+		return;
+
+	kv = simpleheap_remove_first(heap);
+	simpleheap_add(heap, kv->key, kv->value);
+}
+
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value)
+{
+	if (heap->size >= heap->space)
+		Assert(!"Cannot resize heaps");
+	heap->values[heap->size].key = key;
+	heap->values[heap->size++].value = value;
+}
+
+void
+simpleheap_add(simpleheap* heap, void *key, void *value)
+{
+	simpleheap_add_unordered(heap, key, value);
+	simpleheap_sift_up(heap, heap->size - 1);
+}
+
+simpleheap_kv*
+simpleheap_first(simpleheap* heap)
+{
+	if (!heap->size)
+		Assert(!"heap is empty");
+	return &heap->values[0];
+}
+
+
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap)
+{
+	if (heap->size == 0)
+		Assert(!"heap is empty");
+
+	if (heap->size == 1)
+	{
+		heap->size--;
+		return &heap->values[0];
+	}
+
+	simpleheap_swap(heap, 0, heap->size - 1);
+	simpleheap_sift_down(heap, 0);
+
+	heap->size--;
+	return &heap->values[heap->size];
+}
diff --git a/src/include/lib/simpleheap.h b/src/include/lib/simpleheap.h
new file mode 100644
index 0000000..ab2d2ea
--- /dev/null
+++ b/src/include/lib/simpleheap.h
@@ -0,0 +1,91 @@
+/*
+ * simpleheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/lib/simpleheap.h
+ */
+
+#ifndef SIMPLEHEAP_H
+#define SIMPLEHEAP_H
+
+typedef struct simpleheap_kv
+{
+	void* key;
+	void* value;
+} simpleheap_kv;
+
+typedef struct simpleheap
+{
+	size_t size;
+	size_t space;
+	/*
+	 * Has to return:
+	 * -1 iff a < b
+	 * 0 iff a == b
+	 * +1 iff a > b
+	 */
+	int (*compare)(simpleheap_kv* a, simpleheap_kv* b);
+
+	simpleheap_kv *values;
+} simpleheap;
+
+simpleheap*
+simpleheap_allocate(size_t capacity);
+
+void
+simpleheap_free(simpleheap* heap);
+
+/*
+ * Add values without enforcing the heap property.
+ *
+ * simpleheap_build has to be called before relying on anything that needs a
+ * valid heap. This is mostly useful for initially filling a heap and staying
+ * in O(n) instead of O(n log n).
+ */
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value);
+
+/*
+ * Insert key/value pair
+ *
+ * O(log n)
+ */
+void
+simpleheap_add(simpleheap* heap, void *key, void *value);
+
+/*
+ * Returns the first element as indicated by comparisons of the ->compare()
+ * operator
+ *
+ * O(1)
+ */
+simpleheap_kv*
+simpleheap_first(simpleheap* heap);
+
+/*
+ * Returns and removes the first element as indicated by comparisons of the
+ * ->compare() operator
+ *
+ * O(log n)
+ */
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap);
+
+void
+simpleheap_change_key(simpleheap *heap, void* newkey);
+
+
+/*
+ * make the heap fulfill the heap condition. Only needed if elements were
+ * added with simpleheap_add_unordered()
+ *
+ * O(n)
+ */
+void
+simpleheap_build(simpleheap *heap);
+
+
+#endif   /* SIMPLEHEAP_H */
#4 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code, which doesn't seem to be
a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
src/include/access/xlogreader.h | 264 ++++++++
3 files changed, 1297 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogreader.c
create mode 100644 src/include/access/xlogreader.h
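
For context, a minimal consumer might look roughly like this. This is a
sketch only - my_read_page is a placeholder for a callback that fills
cur_page with the xlog page at the requested location:

#include "postgres.h"
#include "access/xlogreader.h"

/* hypothetical page-read callback, e.g. reading from pg_xlog segment files */
static void my_read_page(XLogReaderState *state, char *cur_page, XLogRecPtr at);

static void
count_records(XLogRecPtr from, XLogRecPtr to)
{
	XLogReaderState *state = XLogReaderAllocate();
	XLogRecordBuffer *buf;
	int nrecords = 0;

	state->read_page = my_read_page;
	state->startptr = from;
	state->endptr = to;

	/* reassemble one record per call until we run out of valid input */
	while ((buf = XLogReaderReadOne(state)) != NULL)
		nrecords++;

	elog(LOG, "read %d records", nrecords);
	XLogReaderFree(state);
}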

Attachments:

0003-Add-support-for-a-generic-wal-reading-facility-dubbe.patch (text/x-patch)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..4392b29
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,1032 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ *		Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ *		Documentation about how to use this interface can be found in
+ *		xlogreader.h, more specifically in the definition of the
+ *		XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * more extensive validation of read records
+ * * separation of reader/writer
+ * * customizable error response
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog_internal.h"
+#include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "access/xlogreader.h"
+
+/* If (very) verbose debugging is needed:
+ * #define VERBOSE_DEBUG
+ */
+
+XLogReaderState*
+XLogReaderAllocate(void)
+{
+	XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+	int i;
+
+	if (!state)
+		goto oom;
+
+	memset(&state->buf.record, 0, sizeof(XLogRecord));
+	state->buf.record_data_size = XLOG_BLCKSZ*8;
+	state->buf.record_data =
+			malloc(state->buf.record_data_size);
+
+	if (!state->buf.record_data)
+		goto oom;
+
+	memset(state->buf.record_data, 0, state->buf.record_data_size);
+	state->buf.origptr = InvalidXLogRecPtr;
+
+	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+	{
+		state->buf.bkp_block_data[i] =
+			malloc(BLCKSZ);
+
+		if (!state->buf.bkp_block_data[i])
+			goto oom;
+	}
+
+	state->is_record_interesting = NULL;
+	state->writeout_data = NULL;
+	state->finished_record = NULL;
+	state->private_data = NULL;
+	state->output_buffer_size = 0;
+
+	XLogReaderReset(state);
+	return state;
+
+oom:
+	ereport(ERROR,
+	        (errcode(ERRCODE_OUT_OF_MEMORY),
+	         errmsg("out of memory"),
+	         errdetail("failed while allocating an XLogReader")));
+	return NULL;
+}
+
+void
+XLogReaderFree(XLogReaderState* state)
+{
+	int i;
+
+	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+	{
+		free(state->buf.bkp_block_data[i]);
+	}
+
+	free(state->buf.record_data);
+
+	free(state);
+}
+
+void
+XLogReaderReset(XLogReaderState* state)
+{
+	state->in_record = false;
+	state->in_record_header = false;
+	state->do_reassemble_record = false;
+	state->in_bkp_blocks = 0;
+	state->in_bkp_block_header = false;
+	state->in_skip = false;
+	state->remaining_size = 0;
+	state->already_written_size = 0;
+	state->incomplete = false;
+	state->initialized = false;
+	state->needs_input = false;
+	state->needs_output = false;
+	state->stop_at_record_boundary = false;
+}
+
+static inline bool
+XLogReaderHasInput(XLogReaderState* state, Size size)
+{
+	XLogRecPtr tmp = state->curptr;
+	XLByteAdvance(tmp, size);
+	if (XLByteLE(state->endptr, tmp))
+		return false;
+	return true;
+}
+
+static inline bool
+XLogReaderHasOutput(XLogReaderState* state, Size size)
+{
+	/* if we don't do output or have no limits in the output size */
+	if (state->writeout_data == NULL || state->output_buffer_size == 0)
+		return true;
+
+	if (state->already_written_size + size > state->output_buffer_size)
+		return false;
+
+	return true;
+}
+
+static inline bool
+XLogReaderHasSpace(XLogReaderState* state, Size size)
+{
+	if (!XLogReaderHasInput(state, size))
+		return false;
+
+	if (!XLogReaderHasOutput(state, size))
+		return false;
+
+	return true;
+}
+
+/* ----------------------------------------------------------------------------
+ * Write out data iff
+ * 1. we have a writeout_data callback
+ * 2. we're currently at or past startptr
+ *
+ * The 2nd condition requires that we will never start a write before startptr
+ * and finish after it. The code needs to guarantee this.
+ * ----------------------------------------------------------------------------
+ */
+static void
+XLogReaderInternalWrite(XLogReaderState* state, char* data, Size size)
+{
+	/* no point in doing any checks if we don't have a write callback */
+	if (!state->writeout_data)
+		return;
+
+	if (XLByteLT(state->curptr, state->startptr))
+		return;
+
+	state->writeout_data(state, data, size);
+}
+
+/*
+ * Change state so we read the next bkp block if there is one. If there is none
+ * return false so that the caller can consider the record finished.
+ */
+static bool
+XLogReaderInternalNextBkpBlock(XLogReaderState* state)
+{
+	Assert(state->in_record);
+	Assert(state->remaining_size == 0);
+
+	/*
+	 * only continue with in_record=true if we have a bkp block
+	 */
+	while (state->in_bkp_blocks)
+	{
+		if (state->buf.record.xl_info &
+		    XLR_SET_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks))
+		{
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "reading bkp block %u", XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks);
+#endif
+			break;
+		}
+		state->in_bkp_blocks--;
+	}
+
+	if (!state->in_bkp_blocks)
+		return false;
+
+	/* bkp blocks are stored without regard for alignment */
+
+	state->in_bkp_block_header = true;
+	state->remaining_size = sizeof(BkpBlock);
+
+	return true;
+}
+
+void
+XLogReaderRead(XLogReaderState* state)
+{
+	state->needs_input = false;
+	state->needs_output = false;
+
+	/*
+	 * Do some basic sanity checking and setup if were starting anew.
+	 */
+	if (!state->initialized)
+	{
+		if (!state->read_page)
+			elog(ERROR, "The read_page callback needs to be set");
+
+		state->initialized = true;
+		/*
+		 * we need to start reading at the beginning of the page to understand
+		 * what we are currently reading. We will skip over that because we
+		 * check curptr < startptr later.
+		 */
+		state->curptr = state->startptr;
+		state->curptr -= state->startptr % XLOG_BLCKSZ;
+
+		Assert(state->curptr % XLOG_BLCKSZ == 0);
+
+		elog(LOG, "start reading from %X/%X, scrolled back to %X/%X",
+		     (uint32) (state->startptr >> 32), (uint32) state->startptr,
+		     (uint32) (state->curptr >> 32), (uint32) state->curptr);
+	}
+	else
+	{
+		/*
+		 * We didn't finish reading the last time round. Since then new data
+		 * could have been appended to the current page. So we need to update
+		 * our copy of that.
+		 *
+		 * XXX: We could tie that to state->needs_input but that doesn't seem
+		 * worth the complication atm.
+		 */
+		XLogRecPtr rereadptr = state->curptr;
+		rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+		XLByteAdvance(rereadptr, SizeOfXLogShortPHD);
+
+		if(!XLByteLE(rereadptr, state->endptr))
+			goto not_enough_input;
+
+		rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+		state->read_page(state, state->cur_page, rereadptr);
+
+		/*
+		 * we will only rely on this data being valid if we are allowed to read
+		 * that far, so its safe to just always read the header. read_page
+		 * that far, so it's safe to just always read the header. read_page
+		 * invalid.
+		 */
+		state->page_header = (XLogPageHeader)state->cur_page;
+		state->page_header_size = XLogPageHeaderSize(state->page_header);
+	}
+
+#ifdef VERBOSE_DEBUG
+	elog(LOG, "starting reading for %X/%X from %X/%X",
+	     (uint32)(state->startptr >> 32), (uint32) state->startptr,
+	     (uint32)(state->curptr >> 32), (uint32) state->curptr);
+#endif
+	/*
+	 * Iterate over the data and reassemble it until we reached the end of the
+	 * data. As we advance curptr inside the loop we need to recheck whether we
+	 * have space inside as well.
+	 */
+	while (XLByteLT(state->curptr, state->endptr))
+	{
+		/* how much space is left in the current block */
+		uint32 len_in_block;
+
+		/*
+		 * did we read a partial xlog record due to input/output constraints?
+		 * If yes, we need to signal that to the caller so it can be handled
+		 * sensibly there. E.g. by waiting on a latch till more xlog is
+		 * available.
+		 */
+		bool partial_read = false;
+		bool partial_write = false;
+
+#ifdef VERBOSE_DEBUG
+		elog(LOG, "one loop start: record: %u header %u, skip: %u bkb_block: %d in_bkp_header: %u curptr: %X/%X remaining: %u, off: %u",
+		     state->in_record, state->in_record_header, state->in_skip,
+		     state->in_bkp_blocks, state->in_bkp_block_header,
+		     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+		     state->remaining_size,
+		     (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+
+		/*
+		 * at a page boundary, read the header
+		 */
+		if (state->curptr % XLOG_BLCKSZ == 0)
+		{
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "reading page header, at %X/%X",
+			     (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+			/*
+			 * check whether we can read enough to see the short header, we
+			 * need to read the short header's xlp_info to know whether this is
+			 * a short or a long header.
+			 */
+			if (!XLogReaderHasInput(state, SizeOfXLogShortPHD))
+				goto not_enough_input;
+
+			state->read_page(state, state->cur_page, state->curptr);
+			state->page_header = (XLogPageHeader)state->cur_page;
+			state->page_header_size = XLogPageHeaderSize(state->page_header);
+
+			/* check that we have enough space to read/write the full header */
+			if (!XLogReaderHasInput(state, state->page_header_size))
+				goto not_enough_input;
+
+			if (!XLogReaderHasOutput(state, state->page_header_size))
+				goto not_enough_output;
+
+			XLogReaderInternalWrite(state, state->cur_page, state->page_header_size);
+
+			XLByteAdvance(state->curptr, state->page_header_size);
+
+			if (state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+			{
+				if (!state->in_record)
+				{
+					/*
+					 * we need to support this case for initializing a cluster
+					 * because we need to read/writeout a full page but there
+					 * may be none without records being split across.
+					 *
+					 * If we are before startptr there is nothing special about
+					 * this case. Most pages start with a contrecord.
+					 */
+					if(!XLByteLT(state->curptr, state->startptr))
+					{
+						elog(WARNING, "contrecord although we are not in a record at %X/%X, starting at %X/%X",
+						     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+						     (uint32)(state->startptr >> 32), (uint32)state->startptr);
+					}
+					state->in_record = true;
+					state->check_crc = false;
+					state->do_reassemble_record = false;
+					state->remaining_size = state->page_header->xlp_rem_len;
+					continue;
+				}
+				else
+				{
+					if (state->page_header->xlp_rem_len < state->remaining_size)
+						elog(PANIC, "remaining length is smaller than to be read data. xlp_rem_len: %u needed: %u",
+						     state->page_header->xlp_rem_len, state->remaining_size
+							);
+				}
+			}
+			else if (state->in_record)
+			{
+				elog(PANIC, "no contrecord although we're in a record that continued onto the next page. info %hhu at page %X/%X",
+				     state->page_header->xlp_info,
+				     (uint32)(state->page_header->xlp_pageaddr >> 32),
+				     (uint32)state->page_header->xlp_pageaddr);
+			}
+		}
+
+		/*
+		 * If a record will start next, skip over alignment padding.
+		 */
+		if (!state->in_record)
+		{
+			/*
+			 * a record must be stored aligned. So skip as far we need to
+			 * comply with that.
+			 */
+			Size skiplen;
+			skiplen = MAXALIGN(state->curptr) - state->curptr;
+
+			if (skiplen)
+			{
+				if (!XLogReaderHasSpace(state, skiplen))
+				{
+#ifdef VERBOSE_DEBUG
+					elog(LOG, "not aligning bc of space");
+#endif
+					/*
+					 * We don't have enough space to read/write the alignment
+					 * bytes, so fake up a skip-state
+					 */
+					state->in_record = true;
+					state->check_crc = false;
+					state->in_skip = true;
+					state->remaining_size = skiplen;
+
+					if (!XLogReaderHasInput(state, skiplen))
+						goto not_enough_input;
+					goto not_enough_output;
+				}
+#ifdef VERBOSE_DEBUG
+				elog(LOG, "aligning from %X/%X to %X/%X, skips %lu",
+				     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+				     (uint32)((state->curptr + skiplen) >> 32),
+				     (uint32)(state->curptr + skiplen),
+				     skiplen
+					);
+#endif
+				XLogReaderInternalWrite(state, NULL, skiplen);
+
+				XLByteAdvance(state->curptr, skiplen);
+
+				/*
+				 * full pages are not treated as continuations, so restart on
+				 * the beginning of the new page.
+				 */
+				if ((state->curptr % XLOG_BLCKSZ) == 0)
+					continue;
+			}
+		}
+
+		/*
+		 * --------------------------------------------------------------------
+		 * Start to read a record
+		 * --------------------------------------------------------------------
+		 */
+		if (!state->in_record)
+		{
+			state->in_record = true;
+			state->in_record_header = true;
+			state->check_crc = true;
+
+			/*
+			 * If the record starts before startptr we're not interested in its
+			 * contents. There is also no point in reassembling if we're not
+			 * analyzing the contents.
+			 *
+			 * If every record needs to be processed by finish_record restarts
+			 * need to be started after the end of the last record.
+			 *
+			 * See state->restart_ptr for that point.
+			 */
+			if ((state->finished_record == NULL &&
+			     !state->stop_at_record_boundary) ||
+				XLByteLT(state->curptr, state->startptr)){
+				state->do_reassemble_record = false;
+			}
+			else
+				state->do_reassemble_record = true;
+
+			state->remaining_size = SizeOfXLogRecord;
+
+			/*
+			 * we quickly lose the original address of a record as we can skip
+			 * records and such, so keep the original addresses.
+			 */
+			state->buf.origptr = state->curptr;
+
+			INIT_CRC32(state->next_crc);
+		}
+
+		Assert(state->in_record);
+
+		/*
+		 * Compute how much space on the current page is left and how much of
+		 * that we actually are interested in.
+		 */
+
+		/* amount of space on page */
+		if (state->curptr % XLOG_BLCKSZ == 0)
+			len_in_block = 0;
+		else
+			len_in_block = XLOG_BLCKSZ - (state->curptr % XLOG_BLCKSZ);
+
+		/* we have more data available than we need, so read only as much as needed */
+		if (len_in_block > state->remaining_size)
+			len_in_block = state->remaining_size;
+
+		/*
+		 * Handle constraints set by startptr, endptr and the size of the
+		 * output buffer.
+		 *
+		 * Normally we use XLogReaderHasSpace for that, but that's not
+		 * convenient here because we want to read data in parts. It also
+		 * doesn't handle splitting around startptr. So, open-code the logic
+		 * for that.
+		 */
+
+		/* to make sure we always writeout in the same chunks, split at startptr */
+		if (XLByteLT(state->curptr, state->startptr) &&
+		    (state->curptr + len_in_block) > state->startptr )
+		{
+#ifdef VERBOSE_DEBUG
+			Size cur_len = len_in_block;
+#endif
+			len_in_block = state->startptr - state->curptr;
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "truncating len_in_block due to startptr from %lu to %u",
+			     cur_len, len_in_block);
+#endif
+		}
+
+		/* do we have enough valid data to read the current block? */
+		if (state->curptr + len_in_block > state->endptr)
+		{
+#ifdef VERBOSE_DEBUG
+			Size cur_len = len_in_block;
+#endif
+			len_in_block = state->endptr - state->curptr;
+			partial_read = true;
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "truncating len_in_block due to endptr %X/%X %lu to %i at %X/%X",
+			     (uint32)(state->startptr >> 32), (uint32)state->startptr,
+			     cur_len, len_in_block,
+			     (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+		}
+
+		/* can we write what we read? */
+		if (state->writeout_data != NULL && state->output_buffer_size != 0
+				&& len_in_block > (state->output_buffer_size - state->already_written_size))
+		{
+#ifdef VERBOSE_DEBUG
+			Size cur_len = len_in_block;
+#endif
+			len_in_block = state->output_buffer_size - state->already_written_size;
+			partial_write = true;
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "truncating len_in_block due to output_buffer_size %lu to %i",
+			     cur_len, len_in_block);
+#endif
+		}
+
+		/* --------------------------------------------------------------------
+		 * copy data of the size determined above to whatever we are currently
+		 * reading.
+		 * --------------------------------------------------------------------
+		 */
+
+		/* nothing to do if were skipping */
+		if (state->in_skip)
+		{
+			/* writeout zero data, original content is boring */
+			XLogReaderInternalWrite(state, NULL, len_in_block);
+
+			/*
+			 * we may not need this here because were skipping over something
+			 * we may not need this here because we're skipping over something
+			 * unnecessarily complicated.
+			 */
+			COMP_CRC32(state->next_crc,
+			           state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			           len_in_block);
+		}
+		/* reassemble the XLogRecord struct, quite likely in one-go */
+		else if (state->in_record_header)
+		{
+			/*
+			 * Need to clamp to sizeof(XLogRecord); we don't have the padding
+			 * in buf.record...
+			 */
+			Size already_written = SizeOfXLogRecord - state->remaining_size;
+			Size padding_size = SizeOfXLogRecord - sizeof(XLogRecord);
+			Size copysize = len_in_block;
+
+			if (state->remaining_size - len_in_block < padding_size)
+				copysize = Max(0, state->remaining_size - (int)padding_size);
+
+			memcpy((char*)&state->buf.record + already_written,
+			       state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			       copysize);
+
+			XLogReaderInternalWrite(state,
+			                        state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			                        len_in_block);
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "copied part of the record. len_in_block %u, remaining: %u",
+			     len_in_block, state->remaining_size);
+#endif
+		}
+		/*
+		 * copy data into the current backup block header so we have enough
+		 * knowledge to read the actual backup block afterwards
+		 */
+		else if (state->in_bkp_block_header)
+		{
+			int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+			BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+
+			Assert(state->in_bkp_blocks);
+
+			memcpy((char*)bkpb + sizeof(BkpBlock) - state->remaining_size,
+			       state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			       len_in_block);
+
+			XLogReaderInternalWrite(state,
+			                        state->cur_page + ((uint32)state->curptr % XLOG_BLCKSZ),
+			                        len_in_block);
+
+			COMP_CRC32(state->next_crc,
+			           state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			           len_in_block);
+
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "copying bkp header for block %d, %u bytes, complete %lu at %X/%X rem %u",
+			     blockno, len_in_block, sizeof(BkpBlock),
+			     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+			     state->remaining_size);
+
+			if (state->remaining_size == len_in_block)
+			{
+				elog(LOG, "block off %u len %u", bkpb->hole_offset, bkpb->hole_length);
+			}
+#endif
+		}
+		/*
+		 * Reassemble the current backup block, those usually are the biggest
+		 * parts of individual XLogRecords so this might take several rounds.
+		 */
+		else if (state->in_bkp_blocks)
+		{
+			int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+			BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+			char* data = state->buf.bkp_block_data[blockno];
+
+			if (state->do_reassemble_record)
+			{
+				memcpy(data + BLCKSZ - bkpb->hole_length - state->remaining_size,
+				       state->cur_page + (state->curptr % XLOG_BLCKSZ),
+				       len_in_block);
+			}
+
+			XLogReaderInternalWrite(state,
+			                        state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			                        len_in_block);
+
+			COMP_CRC32(state->next_crc,
+			           state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			           len_in_block);
+
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "copying %u bytes of data for bkp block %d, complete %u",
+			     len_in_block, blockno, state->remaining_size);
+#endif
+		}
+		/*
+		 * read the (rest) of the XLogRecord's data. Note that this is not the
+		 * XLogRecord struct itself!
+		 */
+		else if (state->in_record)
+		{
+			if (state->do_reassemble_record)
+			{
+				if(state->buf.record_data_size < state->buf.record.xl_len){
+					state->buf.record_data_size = state->buf.record.xl_len;
+					state->buf.record_data =
+						realloc(state->buf.record_data,
+						        state->buf.record_data_size);
+					if(!state->buf.record_data)
+						elog(ERROR, "could not allocate memory for contents of an xlog record");
+				}
+
+				memcpy(state->buf.record_data
+				       + state->buf.record.xl_len
+				       - state->remaining_size,
+				       state->cur_page + (state->curptr % XLOG_BLCKSZ),
+				       len_in_block);
+			}
+			XLogReaderInternalWrite(state,
+			                        state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			                        len_in_block);
+
+
+			COMP_CRC32(state->next_crc,
+			           state->cur_page + (state->curptr % XLOG_BLCKSZ),
+			           len_in_block);
+
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "copying %u bytes into a record at off %u",
+			     len_in_block, (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+		}
+
+		/* should handle wrapping around to next page */
+		XLByteAdvance(state->curptr, len_in_block);
+
+		/* do the math of how much we need to read next round */
+		state->remaining_size -= len_in_block;
+
+		/*
+		 * --------------------------------------------------------------------
+		 * we completed whatever we were reading. So, handle going to the next
+		 * state.
+		 * --------------------------------------------------------------------
+		 */
+		if (state->remaining_size == 0)
+		{
+			/* completed reading - and potentially reassembling - the record */
+			if (state->in_record_header)
+			{
+				state->in_record_header = false;
+
+				/* ------------------------------------------------------------
+				 * normally we don't look at the content of xlog records here,
+				 * XLOG_SWITCH is a special case though, as everything left in
+				 * that segment won't be sensible content.
+				 * So skip to the next segment.
+				 * ------------------------------------------------------------
+				 */
+				if (state->buf.record.xl_rmid == RM_XLOG_ID
+				    && (state->buf.record.xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+				{
+					/*
+					 * Pretend the current data extends to end of segment
+					 */
+					elog(LOG, "XLOG_SWITCH");
+					state->curptr += XLogSegSize - 1;
+					state->curptr -= state->curptr % XLogSegSize;
+
+					state->in_record = false;
+					Assert(!state->in_bkp_blocks);
+					Assert(!state->in_skip);
+					continue;
+				}
+				else if (state->is_record_interesting == NULL ||
+				         state->is_record_interesting(state, &state->buf.record))
+				{
+					state->remaining_size = state->buf.record.xl_len;
+					Assert(state->in_bkp_blocks == 0);
+					Assert(!state->in_bkp_block_header);
+					Assert(!state->in_skip);
+#ifdef VERBOSE_DEBUG
+					elog(LOG, "found interesting record at %X/%X, prev: %X/%X, rmid %hhu, tx %u, len %u tot %u",
+					     (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+					     (uint32)(state->buf.record.xl_prev >> 32), (uint32)(state->buf.record.xl_prev),
+					     state->buf.record.xl_rmid, state->buf.record.xl_xid,
+					     state->buf.record.xl_len, state->buf.record.xl_tot_len);
+#endif
+
+				}
+				/* ------------------------------------------------------------
+				 * ok, everybody agrees, the contents of the current record are
+				 * just plain boring. So fake up a record that replaces it with
+				 * a NOOP record.
+				 *
+				 * FIXME: we should allow "compressing" the output here. That
+				 * is write something that shows how long the record should be
+				 * if everything is decompressed again. This can radically
+				 * reduce space-usage over the wire.
+				 * It could also be very useful for traditional SR by removing
+				 * unneeded BKP blocks from being transferred.  For that we
+				 * would need to recompute CRCs though, which we currently
+				 * don't support.
+				 * ------------------------------------------------------------
+				 */
+				else
+				{
+					/*
+					 * we need to fix up a fake record with correct length that
+					 * can be written out.
+					 */
+					XLogRecord spacer;
+
+					elog(LOG, "found boring record at %X/%X, rmid %hhu, tx %u, len %u tot %u",
+					     (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+					     state->buf.record.xl_rmid, state->buf.record.xl_xid,
+					     state->buf.record.xl_len, state->buf.record.xl_tot_len);
+
+					/*
+					 * xl_tot_len contains the size of the XLogRecord itself,
+					 * we read that already though.
+					 */
+					state->remaining_size = state->buf.record.xl_tot_len
+						- SizeOfXLogRecord;
+
+					state->in_record = true;
+					state->check_crc = true;
+					state->in_bkp_blocks = 0;
+					state->in_skip = true;
+
+					spacer.xl_prev = state->buf.origptr;
+					spacer.xl_xid = InvalidTransactionId;
+					spacer.xl_tot_len = state->buf.record.xl_tot_len;
+					spacer.xl_len = state->buf.record.xl_tot_len - SizeOfXLogRecord;
+					spacer.xl_rmid = RM_XLOG_ID;
+					spacer.xl_info = XLOG_NOOP;
+
+					XLogReaderInternalWrite(state, (char*)&spacer,
+					                        sizeof(XLogRecord));
+
+					/*
+					 * write out the padding in a separate write, otherwise we
+					 * would overrun the stack
+					 */
+					XLogReaderInternalWrite(state, NULL,
+					                        SizeOfXLogRecord - sizeof(XLogRecord));
+
+				}
+			}
+			/*
+			 * in the in_skip case we already read backup blocks because we
+			 * likely read record->xl_tot_len, so everything is finished.
+			 */
+			else if (state->in_skip)
+			{
+				state->in_record = false;
+				state->in_bkp_blocks = 0;
+				state->in_skip = false;
+				/* alignment is handled when starting to read a record */
+			}
+			/*
+			 * We read the header of the current block. Start reading the
+			 * content of that now.
+			 */
+			else if (state->in_bkp_block_header)
+			{
+				BkpBlock* bkpb;
+				int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+
+				Assert(state->in_bkp_blocks);
+
+				bkpb = &state->buf.bkp_block[blockno];
+
+				if(bkpb->hole_length >= BLCKSZ)
+				{
+					elog(ERROR, "hole_length of block %u is %u but maximum is %u",
+					     blockno, bkpb->hole_length, BLCKSZ);
+				}
+
+				if(bkpb->hole_offset >= BLCKSZ)
+				{
+					elog(ERROR, "hole_offset of block %u is %u but maximum is %u",
+					     blockno, bkpb->hole_offset, BLCKSZ);
+				}
+
+				state->remaining_size = BLCKSZ - bkpb->hole_length;
+				state->in_bkp_block_header = false;
+
+#ifdef VERBOSE_DEBUG
+				elog(LOG, "completed reading of header for %d, reading data now %u hole %u, off %u",
+				     blockno, state->remaining_size, bkpb->hole_length,
+				     bkpb->hole_offset);
+#endif
+			}
+			/*
+			 * The current backup block is finished, more could be following
+			 */
+			else if (state->in_bkp_blocks)
+			{
+				int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+				BkpBlock* bkpb;
+				char* bkpb_data;
+
+				Assert(!state->in_bkp_block_header);
+
+				bkpb = &state->buf.bkp_block[blockno];
+				bkpb_data = state->buf.bkp_block_data[blockno];
+
+				/*
+				 * reassemble block to its entirety by removing the bkp_hole
+				 * "compression"
+				 */
+				if(bkpb->hole_length){
+					memmove(bkpb_data + bkpb->hole_offset,
+					        bkpb_data + bkpb->hole_offset + bkpb->hole_length,
+					        BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+					memset(bkpb_data + bkpb->hole_offset,
+					       0,
+					       bkpb->hole_length);
+				}
+
+				state->in_bkp_blocks--;
+
+				state->in_skip = false;
+
+				if(!XLogReaderInternalNextBkpBlock(state))
+					goto all_bkp_finished;
+
+			}
+			/*
+			 * read a non-skipped record, start reading bkp blocks afterwards
+			 */
+			else if (state->in_record)
+			{
+				Assert(!state->in_skip);
+
+				state->in_bkp_blocks = XLR_MAX_BKP_BLOCKS;
+
+				if(!XLogReaderInternalNextBkpBlock(state))
+					goto all_bkp_finished;
+			}
+		}
+		/*
+		 * Something could only be partially read inside a single block because
+		 * of input or output space constraints..
+		 */
+		else if (partial_read)
+		{
+			partial_read = false;
+			goto not_enough_input;
+		}
+		else if (partial_write)
+		{
+			partial_write = false;
+			goto not_enough_output;
+		}
+		/*
+		 * Data continues into the next block.
+		 */
+		else
+		{
+		}
+
+#ifdef VERBOSE_DEBUG
+		elog(LOG, "one loop end: record: %u header: %u, skip: %u bkb_block: %d in_bkp_header: %u curpos: %X/%X remaining: %u, off: %u",
+		     state->in_record, state->in_record_header, state->in_skip,
+		     state->in_bkp_blocks, state->in_bkp_block_header,
+		     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+		     state->remaining_size,
+		     (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+		continue;
+
+		/*
+		 * we fully read a record. Process its contents if needed and start
+		 * reading the next record afterwards
+		 */
+	all_bkp_finished:
+		{
+			Assert(state->in_record);
+			Assert(!state->in_skip);
+			Assert(!state->in_bkp_block_header);
+			Assert(!state->in_bkp_blocks);
+
+			state->in_record = false;
+
+			/* compute and verify crc */
+			COMP_CRC32(state->next_crc,
+			           &state->buf.record,
+			           offsetof(XLogRecord, xl_crc));
+
+			FIN_CRC32(state->next_crc);
+
+			if (state->check_crc &&
+			    state->next_crc != state->buf.record.xl_crc) {
+				elog(ERROR, "crc mismatch: newly computed : %x, existing is %x",
+				     state->next_crc, state->buf.record.xl_crc);
+			}
+
+			/*
+			 * if we haven't reassembled the record there is no point in
+			 * calling the finished callback because we do not have any
+			 * interesting data. do_reassemble_record is false if we don't have
+			 * a finished_record callback.
+			 */
+			if (state->do_reassemble_record)
+			{
+				/* in stop_at_record_boundary thats a valid case */
+				if (state->finished_record)
+				{
+					state->finished_record(state, &state->buf);
+				}
+
+				if (state->stop_at_record_boundary)
+					goto out;
+			}
+
+			/* alignment is handled when starting to read a record */
+#ifdef VERBOSE_DEBUG
+			elog(LOG, "finished record at %X/%X to %X/%X, already_written_size: %lu, reas = %d",
+			     (uint32)(state->curptr >> 32), (uint32)state->curptr,
+			     (uint32)(state->endptr >> 32), (uint32)state->endptr,
+			     state->already_written_size, state->do_reassemble_record);
+#endif
+
+		}
+	}
+out:
+	/*
+	 * we are finished, check whether we finished everything, this may be
+	 * useful for the caller.
+	 */
+	if (state->in_skip)
+	{
+		state->incomplete = true;
+	}
+	else if (state->in_record)
+	{
+		state->incomplete = true;
+	}
+	else
+	{
+		state->incomplete = false;
+	}
+	return;
+
+not_enough_input:
+	/* signal we need more xlog and finish */
+	state->needs_input = true;
+	goto out;
+
+not_enough_output:
+	/* signal we need more space to write output to */
+	state->needs_output = true;
+	goto out;
+}
+
+XLogRecordBuffer*
+XLogReaderReadOne(XLogReaderState* state)
+{
+	bool was_set_to_stop = state->stop_at_record_boundary;
+	XLogRecPtr last_record = state->buf.origptr;
+
+	if (!was_set_to_stop)
+		state->stop_at_record_boundary = true;
+
+	XLogReaderRead(state);
+
+	if (!was_set_to_stop)
+		state->stop_at_record_boundary = false;
+
+	/* check that we fully read it and that its not the same as the last one */
+	if (state->incomplete ||
+	    XLByteEQ(last_record, state->buf.origptr))
+		return NULL;
+
+	return &state->buf;
+}
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..f45c90b
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,264 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ *		Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogreader.h
+ *
+ * NOTES
+ *		Check the definition of the XLogReaderState struct for instructions on
+ *		how to use the XLogReader infrastructure.
+ *
+ *		The basic idea is to allocate an XLogReaderState via
+ *		XLogReaderAllocate, fill out the wanted callbacks, set startptr/endptr
+ *		and call XLogReaderRead(state). That will iterate over the record as
+ *		long as it has enough input to reassemble a record calling
+ *		is_interesting/finish_record for every record found.
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+
+/*
+ * Used to store a reassembled record.
+ */
+typedef struct XLogRecordBuffer
+{
+	/* the record itself */
+	XLogRecord record;
+
+	/* the LSN at which this record was found */
+	XLogRecPtr origptr;
+
+	/* the data for xlog record */
+	char* record_data;
+	uint32 record_data_size;
+
+	BkpBlock bkp_block[XLR_MAX_BKP_BLOCKS];
+	char* bkp_block_data[XLR_MAX_BKP_BLOCKS];
+} XLogRecordBuffer;
+
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+typedef bool (*XLogReaderStateInterestingCB)(struct XLogReaderState* state,
+                                             XLogRecord* r);
+typedef void (*XLogReaderStateWriteoutCB)(struct XLogReaderState* state,
+                                          char* data, Size length);
+typedef void (*XLogReaderStateFinishedRecordCB)(struct XLogReaderState* state,
+                                                XLogRecordBuffer* buf);
+typedef void (*XLogReaderStateReadPageCB)(struct XLogReaderState* state,
+                                          char* cur_page, XLogRecPtr at);
+
+typedef struct XLogReaderState
+{
+	/* ----------------------------------------
+	 * Public parameters
+	 * ----------------------------------------
+	 */
+
+	/* callbacks */
+
+	/*
+	 * Called to decide whether an xlog record is interesting and should be
+	 * assembled, analyzed (finished_record) and written out, or skipped.
+	 *
+	 * Gets passed the current state as the first parameter and the record
+	 * *header* to decide over as the second.
+	 *
+	 * Return false to skip the record - and output a NOOP record instead - and
+	 * true to reassemble it fully.
+	 *
+	 * If set to NULL every record is considered to be interesting.
+	 */
+	XLogReaderStateInterestingCB is_record_interesting;
+
+	/*
+	 * Writeout xlog data.
+	 *
+	 * The 'state' parameter is passed as the first parameter and a pointer to
+	 * the 'data' and its 'length' as the second and third parameters. If
+	 * 'data' is NULL, zeroes need to be written out.
+	 */
+	XLogReaderStateWriteoutCB writeout_data;
+
+	/*
+	 * If set to anything but NULL this callback gets called after a record,
+	 * including the backup blocks, has been fully reassembled.
+	 *
+	 * The first parameter is the current 'state'. 'buf', an XLogRecordBuffer,
+	 * gets passed as the second parameter and contains the record header, its
+	 * data, original position/lsn and backup block.
+	 */
+	XLogReaderStateFinishedRecordCB finished_record;
+
+	/*
+	 * Data input function.
+	 *
+	 * This callback *has* to be implemented.
+	 *
+	 * Has to read XLOG_BLCKSZ bytes starting at the location 'at' into the
+	 * memory pointed to by cur_page, although everything beyond endptr does
+	 * not have to be valid.
+	 */
+	XLogReaderStateReadPageCB read_page;
+
+	/*
+	 * This can be used by the caller to pass state to the callbacks without
+	 * using global variables or such ugliness. It will neither be read nor
+	 * set by anything but your code.
+	 */
+	void* private_data;
+
+
+	/* from where to where are we reading */
+
+	/*
+	 * so we know where interesting data starts after scrolling back to the
+	 * beginning of a page
+	 */
+	XLogRecPtr startptr;
+
+	/* continue up to here in this run */
+	XLogRecPtr endptr;
+
+	/*
+	 * Size of the output buffer. If set to zero (the default) there is no
+	 * limit on the output buffer size.
+	Size output_buffer_size;
+
+	/*
+	 * Stop reading and return after every completed record.
+	 */
+	bool stop_at_record_boundary;
+
+	/* ----------------------------------------
+	 * output parameters
+	 * ----------------------------------------
+	 */
+
+	/* we need new input data - a later endptr - to continue reading */
+	bool needs_input;
+
+	/* we need new output space to continue reading */
+	bool needs_output;
+
+	/* track our progress */
+	XLogRecPtr curptr;
+
+	/*
+	 * are we in the middle of something? This is useful for the outside to
+	 * know whether to start reading anew
+	 */
+	bool incomplete;
+
+	/* ----------------------------------------
+	 * private/internal state
+	 * ----------------------------------------
+	 */
+
+	char cur_page[XLOG_BLCKSZ];
+	XLogPageHeader page_header;
+	uint32 page_header_size;
+	XLogRecordBuffer buf;
+	pg_crc32 next_crc;
+
+	/* ----------------------------------------
+	 * state machine variables
+	 * ----------------------------------------
+	 */
+
+	bool initialized;
+
+	/* are we currently reading a record? */
+	bool in_record;
+
+	/* are we currently reading a record header? */
+	bool in_record_header;
+
+	/* do we want to reassemble the record or just read/write it? */
+	bool do_reassemble_record;
+
+	/* how many bkp blocks remain to be read? */
+	int in_bkp_blocks;
+
+	/*
+	 * the header of a bkp block can be split across pages, so we need to
+	 * support reading that incrementally
+	 */
+	bool in_bkp_block_header;
+
+	/*
+	 * We are not interested in the contents of the next `remaining_size`
+	 * bytes. Don't retain their contents; write out zeroes instead.
+	 */
+	bool in_skip;
+
+	/*
+	 * Should we check the crc of the currently read record? In some situations
+	 * - e.g. if we just skip till the start of a record - this doesn't make
+	 * sense.
+	 *
+	 * This needs to be separate from in_skip because we want to be able to
+	 * skip writing out records but still verify them, e.g. records that are
+	 * "not interesting".
+	 */
+	bool check_crc;
+
+	/* how much more to read in the current state */
+	uint32 remaining_size;
+
+	/* size of already written data */
+	Size already_written_size;
+
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback, startptr and endptr have to be set before
+ * the reader can be used.
+ */
+extern XLogReaderState* XLogReaderAllocate(void);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState*);
+
+/*
+ * Reset internal state so it can be used without continuing from the last
+ * state.
+ *
+ * The callbacks and private_data won't be reset
+ */
+extern void XLogReaderReset(XLogReaderState* state);
+
+/*
+ * Read the xlog and call the appropriate callbacks as far as possible within
+ * the constraints of input data (startptr, endptr) and output space.
+ */
+extern void XLogReaderRead(XLogReaderState* state);
+
+/*
+ * Read the next xlog record if enough input/output is available.
+ *
+ * This is a bit less efficient than XLogReaderRead.
+ *
+ * Returns NULL if the next record couldn't be read for some reason. Check
+ * state->incomplete, ->needs_input, ->needs_output.
+ *
+ * Be careful to check that there is anything further to read when using
+ * ->endptr, otherwise it's easy to get into an endless loop.
+ */
+extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);
+
+#endif /* XLOGREADER_H */
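
To illustrate the intended use of the record-at-a-time API, here is a minimal
consumer sketch. This is not part of the patch: my_read_page, the LSN
variables and process_record are assumptions for illustration only, and
whether additional callbacks need to be set for full reassembly should be
checked against the header above.

	XLogReaderState *state = XLogReaderAllocate();
	XLogRecordBuffer *buf;

	/* hypothetical callback that reads the XLOG_BLCKSZ page at 'at' */
	state->read_page = my_read_page;
	state->startptr = start_lsn;	/* hypothetical start LSN */
	state->endptr = end_lsn;		/* hypothetical end LSN */

	/* XLogReaderReadOne returns NULL once no complete record is available */
	while ((buf = XLogReaderReadOne(state)) != NULL)
	{
		/* buf->record is the header, buf->record_data the payload */
		process_record(buf);		/* hypothetical consumer */
	}

	/* per the comments above, check why we stopped before looping again */
	if (state->needs_input)
		; /* supply a later endptr and call XLogReaderReadOne again */

	XLogReaderFree(state);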
#5Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 4/8] add simple xlogdump tool

---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 ++++
src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 360 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c

Attachments:

0004-add-simple-xlogdump-tool.patchtext/x-patch; name=0004-add-simple-xlogdump-tool.patchDownload
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = initdb pg_ctl pg_dump \
-	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
 
 ifeq ($(PORTNAME), win32)
 SUBDIRS += pgevent
diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 0000000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
# src/bin/xlogdump/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+	 $(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 0000000..8e13193
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,334 @@
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+
+/*
+ * needs to be declared because otherwise it's defined in main.c, which we
+ * cannot link from here.
+ */
+const char *progname = "xlogdump";
+
+static void
+XLogDumpXLogRead(char *buf, TimeLineID timeline_id, XLogRecPtr startptr, Size count);
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, XLogRecPtr startptr,
+                  char *buf, Size count);
+
+#define XLogFilePathWrite(path, base, tli, logSegNo)			\
+	snprintf(path, MAXPGPATH, "%s/%08X%08X%08X", base, tli,		\
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId),		\
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, XLogRecPtr startptr,
+                  char *buf, Size count)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	static int	sendFile = -1;
+	static XLogSegNo sendSegNo = 0;
+	static uint32 sendOff = 0;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			writebytes;
+
+		startoff = recptr % XLogSegSize;
+
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		{
+			char		path[MAXPGPATH];
+
+			/* Switch to another logfile segment */
+			if (sendFile >= 0)
+				close(sendFile);
+
+			XLByteToSeg(recptr, sendSegNo);
+			XLogFilePathWrite(path, directory, timeline_id, sendSegNo);
+
+			sendFile = open(path, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
+			if (sendFile < 0)
+			{
+				ereport(ERROR,
+				        (errcode_for_file_access(),
+				         errmsg("could not open file \"%s\": %m",
+				                path)));
+			}
+			sendOff = 0;
+		}
+
+		/* Need to seek in the file? */
+		if (sendOff != startoff)
+		{
+			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+				char fname[MAXPGPATH];
+				XLogFileName(fname, timeline_id, sendSegNo);
+
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not seek in log segment %s to offset %u: %m",
+						        fname,
+								startoff)));
+			}
+			sendOff = startoff;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (XLogSegSize - startoff))
+			segbytes = XLogSegSize - startoff;
+		else
+			segbytes = nbytes;
+
+		writebytes = write(sendFile, p, segbytes);
+		if (writebytes <= 0)
+		{
+			char fname[MAXPGPATH];
+			XLogFileName(fname, timeline_id, sendSegNo);
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+			errmsg("could not write to log segment %s, offset %u, length %lu: %m",
+				   fname,
+				   sendOff, (unsigned long) segbytes)));
+		}
+
+		/* Update state for read */
+		XLByteAdvance(recptr, writebytes);
+
+		sendOff += writebytes;
+		nbytes -= writebytes;
+		p += writebytes;
+	}
+}
+
+/* this should probably be put in a general implementation */
+static void
+XLogDumpXLogRead(char *buf, TimeLineID timeline_id, XLogRecPtr startptr, Size count)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	static int	sendFile = -1;
+	static XLogSegNo sendSegNo = 0;
+	static uint32 sendOff = 0;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+
+		startoff = recptr % XLogSegSize;
+
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		{
+			char		path[MAXPGPATH];
+
+			/* Switch to another logfile segment */
+			if (sendFile >= 0)
+				close(sendFile);
+
+			XLByteToSeg(recptr, sendSegNo);
+			XLogFilePath(path, timeline_id, sendSegNo);
+
+			sendFile = open(path, O_RDONLY, 0);
+			if (sendFile < 0)
+			{
+				char fname[MAXPGPATH];
+				XLogFileName(fname, timeline_id, sendSegNo);
+				/*
+				 * If the file is not found, assume it's because the standby
+				 * asked for a too old WAL segment that has already been
+				 * removed or recycled.
+				 */
+				if (errno == ENOENT)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("requested WAL segment %s has already been removed",
+									fname)));
+				else
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m",
+									path)));
+			}
+			sendOff = 0;
+		}
+
+		/* Need to seek in the file? */
+		if (sendOff != startoff)
+		{
+			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+				char fname[MAXPGPATH];
+				XLogFileName(fname, timeline_id, sendSegNo);
+
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not seek in log segment %s to offset %u: %m",
+						        fname,
+								startoff)));
+			}
+			sendOff = startoff;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (XLogSegSize - startoff))
+			segbytes = XLogSegSize - startoff;
+		else
+			segbytes = nbytes;
+
+		readbytes = read(sendFile, p, segbytes);
+		if (readbytes <= 0)
+		{
+			char fname[MAXPGPATH];
+			XLogFileName(fname, timeline_id, sendSegNo);
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+			errmsg("could not read from log segment %s, offset %u, length %lu: %m",
+				   fname,
+				   sendOff, (unsigned long) segbytes)));
+		}
+
+		/* Update state for read */
+		XLByteAdvance(recptr, readbytes);
+
+		sendOff += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+}
+
+static void
+XLogDumpReadPage(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+	XLogPageHeader page_header;
+
+	Assert((startptr % XLOG_BLCKSZ) == 0);
+
+	/* FIXME: more sensible/efficient implementation */
+	XLogDumpXLogRead(cur_page, 1, startptr, XLOG_BLCKSZ);
+
+	page_header = (XLogPageHeader) cur_page;
+
+	if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+	{
+		elog(FATAL, "page header magic %x, should be %x at %X/%X",
+		     page_header->xlp_magic, XLOG_PAGE_MAGIC,
+		     (uint32) (startptr >> 32), (uint32) startptr);
+	}
+}
+
+static void
+XLogDumpWrite(XLogReaderState* state, char* data, Size len)
+{
+	static char zero[XLOG_BLCKSZ];
+	if(data == NULL)
+		data = zero;
+
+	XLogDumpXLogWrite("/tmp/xlog", 1 /* FIXME */, state->curptr,
+	                  data, len);
+}
+
+static void
+XLogDumpFinishedRecord(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+	XLogRecord *record = &buf->record;
+	const RmgrData *rmgr = &RmgrTable[record->xl_rmid];
+
+	/* makeStringInfo() already initializes the buffer */
+	StringInfo str = makeStringInfo();
+
+	rmgr->rm_desc(str, record->xl_info, buf->record_data);
+
+	fprintf(stderr, "xlog record: rmgr: %-11s, record_len: %6u, tot_len: %6u, tx: %10u, lsn: %X/%-8X, prev %X/%-8X, bkp: %u%u%u%u, desc: %s\n",
+	       rmgr->rm_name,
+	       record->xl_len, record->xl_tot_len,
+	       record->xl_xid,
+	       (uint32)(buf->origptr >> 32), (uint32)buf->origptr,
+	       (uint32)(record->xl_prev >> 32), (uint32)record->xl_prev,
+	       !!(XLR_BKP_BLOCK_1 & buf->record.xl_info),
+	       !!(XLR_BKP_BLOCK_2 & buf->record.xl_info),
+	       !!(XLR_BKP_BLOCK_3 & buf->record.xl_info),
+	       !!(XLR_BKP_BLOCK_4 & buf->record.xl_info),
+	       str->data);
+
+}
+
+
+static void init(void)
+{
+	MemoryContextInit();
+	IsPostmasterEnvironment = false;
+	log_min_messages = DEBUG1;
+	Log_error_verbosity = PGERROR_TERSE;
+	pg_timezone_initialize();
+}
+
+int main(int argc, char **argv)
+{
+	uint32 xlogid;
+	uint32 xrecoff;
+	XLogReaderState *xlogreader_state;
+	XLogRecPtr from, to;
+
+	init();
+
+	/* FIXME: should use getopt */
+	if (argc < 4)
+		elog(ERROR, "xlogdump timeline_id start finish");
+
+	if (sscanf(argv[2], "%X/%X", &xlogid, &xrecoff) != 2)
+		elog(ERROR, "couldn't parse argv[2]");
+
+	from = (((uint64)xlogid) << 32) | xrecoff;
+
+	if (sscanf(argv[3], "%X/%X", &xlogid, &xrecoff) != 2)
+		elog(ERROR, "couldn't parse argv[3]");
+
+	to = (uint64)xlogid << 32 | xrecoff;
+
+	xlogreader_state = XLogReaderAllocate();
+
+	/*
+	 * not set because we want all records, perhaps we want filtering later?
+	 * xlogreader_state->is_record_interesting =
+	 */
+	xlogreader_state->finished_record = XLogDumpFinishedRecord;
+
+	/* write a copy of the read xlog to /tmp/xlog */
+	xlogreader_state->writeout_data = XLogDumpWrite;
+
+	xlogreader_state->read_page = XLogDumpReadPage;
+
+	xlogreader_state->private_data = NULL;
+
+	xlogreader_state->startptr = from;
+	xlogreader_state->endptr = to;
+
+	XLogReaderRead(xlogreader_state);
+	XLogReaderFree(xlogreader_state);
+	return 0;
+}
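
For reference, the tool as posted takes a timeline id plus start and end LSNs
on the command line; note that the timeline argument is currently unused (the
timeline is hard-coded to 1 in the read path). The LSNs here are only
placeholders:

	$ xlogdump 1 0/1893D78 0/18BE398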
#6Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 5/8] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

This patch is problematic because, formally, indexes used by syscaches need to
be unique; this one is not, because nailed/shared catalog entries have
0/InvalidOid relfilenode values. Those values cannot be sensibly queried from
the catalog anyway (the relmapper infrastructure needs to be used for them).

It might be nicer to add infrastructure to do this properly, I just don't have
a clue what the best way to do that would be.
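
For illustration, a lookup through the new cache would look about like this (a
sketch only; error handling is elided, and as noted above it cannot return
anything useful for nailed/shared relations whose relfilenode is 0):

	HeapTuple	tup;

	tup = SearchSysCache2(RELFILENODE,
						  ObjectIdGetDatum(reltablespace),
						  ObjectIdGetDatum(relfilenode));
	if (HeapTupleIsValid(tup))
	{
		/* e.g. map (reltablespace, relfilenode) back to the relation's oid */
		Oid		reloid = HeapTupleGetOid(tup);

		elog(LOG, "relation oid: %u", reloid);
		ReleaseSysCache(tup);
	}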
---
src/backend/utils/cache/syscache.c | 11 +++++++++++
src/include/catalog/indexing.h | 2 ++
src/include/catalog/pg_proc.h | 1 +
src/include/utils/syscache.h | 1 +
4 files changed, 15 insertions(+)

Attachments:

0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patchtext/x-patch; name=0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patchDownload
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ca22efd..9d2f6b7 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -613,6 +613,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		1024
 	},
+	{RelationRelationId,		/* RELFILENODE */
+		ClassTblspcRelfilenodeIndexId,
+		2,
+		{
+			Anum_pg_class_reltablespace,
+			Anum_pg_class_relfilenode,
+			0,
+			0
+		},
+		1024
+	},
 	{RewriteRelationId,			/* RULERELNAME */
 		RewriteRelRulenameIndexId,
 		2,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 238fe58..c0a9339 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 2844, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId  2844
 
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId 3164
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 77a3b41..d88248a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4667,6 +4667,7 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 
+
 /*
  * Symbolic values for provolatile column: these indicate whether the result
  * of a function is dependent *only* on the values of its explicit arguments,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d1a9855..9a39077 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,7 @@ enum SysCacheIdentifier
 	RANGETYPE,
 	RELNAMENSP,
 	RELOID,
+	RELFILENODE,
 	RULERELNAME,
 	STATRELATTINH,
 	TABLESPACEOID,
#7Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 6/8] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

This adds a new wal_level value 'logical'

Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
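
To actually exercise the new logging the cluster has to run at the new level;
wal_level is a postmaster-only GUC, so this needs a restart:

	# postgresql.conf
	wal_level = logical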
---
src/backend/access/heap/heapam.c | 135 +++++++++++++++++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 +++++++++++++++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
6 files changed, 207 insertions(+), 12 deletions(-)

Attachments:

0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patchtext/x-patch; name=0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f56b577..190ae03 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1938,10 +1939,19 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[3];
+		XLogRecData rdata[4];
 		Page		page = BufferGetPage(buffer);
 		uint8		info = XLOG_HEAP_INSERT;
 
+		/*
+		 * For the logical replication case we need the tuple even if we're
+		 * doing a full-page write. We could alternatively store a pointer
+		 * into the fpw though. For that to work we add another rdata entry
+		 * for the buffer in that case.
+		 */
+		bool        need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
 		xlrec.all_visible_cleared = all_visible_cleared;
 		xlrec.target.node = relation->rd_node;
 		xlrec.target.tid = heaptup->t_self;
@@ -1961,18 +1971,32 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		 */
 		rdata[1].data = (char *) &xlhdr;
 		rdata[1].len = SizeOfHeapHeader;
-		rdata[1].buffer = buffer;
+		rdata[1].buffer = need_tuple ? InvalidBuffer : buffer;
 		rdata[1].buffer_std = true;
 		rdata[1].next = &(rdata[2]);
 
 		/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
 		rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
 		rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-		rdata[2].buffer = buffer;
+		rdata[2].buffer = need_tuple ? InvalidBuffer : buffer;
 		rdata[2].buffer_std = true;
 		rdata[2].next = NULL;
 
 		/*
+		 * Add an rdata entry for the buffer without actual content; it's
+		 * removed again if an fpw is done for that buffer.
+		 */
+		if(need_tuple){
+			rdata[2].next = &(rdata[3]);
+
+			rdata[3].data = NULL;
+			rdata[3].len = 0;
+			rdata[3].buffer = buffer;
+			rdata[3].buffer_std = true;
+			rdata[3].next = NULL;
+		}
+
+		/*
 		 * If this is the single and first tuple on page, we can reinit the
 		 * page instead of restoring the whole thing.  Set flag, and hide
 		 * buffer references from XLogInsert.
@@ -1981,7 +2005,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 		{
 			info |= XLOG_HEAP_INIT_PAGE;
-			rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
+			rdata[1].buffer = rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
 		}
 
 		recptr = XLogInsert(RM_HEAP_ID, info, rdata);
@@ -2569,7 +2593,9 @@ l1:
 	{
 		xl_heap_delete xlrec;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[2];
+		XLogRecData rdata[4];
+
+		bool		need_tuple = wal_level == WAL_LEVEL_LOGICAL &&
+			relation->rd_id >= FirstNormalObjectId;
 
 		xlrec.all_visible_cleared = all_visible_cleared;
 		xlrec.target.node = relation->rd_node;
@@ -2585,6 +2611,73 @@ l1:
 		rdata[1].buffer_std = true;
 		rdata[1].next = NULL;
 
+		/*
+		 * XXX: We could decide not to log changes when the origin is not the
+		 * local node, that should reduce redundant logging.
+		 */
+		if(need_tuple){
+			xl_heap_header xlhdr;
+
+			Oid indexoid = InvalidOid;
+			int16 pknratts;
+			int16 pkattnum[INDEX_MAX_KEYS];
+			Oid pktypoid[INDEX_MAX_KEYS];
+			Oid pkopclass[INDEX_MAX_KEYS];
+			TupleDesc desc = RelationGetDescr(relation);
+			Relation index_rel;
+			TupleDesc indexdesc;
+			int natt;
+
+			Datum idxvals[INDEX_MAX_KEYS];
+			bool idxisnull[INDEX_MAX_KEYS];
+			HeapTuple idxtuple;
+
+			MemSet(pkattnum, 0, sizeof(pkattnum));
+			MemSet(pktypoid, 0, sizeof(pktypoid));
+			MemSet(pkopclass, 0, sizeof(pkopclass));
+			MemSet(idxvals, 0, sizeof(idxvals));
+			MemSet(idxisnull, 0, sizeof(idxisnull));
+			relationFindPrimaryKey(relation, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+			if(!indexoid){
+				elog(WARNING, "Could not find primary key for table with oid %u",
+				     relation->rd_id);
+				goto no_index_found;
+			}
+
+			index_rel = index_open(indexoid, AccessShareLock);
+
+			indexdesc = RelationGetDescr(index_rel);
+
+			for(natt = 0; natt < indexdesc->natts; natt++){
+				idxvals[natt] =
+					fastgetattr(&tp, pkattnum[natt], desc, &idxisnull[natt]);
+				Assert(!idxisnull[natt]);
+			}
+
+			idxtuple = heap_form_tuple(indexdesc, idxvals, idxisnull);
+
+			xlhdr.t_infomask2 = idxtuple->t_data->t_infomask2;
+			xlhdr.t_infomask = idxtuple->t_data->t_infomask;
+			xlhdr.t_hoff = idxtuple->t_data->t_hoff;
+
+			rdata[1].next = &(rdata[2]);
+			rdata[2].data = (char*)&xlhdr;
+			rdata[2].len = SizeOfHeapHeader;
+			rdata[2].buffer = InvalidBuffer;
+			rdata[2].next = NULL;
+
+			rdata[2].next = &(rdata[3]);
+			rdata[3].data = (char *) idxtuple->t_data + offsetof(HeapTupleHeaderData, t_bits);
+			rdata[3].len = idxtuple->t_len - offsetof(HeapTupleHeaderData, t_bits);
+			rdata[3].buffer = InvalidBuffer;
+			rdata[3].next = NULL;
+
+			heap_close(index_rel, NoLock);
+		no_index_found:
+			;
+		}
+
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata);
 
 		PageSetLSN(page, recptr);
@@ -4414,9 +4507,14 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 	xl_heap_header xlhdr;
 	uint8		info;
 	XLogRecPtr	recptr;
-	XLogRecData rdata[4];
+	XLogRecData rdata[5];
 	Page		page = BufferGetPage(newbuf);
 
+	/*
+	 * Just as for XLOG_HEAP_INSERT we need to make sure the tuple data is
+	 * logged even if a full-page write is done.
+	 */
+	bool        need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
 	/* Caller should not call me on a non-WAL-logged relation */
 	Assert(RelationNeedsWAL(reln));
 
@@ -4447,28 +4545,43 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 	xlhdr.t_hoff = newtup->t_data->t_hoff;
 
 	/*
-	 * As with insert records, we need not store the rdata[2] segment if we
-	 * decide to store the whole buffer instead.
+	 * As with insert's logging, we need not store the tuple data separately
+	 * from the buffer - unless we are doing logical replication, that is.
 	 */
 	rdata[2].data = (char *) &xlhdr;
 	rdata[2].len = SizeOfHeapHeader;
-	rdata[2].buffer = newbuf;
+	rdata[2].buffer = need_tuple ? InvalidBuffer : newbuf;
 	rdata[2].buffer_std = true;
 	rdata[2].next = &(rdata[3]);
 
 	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
 	rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
 	rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-	rdata[3].buffer = newbuf;
+	rdata[3].buffer = need_tuple ? InvalidBuffer : newbuf;
 	rdata[3].buffer_std = true;
 	rdata[3].next = NULL;
 
+	/*
+	 * separate storage for the buffer reference of the new page in the
+	 * wal_level=logical case
+	 */
+	if(need_tuple){
+		rdata[3].next = &(rdata[4]);
+
+		rdata[4].data = NULL;
+		rdata[4].len = 0;
+		rdata[4].buffer = newbuf;
+		rdata[4].buffer_std = true;
+		rdata[4].next = NULL;
+	}
+
 	/* If new tuple is the single and first tuple on page... */
 	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
 		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 	{
 		info |= XLOG_HEAP_INIT_PAGE;
-		rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
+		rdata[2].buffer = rdata[3].buffer = rdata[4].buffer = InvalidBuffer;
 	}
 
 	recptr = XLogInsert(RM_HEAP_ID, info, rdata);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..53a0bc8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -107,6 +107,7 @@ const struct config_enum_entry wal_level_options[] = {
 	{"minimal", WAL_LEVEL_MINIMAL, false},
 	{"archive", WAL_LEVEL_ARCHIVE, false},
 	{"hot_standby", WAL_LEVEL_HOT_STANDBY, false},
+	{"logical", WAL_LEVEL_LOGICAL, false},
 	{NULL, 0, false}
 };
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 464950b..8145997 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -49,6 +49,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parser.h"
+#include "parser/parse_relation.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -3322,3 +3323,76 @@ ResetReindexPending(void)
 {
 	pendingReindexedIndexes = NIL;
 }
+
+/*
+ * relationFindPrimaryKey
+ *		Find primary key for a relation if it exists.
+ *
+ * If no primary key is found *indexOid is set to InvalidOid
+ *
+ * This is quite similar to tablecmds.c's transformFkeyGetPrimaryKey.
+ *
+ * XXX: It might be a good idea to use pg_class.relhaspkey to make this more
+ * efficient.
+ */
+void
+relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+                       int16 *nratts, int16 *attnums, Oid *atttypids,
+                       Oid *opclasses)
+{
+	List *indexoidlist;
+	ListCell *indexoidscan;
+	HeapTuple indexTuple = NULL;
+	Datum indclassDatum;
+	bool isnull;
+	oidvector  *indclass;
+	int i;
+	Form_pg_index indexStruct = NULL;
+
+	*indexOid = InvalidOid;
+
+	indexoidlist = RelationGetIndexList(pkrel);
+
+	foreach(indexoidscan, indexoidlist)
+	{
+		Oid indexoid = lfirst_oid(indexoidscan);
+
+		indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
+		if(!HeapTupleIsValid(indexTuple))
+			elog(ERROR, "cache lookup failed for index %u", indexoid);
+
+		indexStruct = (Form_pg_index) GETSTRUCT(indexTuple);
+		if(indexStruct->indisprimary && indexStruct->indimmediate)
+		{
+			*indexOid = indexoid;
+			break;
+		}
+		ReleaseSysCache(indexTuple);
+
+	}
+	list_free(indexoidlist);
+
+	if (!OidIsValid(*indexOid))
+		return;
+
+	/* Must get indclass the hard way */
+	indclassDatum = SysCacheGetAttr(INDEXRELID, indexTuple,
+									Anum_pg_index_indclass, &isnull);
+	Assert(!isnull);
+	indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+	*nratts = indexStruct->indnatts;
+	/*
+	 * Now build the list of PK attributes from the indkey definition (we
+	 * assume a primary key cannot have expressional elements)
+	 */
+	for (i = 0; i < indexStruct->indnatts; i++)
+	{
+		int			pkattno = indexStruct->indkey.values[i];
+
+		attnums[i] = pkattno;
+		atttypids[i] = attnumTypeId(pkrel, pkattno);
+		opclasses[i] = indclass->values[i];
+	}
+
+	ReleaseSysCache(indexTuple);
+}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 129c4d0..10080d0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -77,6 +77,8 @@ wal_level_str(WalLevel wal_level)
 			return "archive";
 		case WAL_LEVEL_HOT_STANDBY:
 			return "hot_standby";
+		case WAL_LEVEL_LOGICAL:
+			return "logical";
 	}
 	return _("unrecognized wal_level");
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b..7d90416 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -200,7 +200,8 @@ typedef enum WalLevel
 {
 	WAL_LEVEL_MINIMAL = 0,
 	WAL_LEVEL_ARCHIVE,
-	WAL_LEVEL_HOT_STANDBY
+	WAL_LEVEL_HOT_STANDBY,
+	WAL_LEVEL_LOGICAL
 } WalLevel;
 extern int	wal_level;
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index eb417ce..3de0a29 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -102,4 +102,8 @@ extern bool ReindexIsProcessingHeap(Oid heapOid);
 extern bool ReindexIsProcessingIndex(Oid indexOid);
 extern Oid	IndexGetRelation(Oid indexId, bool missing_ok);
 
+extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+                                   int16 *nratts, int16 *attnums, Oid *atttypids,
+                                   Oid *opclasses);
+
 #endif   /* INDEX_H */
#8Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 7/8] Make InvalidateSystemCaches public

Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode
---
src/backend/utils/cache/inval.c | 2 +-
src/include/utils/inval.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

Attachments:

0007-Make-InvalidateSystemCaches-public.patchtext/x-patch; name=0007-Make-InvalidateSystemCaches-public.patchDownload
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index e26bf0b..c75c032 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -547,7 +547,7 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
  *		since that tells us we've lost some shared-inval messages and hence
  *		don't know what needs to be invalidated.
  */
-static void
+void
 InvalidateSystemCaches(void)
 {
 	int			i;
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index c5549a6..648bfdc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -67,4 +67,6 @@ extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 extern void inval_twophase_postcommit(TransactionId xid, uint16 info,
 						  void *recdata, uint32 len);
 
+extern void InvalidateSystemCaches(void);
+
 #endif   /* INVAL_H */
#9Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
[PATCH 8/8] Introduce wal decoding via catalog timetravel

This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit
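
To sketch how a consumer hooks these up - the callback signatures here are
inferred from the call sites in ApplyCacheCommit(), so they may not match
applycache.h exactly, and the bodies are placeholders:

	static void
	my_begin(ApplyCache *cache, ApplyCacheTXN *txn)
	{
		elog(LOG, "BEGIN %u", txn->xid);
	}

	static void
	my_change(ApplyCache *cache, ApplyCacheTXN *txn,
			  ApplyCacheTXN *subtxn, ApplyCacheChange *change)
	{
		/* apply a single decoded change here */
	}

	static void
	my_commit(ApplyCache *cache, ApplyCacheTXN *txn)
	{
		elog(LOG, "COMMIT %u", txn->xid);
	}

	...

	ApplyCache *cache = ApplyCacheAllocate();

	cache->begin = my_begin;
	cache->apply_change = my_change;
	cache->commit = my_commit;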

It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (already has it, just doesn't pass)

The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/applycache.c | 574 +++++++++++++
src/backend/replication/logical/decode.c | 366 +++++++++
src/backend/replication/logical/logicalfuncs.c | 237 ++++++
src/backend/replication/logical/snapbuild.c | 1045 ++++++++++++++++++++++++
src/backend/utils/time/tqual.c | 161 ++++
src/include/access/transam.h | 5 +
src/include/catalog/pg_proc.h | 3 +
src/include/replication/applycache.h | 239 ++++++
src/include/replication/decode.h | 26 +
src/include/replication/snapbuild.h | 119 +++
src/include/utils/tqual.h | 21 +-
13 files changed, 2816 insertions(+), 1 deletion(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/applycache.c
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/applycache.h
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/snapbuild.h

Attachments:

0008-Introduce-wal-decoding-via-catalog-timetravel.patchtext/x-patch; name=0008-Introduce-wal-decoding-via-catalog-timetravel.patchDownload
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 9d9ec87..ae7f6b1 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -17,6 +17,8 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
 OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
 	repl_gram.o syncrep.o
 
+SUBDIRS = logical
+
 include $(top_srcdir)/src/backend/common.mk
 
 # repl_scanner is compiled as part of repl_gram
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
new file mode 100644
index 0000000..4e56769
--- /dev/null
+++ b/src/backend/replication/logical/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/replication/logical
+#
+# IDENTIFICATION
+#    src/backend/replication/logical/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/logical
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
+
+OBJS = applycache.o decode.o snapbuild.o logicalfuncs.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/applycache.c b/src/backend/replication/logical/applycache.c
new file mode 100644
index 0000000..1e08371
--- /dev/null
+++ b/src/backend/replication/logical/applycache.c
@@ -0,0 +1,574 @@
+/*-------------------------------------------------------------------------
+ *
+ * applycache.c
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/logical/applycache.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/xact.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "replication/applycache.h"
+
+#include "lib/simpleheap.h"
+
+#include "utils/ilist.h"
+#include "utils/memutils.h"
+#include "utils/relcache.h"
+#include "utils/tqual.h"
+#include "utils/syscache.h"
+
+
+const Size max_memtries = 1<<16;
+
+const size_t max_cached_changes = 1024;
+const size_t max_cached_tuplebufs = 1024; /* ~8MB */
+const size_t max_cached_transactions = 512;
+
+typedef struct ApplyCacheTXNByIdEnt
+{
+	TransactionId xid;
+	ApplyCacheTXN* txn;
+} ApplyCacheTXNByIdEnt;
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache);
+static void ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn);
+
+static ApplyCacheTXN* ApplyCacheTXNByXid(ApplyCache*, TransactionId xid,
+                                         bool create, bool* is_new);
+
+
+ApplyCache*
+ApplyCacheAllocate(void)
+{
+	ApplyCache* cache = (ApplyCache*)malloc(sizeof(ApplyCache));
+	HASHCTL         hash_ctl;
+
+	if (!cache)
+		elog(ERROR, "Could not allocate the ApplyCache");
+
+	cache->build_snapshots = true;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	cache->context = AllocSetContextCreate(TopMemoryContext,
+	                                       "ApplyCache",
+	                                       ALLOCSET_DEFAULT_MINSIZE,
+	                                       ALLOCSET_DEFAULT_INITSIZE,
+	                                       ALLOCSET_DEFAULT_MAXSIZE);
+
+	hash_ctl.keysize = sizeof(TransactionId);
+	hash_ctl.entrysize = sizeof(ApplyCacheTXNByIdEnt);
+	hash_ctl.hash = tag_hash;
+	hash_ctl.hcxt = cache->context;
+
+	cache->by_txn = hash_create("ApplyCacheByXid", 1000, &hash_ctl,
+	                            HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+	cache->nr_cached_transactions = 0;
+	cache->nr_cached_changes = 0;
+	cache->nr_cached_tuplebufs = 0;
+
+	ilist_d_init(&cache->cached_transactions);
+	ilist_d_init(&cache->cached_changes);
+	ilist_s_init(&cache->cached_tuplebufs);
+
+	return cache;
+}
+
+void ApplyCacheFree(ApplyCache* cache)
+{
+	/* FIXME: check for in-progress transactions */
+	/* FIXME: clean up cached transaction */
+	/* FIXME: clean up cached changes */
+	/* FIXME: clean up cached tuplebufs */
+	hash_destroy(cache->by_txn);
+	free(cache);
+}
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache)
+{
+	ApplyCacheTXN* txn;
+
+	if (cache->nr_cached_transactions)
+	{
+		cache->nr_cached_transactions--;
+		txn = ilist_container(ApplyCacheTXN, node,
+		                      ilist_d_pop_front(&cache->cached_transactions));
+	}
+	else
+	{
+		txn = (ApplyCacheTXN*)
+			malloc(sizeof(ApplyCacheTXN));
+
+		if (!txn)
+			elog(ERROR, "Could not allocate a ApplyCacheTXN struct");
+	}
+
+	memset(txn, 0, sizeof(ApplyCacheTXN));
+	ilist_d_init(&txn->changes);
+	ilist_d_init(&txn->subtxns);
+	ilist_d_init(&txn->snapshots);
+	ilist_d_init(&txn->commandids);
+
+	return txn;
+}
+
+void ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn)
+{
+	if(cache->nr_cached_transactions < max_cached_transactions){
+		cache->nr_cached_transactions++;
+		ilist_d_push_front(&cache->cached_transactions, &txn->node);
+	}
+	else{
+		free(txn);
+	}
+}
+
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache* cache)
+{
+	ApplyCacheChange* change;
+
+	if (cache->nr_cached_changes)
+	{
+		cache->nr_cached_changes--;
+		change = ilist_container(ApplyCacheChange, node,
+		                         ilist_d_pop_front(&cache->cached_changes));
+	}
+	else
+	{
+		change = (ApplyCacheChange*)malloc(sizeof(ApplyCacheChange));
+
+		if (!change)
+			elog(ERROR, "Could not allocate a ApplyCacheChange struct");
+	}
+
+
+	memset(change, 0, sizeof(ApplyCacheChange));
+	return change;
+}
+
+void
+ApplyCacheReturnChange(ApplyCache* cache, ApplyCacheChange* change)
+{
+	switch(change->action){
+		case APPLY_CACHE_CHANGE_INSERT:
+		case APPLY_CACHE_CHANGE_UPDATE:
+		case APPLY_CACHE_CHANGE_DELETE:
+			if (change->newtuple)
+			{
+				ApplyCacheReturnTupleBuf(cache, change->newtuple);
+				change->newtuple = NULL;
+			}
+
+			if (change->oldtuple)
+			{
+				ApplyCacheReturnTupleBuf(cache, change->oldtuple);
+				change->oldtuple = NULL;
+			}
+
+			if (change->table)
+			{
+				heap_freetuple(change->table);
+				change->table = NULL;
+			}
+			break;
+		case APPLY_CACHE_CHANGE_SNAPSHOT:
+			if (change->snapshot)
+			{
+				/* FIXME: free snapshot */
+				change->snapshot = NULL;
+			}
+		case APPLY_CACHE_CHANGE_COMMAND_ID:
+			break;
+	}
+
+	if(cache->nr_cached_changes < max_cached_changes){
+		cache->nr_cached_changes++;
+		ilist_d_push_front(&cache->cached_changes, &change->node);
+	}
+	else{
+		free(change);
+	}
+}
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache* cache)
+{
+	ApplyCacheTupleBuf* tuple;
+
+	if (cache->nr_cached_tuplebufs)
+	{
+		cache->nr_cached_tuplebufs--;
+		tuple = ilist_container(ApplyCacheTupleBuf, node,
+		                        ilist_s_pop_front(&cache->cached_tuplebufs));
+	}
+	else
+	{
+		tuple =
+			(ApplyCacheTupleBuf*)malloc(sizeof(ApplyCacheTupleBuf));
+
+		if (!tuple)
+			elog(ERROR, "Could not allocate a ApplyCacheTupleBuf struct");
+	}
+
+	return tuple;
+}
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple)
+{
+	if(cache->nr_cached_tuplebufs < max_cached_tuplebufs){
+		cache->nr_cached_tuplebufs++;
+		ilist_s_push_front(&cache->cached_tuplebufs, &tuple->node);
+	}
+	else{
+		free(tuple);
+	}
+}
+
+
+static
+ApplyCacheTXN*
+ApplyCacheTXNByXid(ApplyCache* cache, TransactionId xid, bool create, bool* is_new)
+{
+	ApplyCacheTXNByIdEnt* ent;
+	bool found;
+
+	/* FIXME: add one entry fast-path cache */
+
+	ent = (ApplyCacheTXNByIdEnt*)
+		hash_search(cache->by_txn,
+		            (void *)&xid,
+		            (create ? HASH_ENTER : HASH_FIND),
+		            &found);
+
+	if (found)
+	{
+#ifdef VERBOSE_DEBUG
+		elog(LOG, "found cache entry for %u at %p", xid, ent);
+#endif
+	}
+	else
+	{
+#ifdef VERBOSE_DEBUG
+		elog(LOG, "didn't find cache entry for %u in %p at %p, creating %u",
+		     xid, cache, ent, create);
+#endif
+	}
+
+	if (!found && !create)
+		return NULL;
+
+	if (!found)
+	{
+		ent->txn = ApplyCacheGetTXN(cache);
+		ent->txn->xid = xid;
+	}
+
+	if (is_new)
+		*is_new = !found;
+
+	return ent->txn;
+}
+
+void
+ApplyCacheAddChange(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn,
+                    ApplyCacheChange* change)
+{
+	ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, true, NULL);
+	txn->lsn = lsn;
+	ilist_d_push_back(&txn->changes, &change->node);
+}
+
+
+void
+ApplyCacheCommitChild(ApplyCache* cache, TransactionId xid,
+                      TransactionId subxid, XLogRecPtr lsn)
+{
+	ApplyCacheTXN* txn;
+	ApplyCacheTXN* subtxn;
+
+	subtxn = ApplyCacheTXNByXid(cache, subxid, false, NULL);
+
+	/*
+	 * No need to do anything if that subtxn didn't contain any changes
+	 */
+	if (!subtxn)
+		return;
+
+	subtxn->lsn = lsn;
+
+	txn = ApplyCacheTXNByXid(cache, xid, true, NULL);
+
+	ilist_d_push_back(&txn->subtxns, &subtxn->node);
+}
+
+typedef struct ApplyCacheIterTXNState
+{
+	simpleheap *heap;
+} ApplyCacheIterTXNState;
+
+static int
+ApplyCacheIterCompare(simpleheap_kv* a, simpleheap_kv* b)
+{
+	ApplyCacheChange *change_a = ilist_container(ApplyCacheChange, node, a->key);
+	ApplyCacheChange *change_b = ilist_container(ApplyCacheChange, node, b->key);
+
+	if (change_a->lsn < change_b->lsn)
+		return -1;
+
+	else if (change_a->lsn == change_b->lsn)
+		return 0;
+
+	return 1;
+}
+
+static ApplyCacheIterTXNState*
+ApplyCacheIterTXNInit(ApplyCache* cache, ApplyCacheTXN* txn);
+
+static ApplyCacheChange*
+ApplyCacheIterTXNNext(ApplyCache* cache, ApplyCacheIterTXNState* state);
+
+static void
+ApplyCacheIterTXNFinish(ApplyCache* cache, ApplyCacheIterTXNState* state);
+
+
+
+static ApplyCacheIterTXNState*
+ApplyCacheIterTXNInit(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+	size_t nr_txns = 0; /* main txn */
+	ApplyCacheIterTXNState *state;
+	ilist_d_node* cur_txn_i;
+	ApplyCacheTXN *cur_txn;
+	ApplyCacheChange *cur_change;
+
+	if (!ilist_d_is_empty(&txn->changes))
+		nr_txns++;
+
+	/* count how large our heap must be */
+	ilist_d_foreach(cur_txn_i, &txn->subtxns)
+	{
+		cur_txn = ilist_container(ApplyCacheTXN, node, cur_txn_i);
+
+		if (!ilist_d_is_empty(&cur_txn->changes))
+			nr_txns++;
+	}
+
+	/* allocate array for our heap */
+	state = palloc0(sizeof(ApplyCacheIterTXNState));
+
+	state->heap = simpleheap_allocate(nr_txns);
+	state->heap->compare = ApplyCacheIterCompare;
+
+	/* fill array with elements, heap condition not yet fulfilled */
+	if (!ilist_d_is_empty(&txn->changes))
+	{
+		cur_change = ilist_d_front_unchecked(ApplyCacheChange, node, &txn->changes);
+
+		simpleheap_add_unordered(state->heap, &cur_change->node, txn);
+	}
+
+	ilist_d_foreach(cur_txn_i, &txn->subtxns)
+	{
+		cur_txn = ilist_container(ApplyCacheTXN, node, cur_txn_i);
+
+		if (!ilist_d_is_empty(&cur_txn->changes))
+		{
+			cur_change = ilist_d_front_unchecked(ApplyCacheChange, node, &cur_txn->changes);
+
+			/* the value has to be the subtxn owning this change, not txn */
+			simpleheap_add_unordered(state->heap, &cur_change->node, cur_txn);
+		}
+	}
+
+	/* make the array fulfill the heap property */
+	simpleheap_build(state->heap);
+	return state;
+}
+
+static ApplyCacheChange*
+ApplyCacheIterTXNNext(ApplyCache* cache, ApplyCacheIterTXNState* state)
+{
+	ApplyCacheTXN *txn = NULL;
+	ApplyCacheChange *change;
+	simpleheap_kv *kv;
+
+	/*
+	 * Do a k-way merge between transactions/subtransactions to extract
+	 * changes ordered by the lsn of their change. For that we model the
+	 * current heads of the different transactions as a binary heap so we
+	 * easily know which (sub-)transaction has the change with the smallest
+	 * lsn next.
+	 */
+
+	/* nothing there anymore */
+	if (state->heap->size == 0)
+		return NULL;
+
+	kv = simpleheap_first(state->heap);
+
+	change = ilist_container(ApplyCacheChange, node, kv->key);
+
+	txn = (ApplyCacheTXN*)kv->value;
+
+	if (!ilist_d_has_next(&txn->changes, &change->node))
+	{
+		simpleheap_remove_first(state->heap);
+	}
+	else
+	{
+		simpleheap_change_key(state->heap, change->node.next);
+	}
+	return change;
+}
+
+static void
+ApplyCacheIterTXNFinish(ApplyCache* cache, ApplyCacheIterTXNState* state)
+{
+	simpleheap_free(state->heap);
+	pfree(state);
+}
+
+
+static void
+ApplyCacheCleanupTXN(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+	bool found;
+	ilist_d_node* cur_change, *next_change;
+	ilist_d_node* cur_txn, *next_txn;
+
+	/* cleanup transactions & changes */
+	ilist_d_foreach_modify (cur_txn, next_txn, &txn->subtxns)
+	{
+		ApplyCacheTXN* subtxn = ilist_container(ApplyCacheTXN, node, cur_txn);
+
+		ilist_d_foreach_modify (cur_change, next_change, &subtxn->changes)
+		{
+			ApplyCacheChange* change =
+				ilist_container(ApplyCacheChange, node, cur_change);
+
+			ApplyCacheReturnChange(cache, change);
+		}
+		ApplyCacheReturnTXN(cache, subtxn);
+	}
+
+	ilist_d_foreach_modify (cur_change, next_change, &txn->changes)
+	{
+		ApplyCacheChange* change =
+			ilist_container(ApplyCacheChange, node, cur_change);
+
+		ApplyCacheReturnChange(cache, change);
+	}
+
+	/* now remove reference from cache */
+	hash_search(cache->by_txn,
+	            (void *)&txn->xid,
+	            HASH_REMOVE,
+	            &found);
+	Assert(found);
+
+	ApplyCacheReturnTXN(cache, txn);
+}
+
+void
+ApplyCacheCommit(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+	ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false, NULL);
+	ApplyCacheIterTXNState* iterstate;
+	ApplyCacheChange* change;
+	CommandId command_id;
+	Snapshot snapshot_mvcc = NULL;
+
+	if (!txn)
+		return;
+
+	txn->lsn = lsn;
+
+	cache->begin(cache, txn);
+
+	PG_TRY();
+	{
+		iterstate = ApplyCacheIterTXNInit(cache, txn);
+		while((change = ApplyCacheIterTXNNext(cache, iterstate)))
+		{
+			switch(change->action){
+				case APPLY_CACHE_CHANGE_INSERT:
+				case APPLY_CACHE_CHANGE_UPDATE:
+				case APPLY_CACHE_CHANGE_DELETE:
+					Assert(snapshot_mvcc != NULL);
+					cache->apply_change(cache, txn, txn /*FIXME*/, change);
+					break;
+				case APPLY_CACHE_CHANGE_SNAPSHOT:
+					/*
+					 * the first snapshot seen in a transaction is its mvcc
+					 * snapshot
+					 */
+					if (!snapshot_mvcc)
+						snapshot_mvcc = change->snapshot;
+					SetupDecodingSnapshots(change->snapshot);
+					break;
+				case APPLY_CACHE_CHANGE_COMMAND_ID:
+					/* FIXME */
+					command_id = change->command_id;
+					break;
+			}
+		}
+
+		ApplyCacheIterTXNFinish(cache, iterstate);
+
+		cache->commit(cache, txn);
+
+		ApplyCacheCleanupTXN(cache, txn);
+		RevertFromDecodingSnapshots();
+	}
+	PG_CATCH();
+	{
+		RevertFromDecodingSnapshots();
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+void
+ApplyCacheAbort(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+	ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false, NULL);
+
+	/* no changes in this transaction */
+	if (!txn)
+		return;
+
+	ApplyCacheCleanupTXN(cache, txn);
+}
+
+bool
+ApplyCacheIsXidKnown(ApplyCache* cache, TransactionId xid)
+{
+	bool is_new;
+	/* FIXME: for efficiency reasons we create the xid here, that doesn't seem
+	 * like a good idea though */
+	ApplyCacheTXNByXid(cache, xid, true, &is_new);
+
+	/* the xid is known iff an entry already existed */
+	return !is_new;
+}
+
+void
+ApplyCacheAddBaseSnapshot(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn, Snapshot snap)
+{
+	ApplyCacheChange *change = ApplyCacheGetChange(cache);
+	change->snapshot = snap;
+	change->action = APPLY_CACHE_CHANGE_SNAPSHOT;
+
+	ApplyCacheAddChange(cache, xid, lsn, change);
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
new file mode 100644
index 0000000..244dd7b
--- /dev/null
+++ b/src/backend/replication/logical/decode.c
@@ -0,0 +1,366 @@
+/*-------------------------------------------------------------------------
+ *
+ * decode.c
+ *
+ * Decodes wal records from an xlogreader.h callback into an applycache.
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * NOTE:
+ *
+ *    It's possible that the separation between decode.c and snapbuild.c is a
+ *    bit too strict; in the end they have just about the same struct.
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/logical/decode.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xact.h"
+#include "access/heapam_xlog.h"
+
+#include "catalog/pg_control.h"
+
+#include "replication/applycache.h"
+#include "replication/decode.h"
+#include "replication/snapbuild.h"
+
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+
+static void DecodeXLogTuple(char* data, Size len, ApplyCacheTupleBuf* tuple);
+
+static void DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf);
+static void DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+	                     TransactionId *sub_xids, int nsubxacts);
+
+
+void DecodeRecordIntoApplyCache(ReaderApplyState *state, XLogRecordBuffer* buf)
+{
+	XLogRecord* r = &buf->record;
+	uint8 info = r->xl_info & ~XLR_INFO_MASK;
+	ApplyCache *cache = state->apply_cache;
+	SnapBuildAction action;
+
+	/*
+	 * FIXME: The existence of the snapshot builder is pretty obvious to the
+	 * outside right now, that doesn't seem to be very good...
+	 */
+	if(!state->snapstate)
+	{
+		state->snapstate = AllocateSnapshotBuilder(cache);
+	}
+
+	/*
+	 * Call the snapshot builder. It needs to be called before we analyze
+	 * tuples for two reasons:
+	 *
+	 * * Only the snapshot building logic knows whether we have enough
+	 *   information to decode a particular tuple
+	 *
+	 * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+	 *   added to the ApplyCache before we add tuples using them
+	 */
+	action = SnapBuildCallback(cache, state->snapstate, buf);
+
+	if (action == SNAPBUILD_SKIP)
+		return;
+
+	switch (r->xl_rmid)
+	{
+		case RM_HEAP_ID:
+		{
+			info &= XLOG_HEAP_OPMASK;
+			switch (info)
+			{
+				case XLOG_HEAP_INSERT:
+					DecodeInsert(cache, buf);
+					break;
+
+				/* no guarantee that we see a HOT update again, so handle it as a normal update */
+				case XLOG_HEAP_HOT_UPDATE:
+				case XLOG_HEAP_UPDATE:
+					DecodeUpdate(cache, buf);
+					break;
+
+				case XLOG_HEAP_NEWPAGE:
+					DecodeNewpage(cache, buf);
+					break;
+
+				case XLOG_HEAP_DELETE:
+					DecodeDelete(cache, buf);
+					break;
+				default:
+					break;
+			}
+			break;
+		}
+		case RM_HEAP2_ID:
+		{
+			info &= XLOG_HEAP_OPMASK;
+			switch (info)
+			{
+				case XLOG_HEAP2_MULTI_INSERT:
+					DecodeMultiInsert(cache, buf);
+					break;
+				default:
+					/* everything else here is just physical stuff we're not interested in */
+					break;
+			}
+			break;
+		}
+
+		case RM_XACT_ID:
+		{
+			switch (info)
+			{
+				case XLOG_XACT_COMMIT:
+				{
+					TransactionId *sub_xids;
+					xl_xact_commit *xlrec = (xl_xact_commit*)buf->record_data;
+
+					/* FIXME: this is not really allowed if there are no subtransactions */
+					sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+					DecodeCommit(cache, buf, r->xl_xid, sub_xids, xlrec->nsubxacts);
+
+					break;
+				}
+				case XLOG_XACT_COMMIT_PREPARED:
+				{
+					TransactionId *sub_xids;
+					xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared*)buf->record_data;
+
+					sub_xids = (TransactionId *) &(xlrec->crec.xnodes[xlrec->crec.nrels]);
+
+					DecodeCommit(cache, buf, xlrec->xid, sub_xids,
+					             xlrec->crec.nsubxacts);
+
+					break;
+				}
+				case XLOG_XACT_COMMIT_COMPACT:
+				{
+					xl_xact_commit_compact *xlrec = (xl_xact_commit_compact*)buf->record_data;
+					DecodeCommit(cache, buf, r->xl_xid, xlrec->subxacts,
+					             xlrec->nsubxacts);
+					break;
+				}
+				case XLOG_XACT_ABORT:
+				{
+					TransactionId *sub_xids;
+					xl_xact_abort *xlrec = (xl_xact_abort*)buf->record_data;
+					int i;
+
+					/* FIXME: this is not really allowed if there are no subtransactions */
+					sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+
+					for (i = 0; i < xlrec->nsubxacts; i++)
+					{
+						ApplyCacheAbort(cache, *sub_xids, buf->origptr);
+						sub_xids += 1;
+					}
+
+					/* TODO: check that this also contains not-yet-aborted subtxns */
+					ApplyCacheAbort(cache, r->xl_xid, buf->origptr);
+
+					elog(WARNING, "ABORT %u", r->xl_xid);
+					break;
+				}
+				case XLOG_XACT_ABORT_PREPARED:
+				{
+					/* the record layout differs from a plain abort, use the correct struct */
+					TransactionId *sub_xids;
+					xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared*)buf->record_data;
+					int i;
+
+					sub_xids = (TransactionId *) &(xlrec->arec.xnodes[xlrec->arec.nrels]);
+
+					for (i = 0; i < xlrec->arec.nsubxacts; i++)
+					{
+						ApplyCacheAbort(cache, *sub_xids, buf->origptr);
+						sub_xids += 1;
+					}
+
+					ApplyCacheAbort(cache, xlrec->xid, buf->origptr);
+
+					elog(WARNING, "ABORT PREPARED %u", xlrec->xid);
+					break;
+				}
+				case XLOG_XACT_ASSIGNMENT:
+					/*
+					 * XXX: We could reassign transactions to the parent here
+					 * to save space and effort when merging transactions at
+					 * commit.
+					 */
+					break;
+				case XLOG_XACT_PREPARE:
+					/*
+					 * FIXME: we should replay the transaction and prepare it
+					 * as well.
+					 */
+					break;
+				default:
+					break;
+			}
+			break;
+		}
+		case RM_XLOG_ID:
+		{
+			switch (info)
+			{
+				/* this is also used in END_OF_RECOVERY checkpoints */
+				case XLOG_CHECKPOINT_SHUTDOWN:
+					/*
+					 * Abort all transactions that are still marked as in
+					 * progress; they cannot be in progress anymore. Do not
+					 * abort prepared transactions that are waiting to be
+					 * committed though.
+					 * FIXME: implement.
+					 */
+					break;
+			}
+			break;
+		}
+		default:
+			break;
+	}
+}
+
+static void
+DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+             TransactionId *sub_xids, int nsubxacts)
+{
+	int i;
+
+	for (i = 0; i < nsubxacts; i++)
+	{
+		ApplyCacheCommitChild(cache, xid, *sub_xids, buf->origptr);
+		sub_xids++;
+	}
+
+	/* replay the actions of the transaction + its subtransactions in order */
+	ApplyCacheCommit(cache, xid, buf->origptr);
+}
+
+static void
+DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+	XLogRecord* r = &buf->record;
+	xl_heap_insert *xlrec = (xl_heap_insert *) buf->record_data;
+
+	ApplyCacheChange* change;
+
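+	/*
+	 * On wal_level = logical the tuple data is supposed to be logged even if
+	 * the buffer was written out as a full-page image, so its absence
+	 * indicates a bug.
+	 */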
+	if ((r->xl_info & XLR_BKP_BLOCK_1)
+	    && r->xl_len < (SizeOfHeapInsert + SizeOfHeapHeader))
+	{
+		elog(FATAL, "huh, no tuple data on wal_level = logical?");
+	}
+
+	change = ApplyCacheGetChange(cache);
+	change->action = APPLY_CACHE_CHANGE_INSERT;
+
+	memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+	change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+	DecodeXLogTuple((char*)xlrec + SizeOfHeapInsert,
+	                r->xl_len - SizeOfHeapInsert,
+	                change->newtuple);
+
+	ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+	XLogRecord* r = &buf->record;
+	xl_heap_update *xlrec = (xl_heap_update *) buf->record_data;
+
+	ApplyCacheChange* change;
+
+	if ((r->xl_info & XLR_BKP_BLOCK_1 || r->xl_info & XLR_BKP_BLOCK_2) &&
+	    (r->xl_len < (SizeOfHeapUpdate + SizeOfHeapHeader)))
+	{
+		elog(FATAL, "huh, no tuple data on wal_level = logical?");
+	}
+
+	change = ApplyCacheGetChange(cache);
+	change->action = APPLY_CACHE_CHANGE_UPDATE;
+
+	memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+	/* FIXME: need to save the old tuple as well if we want primary key changes to work. */
+	change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+	DecodeXLogTuple((char*)xlrec + SizeOfHeapUpdate,
+	                r->xl_len - SizeOfHeapUpdate,
+	                change->newtuple);
+
+	ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+	XLogRecord* r = &buf->record;
+
+	xl_heap_delete *xlrec = (xl_heap_delete *) buf->record_data;
+
+	ApplyCacheChange* change;
+
+	change = ApplyCacheGetChange(cache);
+	change->action = APPLY_CACHE_CHANGE_DELETE;
+
+	memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+	if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
+	{
+		elog(FATAL, "huh, no primary key for a delete on wal_level = logical?");
+	}
+
+	change->oldtuple = ApplyCacheGetTupleBuf(cache);
+
+	DecodeXLogTuple((char*)xlrec + SizeOfHeapDelete,
+	                r->xl_len - SizeOfHeapDelete,
+	                change->oldtuple);
+
+	ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+
+static void
+DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+	elog(WARNING, "skipping XLOG_HEAP_NEWPAGE record because we are too dumb");
+}
+
+static void
+DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+	elog(WARNING, "skipping XLOG_HEAP2_MULTI_INSERT record because we are too dumb");
+}
+
+
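+/*
+ * Reassemble a HeapTuple from the data portion of a heap xlog record: copy
+ * the xl_heap_header into a zeroed tuple header and append the tuple data
+ * after it, mirroring what the heap redo routines do when restoring a tuple.
+ */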
+static void
+DecodeXLogTuple(char* data, Size len, ApplyCacheTupleBuf* tuple)
+{
+	xl_heap_header xlhdr;
+	int datalen = len - SizeOfHeapHeader;
+
+	Assert(datalen >= 0);
+	Assert(datalen <= MaxHeapTupleSize);
+
+	tuple->tuple.t_len = datalen + offsetof(HeapTupleHeaderData, t_bits);
+
+	/* not a disk based tuple */
+	ItemPointerSetInvalid(&tuple->tuple.t_self);
+
+	tuple->tuple.t_tableOid = InvalidOid;
+	tuple->tuple.t_data = &tuple->header;
+
+	/* data is not stored aligned */
+	memcpy((char *) &xlhdr,
+	       data,
+	       SizeOfHeapHeader);
+
+	memset(&tuple->header, 0, sizeof(HeapTupleHeaderData));
+
+	memcpy((char *) &tuple->header + offsetof(HeapTupleHeaderData, t_bits),
+	       data + SizeOfHeapHeader,
+	       datalen);
+
+	tuple->header.t_infomask = xlhdr.t_infomask;
+	tuple->header.t_infomask2 = xlhdr.t_infomask2;
+	tuple->header.t_hoff = xlhdr.t_hoff;
+}
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
new file mode 100644
index 0000000..035c48a
--- /dev/null
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -0,0 +1,237 @@
+/*-------------------------------------------------------------------------
+ *
+ * logicalfuncs.c
+ *
+ *     Support functions for using xlog decoding
+ *
+ * NOTE:
+ *     Nothing in here should be used for anything but debugging!
+ *
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/logical/logicalfuncs.c
+ *
+ */
+
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+
+#include "replication/applycache.h"
+#include "replication/decode.h"
+#include "replication/walreceiver.h"
+/*FIXME: XLogRead*/
+#include "replication/walsender_private.h"
+
+#include "utils/inval.h"
+#include "utils/lsyscache.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+
+
+/* We don't need no header */
+extern Datum
+decode_xlog(PG_FUNCTION_ARGS);
+
+
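+/*
+ * Trivial xlogreader callbacks: for this debugging function every record is
+ * interesting and nothing gets written out.
+ */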
+static bool
+replay_record_is_interesting(XLogReaderState* state, XLogRecord* r)
+{
+	return true;
+}
+
+static void
+replay_writeout_data(XLogReaderState* state, char* data, Size len)
+{
+	return;
+}
+
+static void
+replay_finished_record(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+	ReaderApplyState* apply_state = state->private_data;
+	DecodeRecordIntoApplyCache(apply_state, buf);
+}
+
+static void
+replay_read_page(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+	XLogPageHeader page_header;
+
+	Assert((startptr % XLOG_BLCKSZ) == 0);
+
+	/* FIXME: more sensible/efficient implementation */
+	XLogRead(cur_page, startptr, XLOG_BLCKSZ);
+
+	page_header = (XLogPageHeader)cur_page;
+
+	if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+	{
+		elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+		     XLOG_PAGE_MAGIC, (uint32)(startptr >> 32), (uint32)startptr);
+	}
+}
+
+static void
+decode_begin_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+	elog(WARNING, "BEGIN");
+}
+
+static void
+decode_commit_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+	elog(WARNING, "COMMIT");
+}
+
+/* don't want to include that header */
+extern HeapTuple
+LookupTableByRelFileNode(RelFileNode* r);
+
+
+/* This is just for demonstration, don't ever use this code for anything real! */
+static void
+decode_change(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change)
+{
+	InvalidateSystemCaches();
+
+	if (change->action == APPLY_CACHE_CHANGE_INSERT)
+	{
+		StringInfoData s;
+		HeapTuple table = LookupTableByRelFileNode(&change->relnode);
+		Form_pg_class class_form;
+		HeapTuple	typeTuple;
+		Form_pg_type pt;
+		TupleDesc	tupdesc;
+		int			i;
+
+		if (!table)
+		{
+			elog(LOG, "couldn't lookup %u", change->relnode.relNode);
+			return;
+		}
+
+		class_form = (Form_pg_class) GETSTRUCT(table);
+
+		initStringInfo(&s);
+
+		tupdesc = lookup_rowtype_tupdesc(class_form->reltype, -1);
+
+		for (i = 0; i < tupdesc->natts; i++)
+		{
+			Oid			typid, typoutput;
+			bool		typisvarlena;
+			Datum		origval, val;
+			char        *outputstr;
+			bool        isnull;
+			if (tupdesc->attrs[i]->attisdropped)
+				continue;
+			if (tupdesc->attrs[i]->attnum < 0)
+				continue;
+
+			typid = tupdesc->attrs[i]->atttypid;
+
+			typeTuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+			if (!HeapTupleIsValid(typeTuple))
+				elog(ERROR, "cache lookup failed for type %u", typid);
+			pt = (Form_pg_type) GETSTRUCT(typeTuple);
+
+			appendStringInfo(&s, " %s[%s]",
+			                 NameStr(tupdesc->attrs[i]->attname),
+			                 NameStr(pt->typname));
+
+			getTypeOutputInfo(typid,
+			                  &typoutput, &typisvarlena);
+
+			ReleaseSysCache(typeTuple);
+
+			origval = heap_getattr(&change->newtuple->tuple, i + 1, tupdesc, &isnull);
+
+			if (typisvarlena && !isnull)
+				val = PointerGetDatum(PG_DETOAST_DATUM(origval));
+			else
+				val = origval;
+
+			if (isnull)
+				appendStringInfoString(&s, ":(null)");
+			else
+			{
+				outputstr = OidOutputFunctionCall(typoutput, val);
+				appendStringInfo(&s, ":%s", outputstr);
+			}
+		}
+		ReleaseTupleDesc(tupdesc);
+
+		elog(WARNING, "tuple is:%s", s.data);
+	}
+}
+
+/* test the xlog decoding infrastructure from lsn, to lsn */
+Datum
+decode_xlog(PG_FUNCTION_ARGS)
+{
+	char* start = PG_GETARG_CSTRING(0);
+	char* end = PG_GETARG_CSTRING(1);
+
+	ApplyCache *apply_cache;
+	XLogReaderState *xlogreader_state = XLogReaderAllocate();
+	ReaderApplyState *apply_state;
+
+	XLogRecPtr startpoint;
+	XLogRecPtr endpoint;
+
+	uint32		hi,
+				lo;
+
+	if (sscanf(start, "%X/%X",
+	           &hi, &lo) != 2)
+		elog(ERROR, "unparseable xlog pos");
+	startpoint = ((uint64) hi) << 32 | lo;
+
+	elog(LOG, "starting to parse at %X/%X", hi, lo);
+
+	if (sscanf(end, "%X/%X",
+	           &hi, &lo) != 2)
+		elog(ERROR, "unparseable xlog pos");
+	endpoint = ((uint64) hi) << 32 | lo;
+
+	elog(LOG, "end parse at %X/%X", hi, lo);
+
+	xlogreader_state->is_record_interesting = replay_record_is_interesting;
+	xlogreader_state->finished_record = replay_finished_record;
+	xlogreader_state->writeout_data = replay_writeout_data;
+	xlogreader_state->read_page = replay_read_page;
+	xlogreader_state->private_data = calloc(1, sizeof(ReaderApplyState));
+
+
+	if (!xlogreader_state->private_data)
+		elog(ERROR, "Could not allocate the ReaderApplyState struct");
+
+	xlogreader_state->startptr = startpoint;
+	xlogreader_state->curptr = startpoint;
+	xlogreader_state->endptr = endpoint;
+
+	apply_state = (ReaderApplyState*)xlogreader_state->private_data;
+
+	/*
+	 * allocate an ApplyCache that will apply data using low-level calls
+	 * without type conversion et al. This requires binary compatibility
+	 * between both systems.
+	 * XXX: This would be the place to hook in different apply methods, like
+	 * producing SQL and applying it.
+	 */
+	apply_cache = ApplyCacheAllocate();
+	apply_cache->begin = decode_begin_txn;
+	apply_cache->apply_change = decode_change;
+	apply_cache->commit = decode_commit_txn;
+
+	apply_state->apply_cache = apply_cache;
+
+	XLogReaderRead(xlogreader_state);
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
new file mode 100644
index 0000000..05b176d
--- /dev/null
+++ b/src/backend/replication/logical/snapbuild.c
@@ -0,0 +1,1045 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.c
+ *
+ *     Support for building timetravel snapshots based on the contents of the
+ *     wal
+ *
+ * NOTE:
+ *     This is complex, in-progress and underdocumented.
+ *
+ *     We build snapshots which can *only* be used to read catalog contents by
+ *     reading the wal stream. The aim is to provide mvcc and SnapshotNow
+ *     snapshots that behave the same as their respective counterparts would
+ *     have at the time the XLogRecord was generated. This is done to provide a
+ *     reliable environment for decoding those records into every format that
+ *     pleases the user of an ApplyCache.
+ *
+ *     The percentage of transactions modifying the catalog should be fairly
+ *     small, so instead of keeping track of all running transactions and
+ *     treating everything inside (xmin, xmax) that's not running as committed
+ *     we do the contrary. That, and other implementation details, necessitates
+ *     using our own ->satisfies visibility routine.
+ *     In contrast to a classic SnapshotNow, which doesn't need any data, this
+ *     module provides something that *behaves* like a SnapshotNow would have
+ *     back then (minus some races). Minus some minor things, a SnapshotNow
+ *     behaves like an MVCC snapshot taken exactly at the moment the
+ *     SnapshotNow was used. Because of that we simply model our
+ *     timetravel-SnapshotNows as mvcc Snapshots.
+ *
+ *     To replace the normal handling of SnapshotNow snapshots use the
+ *     SetupDecodingSnapshots/RevertFromDecodingSnapshots functions. Be careful
+ *     to handle errors properly, otherwise the rest of the session will have
+ *     very strange behaviour.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/logical/snapbuild.c
+ *
+ */
+
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/rmgr.h"
+#include "access/transam.h"
+#include "access/xlogreader.h"
+#include "access/xact.h"
+
+#include "catalog/catalog.h"
+#include "catalog/pg_control.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
+
+#include "miscadmin.h"
+
+#include "replication/applycache.h"
+#include "replication/snapbuild.h"
+
+#include "utils/builtins.h"
+#include "utils/catcache.h"
+#include "utils/inval.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/snapshot.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+#include "storage/standby.h"
+
+typedef struct SnapstateTxnEnt
+{
+	TransactionId xid;
+	bool does_timetravel;
+} SnapstateTxnEnt;
+
+
+static bool
+SnapBuildHasCatalogChanges(Snapstate* snapstate, TransactionId xid, RelFileNode* relfilenode);
+
+/* transaction state manipulation functions */
+static void
+SnapBuildEndTxn(Snapstate* snapstate, TransactionId xid);
+
+static void
+SnapBuildAbortTxn(Snapstate* state, TransactionId xid, int nsubxacts,
+                  TransactionId* subxacts);
+
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+                   TransactionId* subxacts);
+
+/* ->running manipulation */
+static bool
+SnapBuildTxnRunning(Snapstate* snapstate, TransactionId xid);
+
+static void
+SnapBuildReserveRunning(Snapstate *snapstate, Size count);
+
+static void
+SnapBuildSortRunning(Snapstate *snapstate);
+
+static void
+SnapBuildAddRunningTxn(Snapstate *snapstate, TransactionId xid);
+
+
+/* ->committed manipulation */
+static void
+SnapBuildPurgeCommittedTxn(Snapstate* snapstate);
+
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+                   TransactionId* subxacts);
+
+
+/* snapshot building/manipulation/distribution functions */
+static void
+SnapBuildDistributeSnapshotNow(Snapstate* snapstate, TransactionId xid);
+
+static Snapshot
+SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid);
+
+
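+/*
+ * Look up the pg_class entry of a relation via its (tablespace, relfilenode)
+ * pair, using the RELFILENODE syscache.
+ */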
+HeapTuple
+LookupTableByRelFileNode(RelFileNode* relfilenode)
+{
+	Oid spc;
+
+	InvalidateSystemCaches();
+
+	/*
+	 * relations in the default tablespace are stored with a reltablespace = 0
+	 * for some reason.
+	 */
+	spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+		0 : relfilenode->spcNode;
+
+	return SearchSysCacheCopy2(RELFILENODE,
+	                           spc,
+	                           relfilenode->relNode);
+}
+
+Snapstate*
+AllocateSnapshotBuilder(ApplyCache *applycache)
+{
+	Snapstate *snapstate = malloc(sizeof(Snapstate));
+	HASHCTL hash_ctl;
+
+	if (!snapstate)
+		elog(ERROR, "could not allocate memory for snapstate");
+
+	snapstate->state = SNAPBUILD_START;
+	snapstate->valid_after = InvalidTransactionId;
+
+	snapstate->nrrunning = 0;
+	snapstate->nrrunning_initial = 0;
+	snapstate->nrrunning_space = 0;
+	snapstate->running = NULL;
+
+	snapstate->nrcommitted = 0;
+	snapstate->nrcommitted_space = 128;
+	snapstate->committed = malloc(snapstate->nrcommitted_space * sizeof(TransactionId));
+	if (!snapstate->committed)
+		elog(ERROR, "could not allocate memory for snapstate->committed");
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+	hash_ctl.keysize = sizeof(TransactionId);
+	hash_ctl.entrysize = sizeof(SnapstateTxnEnt);
+	hash_ctl.hash = tag_hash;
+	hash_ctl.hcxt = TopMemoryContext;
+
+	snapstate->by_txn = hash_create("SnapstateByXid", 1000, &hash_ctl,
+	                            HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+	elog(LOG, "allocating snapshotbuilder");
+	return snapstate;
+}
+
+void
+FreeSnapshotBuilder(Snapstate* snapstate)
+{
+	hash_destroy(snapstate->by_txn);
+	free(snapstate);
+}
+
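+/*
+ * Central entry point of the snapshot builder: it is handed every record
+ * ahead of decoding, advances the SnapBuildState machine and decides whether
+ * the record may be decoded (SNAPBUILD_DECODE) or has to be skipped
+ * (SNAPBUILD_SKIP) because not enough information has been assembled yet.
+ */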
+SnapBuildAction
+SnapBuildCallback(ApplyCache *applycache, Snapstate* snapstate, XLogRecordBuffer* buf)
+{
+	XLogRecord* r = &buf->record;
+	uint8 info = r->xl_info & ~XLR_INFO_MASK;
+	TransactionId xid = buf->record.xl_xid;
+
+	/* relfilenode of the table the changes happened in */
+	bool found_changes = false;
+
+	RelFileNode *relfilenode = NULL;
+	SnapBuildAction ret = SNAPBUILD_SKIP;
+
+	{
+		StringInfoData s;
+
+		initStringInfo(&s);
+		RmgrTable[r->xl_rmid].rm_desc(&s,
+		                              r->xl_info,
+		                              buf->record_data);
+
+		/* don't bother emitting empty description */
+		if (s.len > 0)
+			elog(LOG,"xlog redo %u: %s", xid, s.data);
+	}
+
+	if (snapstate->state <= SNAPBUILD_FULL_SNAPSHOT)
+	{
+		if (r->xl_rmid == RM_STANDBY_ID &&
+		   info == XLOG_RUNNING_XACTS)
+		{
+			xl_running_xacts *running = (xl_running_xacts*)buf->record_data;
+
+			if (!running->subxid_overflow)
+			{
+				snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+
+				snapstate->xmin = running->oldestRunningXid;
+				TransactionIdRetreat(snapstate->xmin);
+				snapstate->xmax = running->latestCompletedXid;
+
+				snapstate->nrrunning = running->xcnt;
+				snapstate->nrrunning_initial = running->xcnt;
+				snapstate->nrrunning_space = running->xcnt;
+
+				SnapBuildReserveRunning(snapstate, snapstate->nrrunning_space);
+
+				memcpy(snapstate->running, running->xids,
+				       snapstate->nrrunning_initial * sizeof(TransactionId));
+
+				/* sort so we can do a binary search */
+				SnapBuildSortRunning(snapstate);
+
+				if (running->xcnt)
+				{
+					snapstate->xmin_running = snapstate->running[0];
+					snapstate->xmax_running = snapstate->running[running->xcnt - 1];
+				}
+				else
+				{
+					snapstate->xmin_running = InvalidTransactionId;
+					snapstate->xmax_running = InvalidTransactionId;
+					/* FIXME: abort everything considered running */
+					snapstate->state = SNAPBUILD_CONSISTENT;
+				}
+				elog(LOG, "built initial snapshot (via running xacts). Done: %i",
+				     snapstate->state == SNAPBUILD_CONSISTENT);
+			}
+			else if (TransactionIdIsValid(snapstate->valid_after))
+			{
+				if (NormalTransactionIdPrecedes(snapstate->valid_after, running->oldestRunningXid))
+				{
+					snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+					snapstate->xmin_running = InvalidTransactionId;
+					snapstate->xmax_running = InvalidTransactionId;
+					/* FIXME: copy all transactions we have seen starting to ->running */
+				}
+			}
+			else
+			{
+				snapstate->state = SNAPBUILD_INITIAL_POINT;
+
+				snapstate->valid_after = running->nextXid;
+				elog(INFO, "starting to build snapshot, valid_after xid: %u",
+				     snapstate->valid_after);
+			}
+		}
+		/* we know nothing has been in progress at this point... */
+		else if (r->xl_rmid == RM_XLOG_ID &&
+		        info == XLOG_CHECKPOINT_SHUTDOWN)
+		{
+			CheckPoint* checkpoint = (CheckPoint*)buf->record_data;
+
+			snapstate->xmin = checkpoint->nextXid;
+			snapstate->xmax = checkpoint->nextXid;
+
+			snapstate->nrrunning = 0;
+			snapstate->nrrunning_initial = 0;
+			snapstate->nrrunning_space = 0;
+			free(snapstate->running);
+			snapstate->running = NULL;
+
+			snapstate->state = SNAPBUILD_CONSISTENT;
+
+			elog(LOG, "built initial snapshot (via shutdown)!!!!");
+			/*FIXME: cleanup state */
+		}
+		else if (r->xl_rmid == RM_XLOG_ID &&
+		        info == XLOG_CHECKPOINT_ONLINE)
+		{
+			/* FIXME: Check whether there is a valid state dumped to disk */
+		}
+	}
+
+	if (snapstate->state == SNAPBUILD_START)
+		return SNAPBUILD_SKIP;
+
+	switch (r->xl_rmid)
+	{
+		case RM_XLOG_ID:
+		{
+			switch (info)
+			{
+				case XLOG_CHECKPOINT_SHUTDOWN:
+				{
+					CheckPoint* checkpoint = (CheckPoint*)buf->record_data;
+
+					/*
+					 * we know nothing can be running anymore, normal
+					 * transaction state is sufficient
+					 */
+
+					/* no need to have any transaction state anymore */
+#ifdef NOT_YES
+					for (/*FIXME*/)
+					{
+						SnapBuildAbortTxn(snapstate, xid);
+					}
+#endif
+					snapstate->xmin = checkpoint->nextXid;
+					TransactionIdRetreat(snapstate->xmin);
+					snapstate->xmax = checkpoint->nextXid;
+
+					free(snapstate->running);
+					snapstate->running = NULL;
+					snapstate->nrrunning = 0;
+					snapstate->nrrunning_initial = 0;
+					snapstate->nrrunning_space = 0;
+
+					/*FIXME: cleanup state */
+
+
+					ret = SNAPBUILD_DECODE;
+
+					break;
+				}
+				case XLOG_CHECKPOINT_ONLINE:
+				{
+					/* FIXME: dump state to disk so we can restart from here later */
+					break;
+				}
+			}
+			break;
+		}
+		case RM_STANDBY_ID:
+		{
+			switch (info)
+			{
+				case XLOG_RUNNING_XACTS:
+				{
+					xl_running_xacts *running = (xl_running_xacts*)buf->record_data;
+					snapstate->xmin = running->oldestRunningXid;
+					TransactionIdRetreat(snapstate->xmin);
+					snapstate->xmax = running->latestCompletedXid;
+					TransactionIdAdvance(snapstate->xmax);
+
+					SnapBuildPurgeCommittedTxn(snapstate);
+
+					break;
+				}
+				case XLOG_STANDBY_LOCK:
+					break;
+			}
+			break;
+		}
+		case RM_XACT_ID:
+		{
+			switch (info)
+			{
+				case XLOG_XACT_COMMIT:
+				{
+					xl_xact_commit* xlrec =
+						(xl_xact_commit*)buf->record_data;
+
+					SnapBuildCommitTxn(snapstate, xid, xlrec->nsubxacts,
+					                   (TransactionId*)&(xlrec->xnodes[xlrec->nrels]));
+					ret = SNAPBUILD_DECODE;
+
+					break;
+				}
+				case XLOG_XACT_COMMIT_COMPACT:
+				{
+					xl_xact_commit_compact* xlrec =
+						(xl_xact_commit_compact*)buf->record_data;
+
+					SnapBuildCommitTxn(snapstate, xid, xlrec->nsubxacts,
+					                   xlrec->subxacts);
+					ret = SNAPBUILD_DECODE;
+					break;
+				}
+				case XLOG_XACT_COMMIT_PREPARED:
+				{
+					xl_xact_commit_prepared* xlrec =
+						(xl_xact_commit_prepared*)buf->record_data;
+
+					SnapBuildCommitTxn(snapstate, xlrec->xid, xlrec->crec.nsubxacts,
+					                   (TransactionId*)&(xlrec->crec.xnodes[xlrec->crec.nrels]));
+					ret = SNAPBUILD_DECODE;
+					break;
+				}
+				case XLOG_XACT_ABORT:
+				{
+					xl_xact_abort* xlrec =
+						(xl_xact_abort*)buf->record_data;
+
+					SnapBuildAbortTxn(snapstate, xid, xlrec->nsubxacts,
+					                  (TransactionId*)&(xlrec->xnodes[xlrec->nrels]));
+					ret = SNAPBUILD_DECODE;
+					break;
+				}
+				case XLOG_XACT_ABORT_PREPARED:
+				{
+					xl_xact_abort_prepared* xlrec =
+						(xl_xact_abort_prepared*)buf->record_data;
+
+					SnapBuildAbortTxn(snapstate, xlrec->xid, xlrec->arec.nsubxacts,
+					                  (TransactionId*)&(xlrec->arec.xnodes[xlrec->arec.nrels]));
+					ret = SNAPBUILD_DECODE;
+					break;
+				}
+				case XLOG_XACT_ASSIGNMENT:
+				case XLOG_XACT_PREPARE: /* boring? */
+				default:
+					break;
+					;
+			}
+			break;
+		}
+		case RM_HEAP_ID:
+		{
+			switch (info & XLOG_HEAP_OPMASK)
+			{
+				/* XXX: this only happens for "irrelevant" changes? Ignore for now */
+				case XLOG_HEAP_INPLACE:
+				{
+					xl_heap_inplace *xlrec = (xl_heap_inplace*)buf->record_data;
+					relfilenode = &xlrec->target.node;
+					found_changes = false; /* <----- LOOK */
+					break;
+				}
+				/*
+				 * we only ever read changes, so row level locks aren't
+				 * interesting
+				 */
+				case XLOG_HEAP_LOCK:
+					break;
+
+				case XLOG_HEAP_INSERT:
+				{
+					xl_heap_insert *xlrec = (xl_heap_insert*)buf->record_data;
+					relfilenode = &xlrec->target.node;
+					found_changes = true;
+					break;
+				}
+				case XLOG_HEAP_UPDATE:
+				case XLOG_HEAP_HOT_UPDATE:
+				{
+					xl_heap_update *xlrec = (xl_heap_update*)buf->record_data;
+					relfilenode = &xlrec->target.node;
+					found_changes = true;
+					break;
+				}
+				case XLOG_HEAP_DELETE:
+				{
+					xl_heap_delete *xlrec = (xl_heap_delete*)buf->record_data;
+					relfilenode = &xlrec->target.node;
+					found_changes = true;
+					break;
+				}
+				default:
+					;
+			}
+			break;
+		}
+		case RM_HEAP2_ID:
+		{
+			/* some HEAP2 things don't necessarily happen in a transaction? */
+			if (!TransactionIdIsValid(xid))
+				break;
+
+			switch (info & XLOG_HEAP_OPMASK)
+			{
+				case XLOG_HEAP2_MULTI_INSERT:
+				{
+					xl_heap_multi_insert *xlrec =
+						(xl_heap_multi_insert*)buf->record_data;
+
+					relfilenode = &xlrec->node;
+
+					found_changes = true;
+
+					/*
+					 * we only decode the first tuple as all the following ones
+					 * will have the same cmin (and no cmax)
+					 */
+					break;
+				}
+				default:
+					;
+			}
+		}
+		break;
+	}
+
+	if (found_changes)
+	{
+		/*
+		 * we unfortunately cannot access the catalog of other databases, so
+		 * don't think about changes in them
+		 */
+		if (relfilenode->dbNode != MyDatabaseId)
+			;
+		/*
+		 * we need to keep track of new transactions as long as we don't know
+		 * what was already running. Only actual data changes are relevant,
+		 * so it's fine to track them here.
+		 */
+		else if (snapstate->state < SNAPBUILD_FULL_SNAPSHOT)
+			SnapBuildAddRunningTxn(snapstate, xid);
+		/*
+		 * No point in keeping track of changes in transactions that we don't
+		 * have enough information about to decode.
+		 */
+		else if (snapstate->state < SNAPBUILD_CONSISTENT &&
+		         SnapBuildTxnRunning(snapstate, xid))
+			;
+		else
+		{
+			bool does_timetravel;
+			bool old_tx = ApplyCacheIsXidKnown(applycache, xid);
+			bool found;
+			SnapstateTxnEnt *ent;
+
+			Assert(TransactionIdIsNormal(xid));
+			Assert(!SnapBuildTxnRunning(snapstate, xid));
+
+			ent = hash_search(snapstate->by_txn,
+			                  (void *)&xid,
+			                  HASH_FIND,
+			                  &found);
+
+			/* FIXME: For now skip transactions with catalog changes entirely */
+			if (ent && ent->does_timetravel)
+				does_timetravel = true;
+			else
+				does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode);
+
+			/*
+			 * we don't add catalog changes to the applycache; we could use
+			 * them to queue local cache inval messages for catalog tables if
+			 * the relmapper would map from relfilenode to relid with correct
+			 * visibility rules.
+			 */
+			if (!does_timetravel)
+				ret = SNAPBUILD_DECODE;
+
+			elog(LOG, "found changes in xid %u (known: %u), timetravel: %i",
+			     xid, old_tx, does_timetravel);
+
+			/*
+			 * FIXME: At this point we might have a problem if somebody were
+			 * to CLUSTER, REINDEX or similar a system table inside a
+			 * transaction and *also* made other catalog modifications: we
+			 * can only build proper snapshots to look at the catalog after
+			 * we have reached the commit record, because only then do we
+			 * know the subxids of a toplevel xid. Because we wouldn't notice
+			 * the changed system table relfilenodes we wouldn't see any of
+			 * those catalog changes.
+			 *
+			 * So we need to forbid that.
+			 */
+
+			if (!old_tx)
+			{
+				/* update global snapshot information */
+				if (does_timetravel)
+				{
+					ent = hash_search(snapstate->by_txn,
+					                  (void *)&xid,
+					                  HASH_ENTER,
+					                  &found);
+
+					elog(LOG, "found catalog change in tx %u without changes, did we know it: %u",
+					     xid, found);
+
+					ent->does_timetravel = true;
+
+				}
+				else
+				{
+					elog(LOG, "adding initial snapshot to xid %u", xid);
+				}
+
+				/* add the initial snapshot */
+				{
+					Snapshot snap = SnapBuildBuildSnapshot(snapstate, xid);
+
+					elog(LOG, "adding base snap");
+					ApplyCacheAddBaseSnapshot(applycache, xid,
+					                          InvalidXLogRecPtr,
+					                          snap);
+				}
+
+			}
+			/* update already distributed snapshots */
+			else if (does_timetravel && old_tx)
+			{
+				/*
+				 * check whether we already know the xid as a catalog modifying
+				 * one
+				 */
+				SnapstateTxnEnt *ent =
+					hash_search(snapstate->by_txn,
+					            (void *)&xid,
+					            HASH_ENTER,
+					            &found);
+
+				elog(LOG, "found catalog change in tx %u with changes, did we know it: %u",
+				     xid, found);
+
+				ent->does_timetravel = true;
+
+				/* FIXME: add a new CommandId to the applycache's ->changes queue */
+			}
+		}
+	}
+
+	return ret;
+}
+
+
+/* Does this relation carry catalog information? */
+static bool
+SnapBuildHasCatalogChanges(Snapstate* snapstate, TransactionId xid, RelFileNode* relfilenode)
+{
+	/* FIXME: build snapshot for transaction */
+	HeapTuple table;
+	Form_pg_class class_form;
+	Snapshot snap;
+
+	/* shared relations always carry catalog information */
+	if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
+		return true;
+
+	snap = SnapBuildBuildSnapshot(snapstate, xid);
+
+	SetupDecodingSnapshots(snap);
+
+	InvalidateSystemCaches();
+
+	table = LookupTableByRelFileNode(relfilenode);
+
+	RevertFromDecodingSnapshots();
+	InvalidateSystemCaches();
+
+	/*
+	 * tables in the default tablespace are stored in pg_class with 0 as their
+	 * reltablespace
+	 */
+	if (!HeapTupleIsValid(table))
+	{
+		if (relfilenode->relNode >= FirstNormalObjectId)
+		{
+			elog(WARNING, "failed pg_class lookup for %u:%u with a oid in >= FirstNormalObjectId",
+			     relfilenode->spcNode, relfilenode->relNode);
+		}
+		return true;
+	}
+
+	class_form = (Form_pg_class) GETSTRUCT(table);
+
+	return IsSystemClass(class_form);
+}
+
+/* build a new snapshot, based on currently committed transactions */
+static Snapshot
+SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid)
+{
+	Snapshot snapshot = malloc(sizeof(SnapshotData) +
+	                           sizeof(TransactionId) * snapstate->nrcommitted +
+	                           sizeof(TransactionId) * 1 /* toplevel xid */);
+
+	if (!snapshot)
+		elog(ERROR, "could not allocate memory for snapshot");
+
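+	/*
+	 * The snapshot is allocated as a single chunk: the SnapshotData struct
+	 * is directly followed by the ->xip array and then by ->subxip.
+	 */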
+	snapshot->satisfies = HeapTupleSatisfiesMVCCDuringDecoding;
+	/*
+	 * We copy all xids that have to be treated as committed to ->xip; the
+	 * xids belonging to the transaction we are decoding - which thus need to
+	 * be considered visible in SnapshotNow semantics - get copied to
+	 * ->subxip.
+	 * XXX: Do we want extra fields for those two instead?
+	 */
+	snapshot->xmin = snapstate->xmin;
+	snapshot->xmax = snapstate->xmax;
+
+	/* store all transactions to be treated as committed */
+	snapshot->xip = (TransactionId*)((char*)snapshot + sizeof(SnapshotData));
+
+	snapshot->xcnt = snapstate->nrcommitted;
+	memcpy(snapshot->xip, snapstate->committed,
+	       snapstate->nrcommitted * sizeof(TransactionId));
+
+	/* sort so we can bsearch() */
+	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+
+	/* store toplevel xid */
+	/*
+	 * FIXME: subtransaction handling currently needs to be done in
+	 * applycache. Yuck.
+	 */
+	snapshot->subxip = (TransactionId*)(
+		(char*)snapshot
+		+ sizeof(SnapshotData) /* offset to ->xip's data */
+		+ sizeof(TransactionId) * snapstate->nrcommitted /* data */
+		);
+
+	snapshot->subxcnt = 1;
+	snapshot->subxip[0] = xid;
+
+	snapshot->suboverflowed = false;
+	snapshot->takenDuringRecovery = false;
+	snapshot->copied = false;
+	snapshot->curcid = 0;
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+
+	return snapshot;
+}
+
+/* check whether `xid` is currently running */
+static bool
+SnapBuildTxnRunning(Snapstate* snapstate, TransactionId xid)
+{
+	if (snapstate->nrrunning &&
+	    TransactionIdFollowsOrEquals(xid, snapstate->xmin_running) &&
+	    TransactionIdPrecedesOrEquals(xid, snapstate->xmax_running))
+	{
+		/* don't shadow the xid parameter, it is the bsearch key */
+		TransactionId* found =
+			bsearch(&xid, snapstate->running, snapstate->nrrunning_initial,
+			        sizeof(TransactionId), xidComparator);
+
+		if (found != NULL)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * add a new SnapshotNow to all transactions we're decoding that are currently
+ * in-progress so they can see new catalog contents.
+ */
+static void
+SnapBuildDistributeSnapshotNow(Snapstate* snapstate, TransactionId xid)
+{
+	/* FIXME: implement */
+}
+
+/*
+ * Keep track of a new catalog changing transaction that has committed
+ */
+static void
+SnapBuildAddCommittedTxn(Snapstate* snapstate, TransactionId xid)
+{
+	if (snapstate->nrcommitted == snapstate->nrcommitted_space)
+	{
+		elog(WARNING, "increasing space for committed transactions");
+
+		snapstate->nrcommitted_space *= 2;
+		snapstate->committed = realloc(snapstate->committed,
+		                               snapstate->nrcommitted_space * sizeof(TransactionId));
+		if (!snapstate->committed)
+			elog(ERROR, "couldn't enlarge space for committed transactions");
+	}
+	snapstate->committed[snapstate->nrcommitted++] = xid;
+}
+
+/*
+ * Remove all transactions we treat as committed that are smaller than
+ * ->xmin. Those won't ever get checked via the ->committed array anyway.
+ */
+static void
+SnapBuildPurgeCommittedTxn(Snapstate* snapstate)
+{
+	int off;
+	TransactionId *workspace;
+	int surviving_xids = 0;
+
+	/* FIXME: Neater algorithm? */
+	workspace = malloc(snapstate->nrcommitted * sizeof(TransactionId));
+
+	if (!workspace)
+		elog(ERROR, "could not allocate memory for workspace during xmin purging");
+
+	for (off = 0; off < snapstate->nrcommitted; off++)
+	{
+		if (NormalTransactionIdFollows(snapstate->committed[off], snapstate->xmin))
+			workspace[surviving_xids++] = snapstate->committed[off];
+	}
+
+	memcpy(snapstate->committed, workspace,
+	       surviving_xids * sizeof(TransactionId));
+
+	snapstate->nrcommitted = surviving_xids;
+	free(workspace);
+}
+
+/*
+ * makes sure we have enough space for at least `count` additional txn's,
+ * reallocates if necessary
+ */
+static void
+SnapBuildReserveRunning(Snapstate *snapstate, Size count)
+{
+	const Size reserve = 100;
+
+	if (snapstate->nrrunning_initial + count < snapstate->nrrunning_space)
+		return;
+
+	if (snapstate->running)
+	{
+		snapstate->nrrunning_space += count + reserve;
+		snapstate->running =
+			realloc(snapstate->running,
+			        snapstate->nrrunning_space *
+			        sizeof(TransactionId));
+		if (!snapstate->running)
+			elog(ERROR, "could not reallocate ->running");
+	}
+	else
+	{
+		snapstate->nrrunning_space = count + reserve;
+		snapstate->running = malloc(snapstate->nrrunning_space
+		                            * sizeof(TransactionId));
+		if (!snapstate->running)
+			elog(ERROR, "could not allocate ->running");
+	}
+}
+
+/*
+ * To allow binary search in the set of running transactions, sort them with
+ * xidComparator.
+ */
+static void
+SnapBuildSortRunning(Snapstate *snapstate)
+{
+	qsort(snapstate->running, snapstate->nrrunning_initial,
+	      sizeof(TransactionId), xidComparator);
+}
+
+/*
+ * Add a transaction to the set of currently running transactions.
+ */
+static void
+SnapBuildAddRunningTxn(Snapstate *snapstate, TransactionId xid)
+{
+	Assert(snapstate->state == SNAPBUILD_INITIAL_POINT &&
+	       TransactionIdIsValid(snapstate->valid_after));
+
+	/*
+	 * we only need those running txn's if we're switching state due to
+	 * reaching the xmin horizon. Transactions from before we reached that
+	 * are not interesting.
+	 */
+	if (NormalTransactionIdPrecedes(xid, snapstate->valid_after))
+		return;
+
+	if (SnapBuildTxnRunning(snapstate, xid))
+		return;
+
+	Assert(!TransactionIdPrecedesOrEquals(xid, snapstate->xmin_running));
+
+	if (TransactionIdFollowsOrEquals(xid, snapstate->xmax_running))
+		snapstate->xmax_running = xid;
+
+	SnapBuildReserveRunning(snapstate, 1);
+
+	/* FIXME: inefficient insertion logic, should at least be insertion sort */
+	snapstate->running[snapstate->nrrunning_initial++] = xid;
+	snapstate->nrrunning++;
+	SnapBuildSortRunning(snapstate);
+}
+
+/*
+ * Common logic for SnapBuildAbortTxn and SnapBuildCommitTxn dealing with
+ * keeping track of the amount of running transactions.
+ */
+static void
+SnapBuildEndTxn(Snapstate* snapstate, TransactionId xid)
+{
+	if (snapstate->state == SNAPBUILD_CONSISTENT)
+		return;
+
+	if (SnapBuildTxnRunning(snapstate, xid))
+	{
+		if (!--snapstate->nrrunning)
+		{
+			/*
+			 * none of the originally running transactions is running
+			 * anymore; because of that our incrementally built snapshot is
+			 * now complete.
+			 */
+			elog(LOG, "found consistent point due to SnapBuildEndTxn + running: %u", xid);
+			snapstate->state = SNAPBUILD_CONSISTENT;
+		}
+	}
+}
+
+/* Abort a transaction, throw away all state we kept */
+static void
+SnapBuildAbortTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts, TransactionId* subxacts)
+{
+	bool found;
+	int i;
+
+	for(i = 0; i < nsubxacts; i++)
+	{
+		TransactionId subxid = subxacts[i];
+		SnapBuildEndTxn(snapstate, subxid);
+
+		hash_search(snapstate->by_txn,
+		            (void *)&subxid,
+		            HASH_REMOVE,
+		            &found);
+
+	}
+
+	SnapBuildEndTxn(snapstate, xid);
+
+	hash_search(snapstate->by_txn,
+	            (void *)&xid,
+	            HASH_REMOVE,
+	            &found);
+}
+
+/* Handle everything that needs to be done when a transaction commits */
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+                   TransactionId* subxacts)
+{
+	int off;
+	bool found;
+	bool forced_timetravel = false;
+	bool sub_does_timetravel = false;
+	SnapstateTxnEnt *ent;
+
+	/*
+	 * If we couldn't observe every change of a transaction because it was
+	 * already running at the point we started to observe, we have to assume
+	 * it made catalog changes.
+	 */
+	if (snapstate->state < SNAPBUILD_CONSISTENT && SnapBuildTxnRunning(snapstate, xid))
+	{
+		elog(LOG, "forced to assume catalog changes for xid %u because it was running to early", xid);
+		SnapBuildAddCommittedTxn(snapstate, xid);
+		forced_timetravel = true;
+	}
+
+	for(off = 0; off < nsubxacts; off++)
+	{
+		TransactionId subxid = subxacts[off];
+
+		ent = hash_search(snapstate->by_txn,
+		                  (void *)&subxid,
+		                  HASH_FIND,
+		                  &found);
+
+		if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(snapstate, subxid);
+		}
+		/* add subtransaction to base snapshot, we don't distinguish after that */
+		else if (found && ent->does_timetravel)
+		{
+			sub_does_timetravel = true;
+
+			elog(WARNING, "found subtransaction %u:%u with catalog changes",
+			     xid, subxid);
+
+			SnapBuildAddCommittedTxn(snapstate, subxid);
+		}
+
+		/* make sure it's not tracked in running txn's anymore, switch state */
+		SnapBuildEndTxn(snapstate, subxid);
+
+		if (found)
+		{
+			hash_search(snapstate->by_txn,
+			            (void *)&xid,
+			            HASH_REMOVE,
+			            &found);
+			Assert(found);
+		}
+
+		if (NormalTransactionIdFollows(subxid, snapstate->xmax))
+		{
+			snapstate->xmax = subxid;
+			TransactionIdAdvance(snapstate->xmax);
+		}
+	}
+
+	/* make sure it's not tracked in running txn's anymore, switch state */
+	SnapBuildEndTxn(snapstate, xid);
+
+	ent =
+		hash_search(snapstate->by_txn,
+		            (void *)&xid,
+		            HASH_FIND,
+		            &found);
+
+	/* add toplevel transaction to base snapshot */
+	if (found && ent->does_timetravel)
+	{
+		elog(DEBUG1, "found top level transaction %u, with catalog changes !!!!", xid);
+		SnapBuildAddCommittedTxn(snapstate, xid);
+	}
+
+	if ((found && ent->does_timetravel) || sub_does_timetravel || forced_timetravel)
+	{
+		elog(DEBUG1, "found transaction %u, with catalog changes !!!!", xid);
+
+		/* add a new SnapshotNow to all currently running transactions */
+		SnapBuildDistributeSnapshotNow(snapstate, xid);
+	}
+
+	if (found)
+	{
+		/* now we don't need the contents anymore, remove */
+		hash_search(snapstate->by_txn,
+		            (void *)&xid,
+		            HASH_REMOVE,
+		            &found);
+		Assert(found);
+	}
+
+	if (NormalTransactionIdFollows(xid, snapstate->xmax))
+	{
+		snapstate->xmax = xid;
+		TransactionIdAdvance(snapstate->xmax);
+	}
+}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index b531db5..25af26a 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -65,6 +65,7 @@
 #include "storage/bufmgr.h"
 #include "storage/procarray.h"
 #include "utils/tqual.h"
+#include "utils/builtins.h"
 
 
 /* Static variables representing various special snapshot semantics */
@@ -73,6 +74,8 @@ SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
 SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
 SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
 
+static Snapshot SnapshotNowDecoding;
+
 /* local functions */
 static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
@@ -1375,3 +1378,161 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 
 	return false;
 }
+
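+/* the xip/subxip arrays are sorted with xidComparator, so bsearch() works */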
+static bool
+TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
+{
+	return bsearch(&xid, xip, num,
+	               sizeof(TransactionId), xidComparator) != NULL;
+}
+
+
+/*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+ * reading catalog pages which couldn't have been created in an older version.
+ *
+ * Basically we track all (sub)transactions with catalog modifications that
+ * committed while the snapshot was built in ->xip: for everything below
+ * xmin the normal commit status from the clog is authoritative, everything
+ * above xmax is treated as in-progress, anything in ->xip is treated as
+ * committed and everything else in between is treated as in-progress.
+ */
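+/*
+ * Example with made-up numbers: for a snapshot with xmin = 10, xmax = 20 and
+ * xip = {12, 15}, a tuple with xmin = 8 is checked against the clog, one
+ * with xmin = 25 is invisible, one with xmin = 12 is visible because 12 is
+ * in ->xip, and one with xmin = 14 is invisible because we never saw that
+ * transaction commit.
+ */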
+bool
+HeapTupleSatisfiesMVCCDuringDecoding(HeapTupleHeader tuple, Snapshot snapshot,
+                                     Buffer buffer)
+{
+	TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
+	TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
+
+	/*
+	 * FIXME: The not yet existing decoding infrastructure will need to force
+	 * the xmin to stay lower than what they are currently decoding.
+	 */
+	bool fixme_xmin_horizon = false;
+
+	if (fixme_xmin_horizon && (tuple->t_infomask & HEAP_XMIN_INVALID))
+	{
+		return false;
+	}
+	/* normal transaction state counts */
+	else if (TransactionIdPrecedes(xmin, snapshot->xmin))
+	{
+		if (!TransactionIdDidCommit(xmin))
+			return false;
+	}
+	/* beyond our xmax horizon, i.e. invisible */
+	else if (TransactionIdFollows(xmin, snapshot->xmax))
+	{
+		return false;
+	}
+	/* check if it's one of our txids, the toplevel xid is also in there */
+	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	{
+		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID)
+		{
+			elog(WARNING, "combocids not yet supported");
+			return false;
+		}
+		if (cmin >= snapshot->curcid)
+			return false;	/* inserted after scan started */
+	}
+	/* check if we know the transaction has committed */
+	else if (TransactionIdInArray(xmin, snapshot->xip, snapshot->xcnt))
+	{
+		/* xmin is known to have committed: go on to check xmax */
+	}
+	else
+	{
+		return false;
+	}
+
+	/* at this point we know xmin is visible */
+
+	/* why should those be in catalog tables? */
+	Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
+
+	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
+		return true;
+
+	if (tuple->t_infomask & HEAP_IS_LOCKED)
+		return true;
+
+	/* we cannot possibly see the deleting transaction */
+	if (TransactionIdFollows(xmax, snapshot->xmax))
+	{
+		return true;
+	}
+	/* normal transaction state is valid */
+	else if (TransactionIdPrecedes(xmax, snapshot->xmin))
+	{
+		return !TransactionIdDidCommit(xmax);
+	}
+	/* check if it's one of our txids, the toplevel xid is also in there */
+	else if (TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt))
+	{
+		CommandId cmax = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID)
+		{
+			elog(WARNING, "combocids not yet supported");
+			return true;
+		}
+
+		if (cmax >= snapshot->curcid)
+			return true;	/* deleted after scan started */
+		else
+			return false;	/* deleted before scan started */
+	}
+	/* do we know that the deleting txn is valid? */
+	else if (TransactionIdInArray(xmax, snapshot->xip, snapshot->xcnt))
+	{
+		return false;
+	}
+	else
+	{
+		return true;
+	}
+}
+
+static bool
+FailsSatisfies(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+{
+	elog(ERROR, "should not be called after SetupDecodingSnapshots!");
+	return false;
+}
+
+static bool
+RedirectSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+{
+	Assert(SnapshotNowDecoding != NULL);
+	return HeapTupleSatisfiesMVCCDuringDecoding(tuple, SnapshotNowDecoding,
+	                                            buffer);
+}
+
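+/*
+ * Redirect SnapshotNow to the timetravel snapshot and make the other static
+ * snapshots fail hard, so stray accesses during decoding are caught instead
+ * of silently returning wrong results.
+ */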
+void
+SetupDecodingSnapshots(Snapshot snapshot_now)
+{
+	SnapshotNowData.satisfies = RedirectSatisfiesNow;
+	SnapshotSelfData.satisfies = FailsSatisfies;
+	SnapshotAnyData.satisfies = FailsSatisfies;
+	SnapshotToastData.satisfies = FailsSatisfies;
+
+	SnapshotNowDecoding = snapshot_now;
+}
+
+void
+RevertFromDecodingSnapshots(void)
+{
+	SnapshotNowDecoding = NULL;
+
+	SnapshotNowData.satisfies = HeapTupleSatisfiesNow;
+	SnapshotSelfData.satisfies = HeapTupleSatisfiesSelf;
+	SnapshotAnyData.satisfies = HeapTupleSatisfiesAny;
+	SnapshotToastData.satisfies = HeapTupleSatisfiesToast;
+}
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 228f6a1..915b2cd 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -63,6 +63,11 @@
 	(AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
 	(int32) ((id1) - (id2)) < 0)
 
+/* compare two XIDs already known to be normal; this is a macro for speed */
+#define NormalTransactionIdFollows(id1, id2) \
+	(AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
+	(int32) ((id1) - (id2)) > 0)
+
 /* ----------
  *		Object ID (OID) zero is InvalidOid.
  *
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d88248a..b5b886b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4655,6 +4655,9 @@ DESCR("SP-GiST support for suffix tree over text");
 DATA(insert OID = 4031 (  spg_text_leaf_consistent	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2281 2281" _null_ _null_ _null_ _null_  spg_text_leaf_consistent _null_ _null_ _null_ ));
 DESCR("SP-GiST support for suffix tree over text");
 
+DATA(insert OID = 4033 (  decode_xlog	PGNSP PGUID 12 1  0 0 0 f f f f t f i 2 0 16 "2275 2275" _null_ _null_ _null_ _null_ decode_xlog _null_ _null_ _null_ ));
+DESCR("decode xlog");
+
 DATA(insert OID = 3469 (  spg_range_quad_config	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2278 "2281 2281" _null_ _null_ _null_ _null_  spg_range_quad_config _null_ _null_ _null_ ));
 DESCR("SP-GiST support for quad tree over range");
 DATA(insert OID = 3470 (  spg_range_quad_choose	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2278 "2281 2281" _null_ _null_ _null_ _null_  spg_range_quad_choose _null_ _null_ _null_ ));
diff --git a/src/include/replication/applycache.h b/src/include/replication/applycache.h
new file mode 100644
index 0000000..f101eeb
--- /dev/null
+++ b/src/include/replication/applycache.h
@@ -0,0 +1,239 @@
+/*
+ * applycache.h
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/replication/applycache.h
+ */
+#ifndef APPLYCACHE_H
+#define APPLYCACHE_H
+
+#include "access/htup_details.h"
+#include "utils/hsearch.h"
+#include "utils/ilist.h"
+#include "utils/snapshot.h"
+
+typedef struct ApplyCache ApplyCache;
+
+enum ApplyCacheChangeType
+{
+	APPLY_CACHE_CHANGE_INSERT,
+	APPLY_CACHE_CHANGE_UPDATE,
+	APPLY_CACHE_CHANGE_DELETE,
+	/*
+	 * for efficiency and simplicity reasons we keep those in the same list,
+	 * that's somewhat annoying because switch()es warn if those aren't
+	 * handled... Make those private values?
+	 */
+	APPLY_CACHE_CHANGE_SNAPSHOT,
+	APPLY_CACHE_CHANGE_COMMAND_ID
+};
+
+typedef struct ApplyCacheTupleBuf
+{
+	/* position in preallocated list */
+	ilist_s_node node;
+
+	HeapTupleData tuple;
+	HeapTupleHeaderData header;
+	char data[MaxHeapTupleSize];
+} ApplyCacheTupleBuf;
+
+typedef struct ApplyCacheChange
+{
+	XLogRecPtr lsn;
+	enum ApplyCacheChangeType action;
+
+	RelFileNode relnode;
+
+	union {
+		ApplyCacheTupleBuf* newtuple;
+		Snapshot snapshot;
+		CommandId command_id;
+	};
+	ApplyCacheTupleBuf* oldtuple;
+
+
+	HeapTuple table;
+
+	/*
+	 * While in use this is how a change is linked into a transaction's list
+	 * of changes; otherwise it's the position in the preallocated list.
+	 */
+	ilist_d_node node;
+} ApplyCacheChange;
+
+typedef struct ApplyCacheTXN
+{
+	TransactionId xid;
+
+	XLogRecPtr lsn;
+
+	/*
+	 * How many ApplyCacheChange's do we have in this txn.
+	 *
+	 * Subtransactions are *not* included.
+	 */
+	Size nentries;
+
+	/*
+	 * How many of the above entries are stored in memory in contrast to being
+	 * spilled to disk.
+	 */
+	Size nentries_mem;
+
+	/*
+	 * List of actual changes
+	 */
+	ilist_d_head changes;
+
+	/*
+	 * non-hierarchical list of subtransactions that are *not* aborted
+	 */
+	ilist_d_head subtxns;
+
+	/*
+	 * our position in a list of subtransactions while the TXN is in
+	 * use. Otherwise its the position in the list of preallocated
+	 * transactions.
+	 */
+	ilist_d_node node;
+
+	/*
+	 * List of (lsn, command_id).
+	 *
+	 * Every time a catalog change happens this list gets appended with the
+	 * current commandid. This is used to be able to construct proper
+	 * Snapshots for decoding.
+	 */
+	ilist_d_head commandids;
+
+	/*
+	 * List of (lsn, Snapshot) pairs.
+	 *
+	 * The first record always is the (InvalidXLogRecPtr, SnapshotAtStart)
+	 * pair. Every time *another* transaction commits this gets appended with a
+	 * new Snapshot that has enough information to make SnapshotNow lookups.
+	 */
+	ilist_d_head snapshots;
+} ApplyCacheTXN;
+
+
+/* XXX: we're currently passing the originating subtxn. Not sure that's necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+
+/*
+ * The max number of concurrent top-level transactions, or transactions where
+ * we don't know whether they are top-level, can be calculated as:
+ * (max_connections + max_prepared_xacts + ?) * PGPROC_MAX_CACHED_SUBXIDS
+ */
+struct ApplyCache
+{
+	/*
+	 * Should snapshots for decoding be collected? If many catalog changes
+	 * happen this can be considerably expensive.
+	 */
+	bool build_snapshots;
+
+	TransactionId last_txn;
+	ApplyCacheTXN *last_txn_cache;
+	HTAB *by_txn;
+
+	ApplyCacheBeginCB begin;
+	ApplyCacheApplyChangeCB apply_change;
+	ApplyCacheCommitCB commit;
+
+	void* private_data;
+
+	MemoryContext context;
+
+	/*
+	 * we don't want to repeatedly (de-)allocate those structs, so we cache
+	 * them for reuse.
+	 */
+	ilist_d_head cached_transactions;
+	size_t nr_cached_transactions;
+
+	ilist_d_head cached_changes;
+	size_t nr_cached_changes;
+
+	ilist_s_head cached_tuplebufs;
+	size_t nr_cached_tuplebufs;
+};
+
+
+ApplyCache*
+ApplyCacheAllocate(void);
+
+void
+ApplyCacheFree(ApplyCache*);
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache*);
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple);
+
+/*
+ * Returns a (potentially preallocated) change struct. Its lifetime is managed
+ * by the applycache module.
+ *
+ * If not added to a transaction with ApplyCacheAddChange it needs to be
+ * returned via ApplyCacheReturnChange.
+ *
+ * FIXME: better name
+ */
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache*);
+
+/*
+ * Return an unused ApplyCacheChange struct
+ */
+void
+ApplyCacheReturnChange(ApplyCache*, ApplyCacheChange*);
+
+
+/*
+ * record the transaction as in-progress if not already done, add the current
+ * change.
+ *
+ * We have a one-entry cache for looking up the current ApplyCacheTXN so we
+ * don't need to do a full hash-lookup if the same xid is used sequentially,
+ * which is rather frequent.
+ */
+void
+ApplyCacheAddChange(ApplyCache*, TransactionId, XLogRecPtr lsn, ApplyCacheChange*);
+
+/*
+ * Replay a transaction's changes - including those of its committed
+ * subtransactions - in commit order via the begin/apply_change/commit
+ * callbacks.
+ */
+void
+ApplyCacheCommit(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheCommitChild(ApplyCache*, TransactionId, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheAbort(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+/*
+ * if lsn == InvalidXLogRecPtr this is the first snap for the transaction
+ */
+void
+ApplyCacheAddBaseSnapshot(ApplyCache*, TransactionId, XLogRecPtr lsn, Snapshot snap);
+
+/*
+ * Will only be called for command ids > 1
+ */
+void
+ApplyCacheAddNewCommandId(ApplyCache*, TransactionId, XLogRecPtr lsn, CommandId cid);
+
+bool
+ApplyCacheIsXidKnown(ApplyCache* cache, TransactionId xid);
+#endif
diff --git a/src/include/replication/decode.h b/src/include/replication/decode.h
new file mode 100644
index 0000000..86312d1
--- /dev/null
+++ b/src/include/replication/decode.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ * decode.h
+ *     PostgreSQL WAL to logical transformation
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DECODE_H
+#define DECODE_H
+
+#include "access/xlogreader.h"
+#include "replication/applycache.h"
+
+struct Snapstate;
+
+typedef struct ReaderApplyState
+{
+	ApplyCache *apply_cache;
+	struct Snapstate *snapstate;
+} ReaderApplyState;
+
+void DecodeRecordIntoApplyCache(ReaderApplyState* state, XLogRecordBuffer* buf);
+
+#endif
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
new file mode 100644
index 0000000..ed92e75
--- /dev/null
+++ b/src/include/replication/snapbuild.h
@@ -0,0 +1,119 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.h
+ *	  Exports from replication/logical/snapbuild.c.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/replication/snapbuild.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SNAPBUILD_H
+#define SNAPBUILD_H
+
+#include "replication/applycache.h"
+
+#include "utils/hsearch.h"
+#include "utils/snapshot.h"
+#include "access/htup.h"
+
+typedef enum
+{
+	SNAPBUILD_START,
+	/*
+	 * Found initial visibility information.
+	 *
+	 * That's either an XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT_SHUTDOWN record.
+	 */
+	SNAPBUILD_INITIAL_POINT,
+	/*
+	 * We have collected enough information to decode tuples in transactions
+	 * that started after this.
+	 *
+	 * Once we reach this state we start to collect changes. We cannot apply
+	 * them yet because they might be based on transactions that were still
+	 * running when we reached this point.
+	 */
+	SNAPBUILD_FULL_SNAPSHOT,
+	/*
+	 * Found a point, after reaching SNAPBUILD_FULL_SNAPSHOT, where all
+	 * transactions that were running at that point have finished. Until we
+	 * reach this state we hold off calling any commit callbacks.
+	 */
+	SNAPBUILD_CONSISTENT
+} SnapBuildState;
+
+typedef enum
+{
+	SNAPBUILD_SKIP,
+	SNAPBUILD_DECODE
+} SnapBuildAction;
+
+typedef struct Snapstate
+{
+	SnapBuildState state;
+
+	/* all transactions smaller than this have committed/aborted */
+	TransactionId xmin;
+
+	/* all transactions bigger than this are uncommitted */
+	TransactionId xmax;
+
+	/*
+	 * All transactions in this window have to be checked via the running
+	 * array. This will only be used initially, until we are past xmax_running.
+	 *
+	 * Note that we initially treat already-running transactions as having
+	 * catalog modifications because we don't have enough information about
+	 * them to properly judge that.
+	 */
+	TransactionId xmin_running;
+	TransactionId xmax_running;
+
+	/* sorted array of running transactions, can be searched with bsearch() */
+	TransactionId* running;
+	/* how many running transactions remain */
+	size_t nrrunning;
+	/* how much free space do we have to add more running txn's */
+	size_t nrrunning_space;
+	/*
+	 * We need to keep track of the number of tracked transactions separately
+	 * from nrrunning_space, as nrrunning_initial gives the range of valid
+	 * xids in the array so bsearch() can work.
+	 */
+	size_t nrrunning_initial;
+
+	TransactionId valid_after;
+
+	/*
+	 * Running (sub-)transactions with catalog changes. This will be used to
+	 * fill the committed array with a transaction's xid and all its subxids
+	 * at commit.
+	 */
+	HTAB *by_txn;
+
+	/*
+	 * Transactions which could have catalog changes that committed between
+	 * xmin and xmax; contains all catalog-modifying txns.
+	 */
+	size_t nrcommitted;
+	size_t nrcommitted_space;
+	TransactionId* committed;
+} Snapstate;
+
+extern Snapstate*
+AllocateSnapshotBuilder(ApplyCache *cache);
+
+extern void
+FreeSnapshotBuilder(Snapstate *cache);
+
+extern SnapBuildAction
+SnapBuildCallback(ApplyCache *cache, Snapstate* snapstate, XLogRecordBuffer* buf);
+
+extern HeapTuple
+LookupTableByRelFileNode(RelFileNode* r);
+
+#endif /* SNAPBUILD_H */
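
The state progression described in the comments above, restated as a small
hypothetical sketch (the real transitions live in snapbuild.c and depend on
the xlog records actually seen):

static void
SnapBuildAdvanceState(Snapstate *snapstate,
					  bool found_visibility_info,
					  bool have_full_snapshot,
					  bool initial_xacts_finished)
{
	if (snapstate->state == SNAPBUILD_START && found_visibility_info)
		snapstate->state = SNAPBUILD_INITIAL_POINT;

	if (snapstate->state == SNAPBUILD_INITIAL_POINT && have_full_snapshot)
		snapstate->state = SNAPBUILD_FULL_SNAPSHOT;

	/*
	 * Commit callbacks may only start once every transaction that was still
	 * running when we reached SNAPBUILD_FULL_SNAPSHOT has finished.
	 */
	if (snapstate->state == SNAPBUILD_FULL_SNAPSHOT && initial_xacts_finished)
		snapstate->state = SNAPBUILD_CONSISTENT;
}
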
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index ff74f86..6c9b261 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -39,7 +39,8 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
 
 /* This macro encodes the knowledge of which snapshots are MVCC-safe */
 #define IsMVCCSnapshot(snapshot)  \
-	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
+	((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
+	 (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecoding)
 
 /*
  * HeapTupleSatisfiesVisibility
@@ -89,4 +90,22 @@ extern bool HeapTupleIsSurelyDead(HeapTupleHeader tuple,
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 					 uint16 infomask, TransactionId xid);
 
+/*
+ * Special "satisfies" routines used while decoding xlog from a different
+ * point in the lsn sequence. Also used for timetravel SnapshotNow lookups.
+ */
+extern bool HeapTupleSatisfiesMVCCDuringDecoding(HeapTupleHeader tuple,
+                                                 Snapshot snapshot, Buffer buffer);
+
+/*
+ * Install the 'snapshot_now' snapshot as a timetravelling snapshot replacing
+ * the normal SnapshotNow behaviour. This snapshot needs to have been created
+ * by snapbuild.c, otherwise you will see crashes!
+ *
+ * FIXME: We need something resembling the real SnapshotNow to handle things
+ * like enum lookups from indices correctly.
+ */
+extern void SetupDecodingSnapshots(Snapshot snapshot_now);
+extern void RevertFromDecodingSnapshots(void);
+
 #endif   /* TQUAL_H */
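
To make the applycache API above concrete, here is a minimal consumer sketch.
The demo_* callbacks and the driver are invented for illustration; only the
ApplyCache* types and functions come from applycache.h:

#include "postgres.h"
#include "replication/applycache.h"

static void
demo_begin(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(WARNING, "BEGIN");
}

static void
demo_change(ApplyCache *cache, ApplyCacheTXN *txn,
			ApplyCacheTXN *subtxn, ApplyCacheChange *change)
{
	/* a real consumer would turn the change into text or SQL here */
	elog(WARNING, "change");
}

static void
demo_commit(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(WARNING, "COMMIT");
}

/*
 * Queue a single change for xid and replay the transaction once its commit
 * record has been seen.
 */
static void
demo_feed(TransactionId xid, XLogRecPtr lsn)
{
	ApplyCache *cache = ApplyCacheAllocate();
	ApplyCacheChange *change;

	cache->begin = demo_begin;
	cache->apply_change = demo_change;
	cache->commit = demo_commit;

	change = ApplyCacheGetChange(cache);
	/* fill in the change from the xlog record before adding it */
	ApplyCacheAddChange(cache, xid, lsn, change);

	/* reading the commit record triggers all three callbacks */
	ApplyCacheCommit(cache, xid, lsn);

	ApplyCacheFree(cache);
}
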
#10Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
git tree

Hi,

A last note:

A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git
branch xlog-decoding-rebasing-cf2

checkout with:

git clone --branch xlog-decoding-rebasing-cf2 git://git.postgresql.org/git/users/andresfreund/postgres.git

Webview:

http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

That branch will be regularly rebased onto a new master, with fixes and new
features added, plus a pgindent run over the new files...

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#10)
Re: git tree

On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:

That branch will be regularly rebased onto a new master, with fixes and new
features added, plus a pgindent run over the new files...

I fixed up the formatting of the new stuff (xlogreader, ilist are submitted
separately, no point in doing anything there).

pushed to the repo mentioned upthread.

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#6)
Add pg_relation_by_filenode(reltbspc, filenode) admin function

Now that I have proposed a new syscache upthread, it's easy to provide
pg_relation_by_filenode, which I have wished for multiple times in the past
when looking at filesystem activity and wondering which table does what. You
can sort of get the same result via

SELECT oid FROM (
SELECT oid, pg_relation_filenode(oid::regclass) filenode
FROM pg_class WHERE relkind != 'v'
) map
WHERE map.filenode = ...;

but that's neither efficient nor obvious.

So, two patches to do this:

Did others need this in the past? I can live with the 2nd patch living in a
private extension somewhere. The first one would also be useful for some
error/debug messages during decoding...

Greetings,

Andres

#13Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#12)
1 attachment(s)
[PATCH 1/2] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

---
src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
src/include/utils/relmapper.h | 2 ++
2 files changed, 55 insertions(+)

Attachments:

0001-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patch
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6f21495..771f34d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -180,6 +180,59 @@ RelationMapOidToFilenode(Oid relationId, bool shared)
 	return InvalidOid;
 }
 
+/* RelationMapFilenodeToOid
+ *
+ * Do the reverse of the normal direction of mapping done in
+ * RelationMapOidToFilenode.
+ *
+ * This is not supposed to be used during normal running but rather for
+ * information purposes when looking at the filesystem or the xlog.
+ *
+ * Returns InvalidOid if the OID is not known, which can easily happen if the
+ * filenode does not belong to a relation that is nailed or shared, or if it
+ * simply doesn't exist anywhere.
+ */
+Oid
+RelationMapFilenodeToOid(Oid filenode, bool shared)
+{
+	const RelMapFile *map;
+	int32		i;
+
+	/* If there are active updates, believe those over the main maps */
+	if (shared)
+	{
+		map = &active_shared_updates;
+		for (i = 0; i < map->num_mappings; i++)
+		{
+			if (filenode == map->mappings[i].mapfilenode)
+				return map->mappings[i].mapoid;
+		}
+		map = &shared_map;
+		for (i = 0; i < map->num_mappings; i++)
+		{
+			if (filenode == map->mappings[i].mapfilenode)
+				return map->mappings[i].mapoid;
+		}
+	}
+	else
+	{
+		map = &active_local_updates;
+		for (i = 0; i < map->num_mappings; i++)
+		{
+			if (filenode == map->mappings[i].mapfilenode)
+				return map->mappings[i].mapoid;
+		}
+		map = &local_map;
+		for (i = 0; i < map->num_mappings; i++)
+		{
+			if (filenode == map->mappings[i].mapfilenode)
+				return map->mappings[i].mapoid;
+		}
+	}
+
+	return InvalidOid;
+}
+
 /*
  * RelationMapUpdateMap
  *
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 111a05c..4e56508 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -36,6 +36,8 @@ typedef struct xl_relmap_update
 
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
+extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 					 bool immediate);
 
#14Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#12)
1 attachment(s)
[PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

This requires the previously added RELFILENODE syscache.
---
doc/src/sgml/func.sgml | 23 ++++++++++++-
src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.h | 2 ++
src/include/utils/builtins.h | 1 +
4 files changed, 103 insertions(+), 1 deletion(-)

Attachments:

0002-Add-a-new-function-pg_relation_by_filenode-to-lookup.patch
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f8f63d8..708da35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15170,7 +15170,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 
    <para>
     The functions shown in <xref linkend="functions-admin-dblocation"> assist
-    in identifying the specific disk files associated with database objects.
+    in identifying the specific disk files associated with database objects or doing the reverse.
    </para>
 
    <indexterm>
@@ -15179,6 +15179,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <indexterm>
     <primary>pg_relation_filepath</primary>
    </indexterm>
+   <indexterm>
+    <primary>pg_relation_by_filenode</primary>
+   </indexterm>
 
    <table id="functions-admin-dblocation">
     <title>Database Object Location Functions</title>
@@ -15207,6 +15210,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         File path name of the specified relation
        </entry>
       </row>
+      <row>
+       <entry>
+        <literal><function>pg_relation_by_filenode(<parameter>tablespace</parameter> <type>oid</type>, <parameter>filenode</parameter> <type>oid</type>)</function></literal>
+        </entry>
+       <entry><type>regclass</type></entry>
+       <entry>
+        Find the relation associated with a filenode
+       </entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -15230,6 +15242,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
     the relation.
    </para>
 
+   <para>
+    <function>pg_relation_by_filenode</> is the reverse of
+    <function>pg_relation_filenode</>. Given a <quote>tablespace</> OID and
+    a <quote>filenode</> it returns the associated relation. The default
+    tablespace for user tables can be passed as 0. Check the
+    documentation of <function>pg_relation_filenode</> for an explanation of
+    why this cannot always easily be answered by querying <structname>pg_class</>.
+   </para>
+
   </sect2>
 
   <sect2 id="functions-admin-genfile">
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..841a445 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -741,6 +741,84 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Get the relation via (reltablespace, relfilenode)
+ *
+ * This is expected to be used when somebody wants to match an individual file
+ * on the filesystem back to its table. That's not trivially possible via
+ * pg_class because that doesn't contain the relfilenodes of shared and nailed
+ * tables.
+ *
+ * We don't fail but return NULL if we cannot find a mapping.
+ *
+ * Instead of knowing DEFAULTTABLESPACE_OID you can pass 0.
+ */
+Datum
+pg_relation_by_filenode(PG_FUNCTION_ARGS)
+{
+	Oid			reltablespace = PG_GETARG_OID(0);
+	Oid			relfilenode = PG_GETARG_OID(1);
+	Oid			lookup_tablespace = reltablespace;
+	Oid         result = InvalidOid;
+	HeapTuple	tuple;
+
+	if (reltablespace == 0)
+		reltablespace = DEFAULTTABLESPACE_OID;
+
+	/* pg_class stores 0 instead of DEFAULTTABLESPACE_OID */
+	if (reltablespace == DEFAULTTABLESPACE_OID)
+		lookup_tablespace = 0;
+
+	tuple = SearchSysCache2(RELFILENODE,
+							lookup_tablespace,
+							relfilenode);
+
+	/* found it in the system catalog, so it cannot be a shared/nailed table */
+	if (HeapTupleIsValid(tuple))
+	{
+		result = HeapTupleHeaderGetOid(tuple->t_data);
+		ReleaseSysCache(tuple);
+	}
+	else
+	{
+		if (reltablespace == GLOBALTABLESPACE_OID)
+		{
+			result = RelationMapFilenodeToOid(relfilenode, true);
+		}
+		else
+		{
+			Form_pg_class relform;
+
+			result = RelationMapFilenodeToOid(relfilenode, false);
+
+			if (result != InvalidOid)
+			{
+				/* check that we found the correct relation */
+				tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(result));
+
+				if (!HeapTupleIsValid(tuple))
+				{
+					elog(ERROR, "could not refind previously looked-up relation with oid %u",
+						 result);
+				}
+
+				relform = (Form_pg_class) GETSTRUCT(tuple);
+
+				if (relform->reltablespace != reltablespace)
+					result = InvalidOid;
+
+				ReleaseSysCache(tuple);
+			}
+		}
+	}
+
+	if (!OidIsValid(result))
+		PG_RETURN_NULL();
+	else
+		PG_RETURN_OID(result);
+}
+
+/*
  * Get the pathname (relative to $PGDATA) of a relation
  *
  * See comments for pg_relation_filenode.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b5b886b..c8233cd 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3430,6 +3430,8 @@ DATA(insert OID = 2998 ( pg_indexes_size		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1
 DESCR("disk space usage for all indexes attached to the specified table");
 DATA(insert OID = 2999 ( pg_relation_filenode	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "2205" _null_ _null_ _null_ _null_ pg_relation_filenode _null_ _null_ _null_ ));
 DESCR("filenode identifier of relation");
+DATA(insert OID = 3170 ( pg_relation_by_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 2205 "26 26" _null_ _null_ _null_ _null_ pg_relation_by_filenode _null_ _null_ _null_ ));
+DESCR("relation of a filenode identifier");
 DATA(insert OID = 3034 ( pg_relation_filepath	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "2205" _null_ _null_ _null_ _null_ pg_relation_filepath _null_ _null_ _null_ ));
 DESCR("file path of relation");
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c9c665d..8ee4c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -458,6 +458,7 @@ extern Datum pg_table_size(PG_FUNCTION_ARGS);
 extern Datum pg_indexes_size(PG_FUNCTION_ARGS);
 extern Datum pg_relation_filenode(PG_FUNCTION_ARGS);
 extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
+extern Datum pg_relation_by_filenode(PG_FUNCTION_ARGS);
 
 /* genfile.c */
 extern bytea *read_binary_file(const char *filename,
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

Andres Freund <andres@2ndquadrant.com> writes:

This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).

regards, tom lane

#16Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#15)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).

Well, one patch upthread ;). It mentions the problem of the key not being
unique, due to relfilenode in (reltablespace, relfilenode) being 0 for
shared/nailed catalogs.

I am not really sure yet how big a problem it is for the caching
infrastructure that values which shouldn't ever get queried (because the
relfilenode is actually different) are duplicated. I am reading the relevant
code at the moment.

Robert suggested writing a specialized cache akin to what's done in
attoptcache.c or the like.

I haven't formed an opinion on what's the way forward on that topic. But
anyway, I don't see how the wal decoding stuff can progress without some
variant of that mapping, so I sure hope I/we can build something. Changing
that aspect of the patch should be trivial...
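
For the record, a bare-bones sketch of the kind of specialized cache Robert
suggested might look like the following (hypothetical names, no invalidation
handling, loosely modeled on attoptcache.c):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct FilenodeCacheKey
{
	Oid			reltablespace;
	Oid			relfilenode;
} FilenodeCacheKey;

typedef struct FilenodeCacheEntry
{
	FilenodeCacheKey key;		/* hash key, must be first */
	Oid			reloid;
} FilenodeCacheEntry;

static HTAB *FilenodeCacheHash = NULL;

static void
InitializeFilenodeCache(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(FilenodeCacheKey);
	ctl.entrysize = sizeof(FilenodeCacheEntry);
	ctl.hash = tag_hash;

	FilenodeCacheHash = hash_create("Filenode cache", 64, &ctl,
									HASH_ELEM | HASH_FUNCTION);

	/* a real implementation would also hook into cache invalidation here */
}

static Oid
FilenodeCacheLookup(Oid reltablespace, Oid relfilenode)
{
	FilenodeCacheKey key;
	FilenodeCacheEntry *entry;
	bool		found;

	if (FilenodeCacheHash == NULL)
		InitializeFilenodeCache();

	key.reltablespace = reltablespace;
	key.relfilenode = relfilenode;

	entry = (FilenodeCacheEntry *)
		hash_search(FilenodeCacheHash, &key, HASH_FIND, &found);

	return found ? entry->reloid : InvalidOid;
}
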

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#4)
1 attachment(s)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 15.09.2012 03:39, Andres Freund wrote:

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would
require changes to rather integral parts of the recovery code, which doesn't
seem to be a good idea.

My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog. 2. We should focus on
reading WAL; I don't see the point of mixing WAL writing into this. 3. I
don't like the callback-style API.

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before, I didn't
like the callback-style API; I think the consumer of the API should rather
just call ReadRecord repeatedly to get each record. So that's what I did.

There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.

So the usage is like this:

/*
 * Callback to read the page starting at 'RecPtr' into *readBuf. It's
 * up to you to do this any way you like. Typically you'd read from a
 * file. The WAL recovery implementation of this in xlog.c is more
 * complicated. It checks the archive, waits for streaming replication
 * etc.
 */
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
			   char *readBuf, void *private_data)
{
	...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
	/* do something with the record */
}

XLogReaderFree(state);

- Heikki

Attachments:

xlogreader-heikki-1.patch
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..769ddea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -541,6 +542,8 @@ static uint32 readOff = 0;
 static uint32 readLen = 0;
 static int	readSource = 0;		/* XLOG_FROM_* code */
 
+static bool fetching_ckpt_global;
+
 /*
  * Keeps track of which sources we've tried to read the current WAL
  * record from and failed.
@@ -556,13 +559,6 @@ static int	failedSources = 0;	/* OR of XLOG_FROM_* codes */
 static TimestampTz XLogReceiptTime = 0;
 static int	XLogReceiptSource = 0;		/* XLOG_FROM_* code */
 
-/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
-static char *readBuf = NULL;
-
-/* Buffer for current ReadRecord result (expandable) */
-static char *readRecordBuf = NULL;
-static uint32 readRecordBufSize = 0;
-
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
@@ -632,9 +628,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
-static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
-			 bool randAccess);
-static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+			 int emode, bool randAccess, char *readBuf, void *private_data);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
 					const char *recovername, off_t expectedSize);
@@ -646,12 +641,10 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
-static void CheckRecoveryConsistency(void);
+static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt);
+static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
 static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
-static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
-					  int emode, bool randAccess);
-static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
+static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
 static bool rescanLatestTimeLine(void);
@@ -3703,102 +3696,6 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 }
 
 /*
- * CRC-check an XLOG record.  We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
-static bool
-RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
-{
-	pg_crc32	crc;
-	int			i;
-	uint32		len = record->xl_len;
-	BkpBlock	bkpb;
-	char	   *blk;
-	size_t		remaining = record->xl_tot_len;
-
-	/* First the rmgr data */
-	if (remaining < SizeOfXLogRecord + len)
-	{
-		/* ValidXLogRecordHeader() should've caught this already... */
-		ereport(emode_for_corrupt_record(emode, recptr),
-				(errmsg("invalid record length at %X/%X",
-						(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-	remaining -= SizeOfXLogRecord + len;
-	INIT_CRC32(crc);
-	COMP_CRC32(crc, XLogRecGetData(record), len);
-
-	/* Add in the backup blocks, if any */
-	blk = (char *) XLogRecGetData(record) + len;
-	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
-	{
-		uint32		blen;
-
-		if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
-			continue;
-
-		if (remaining < sizeof(BkpBlock))
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("invalid backup block size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		memcpy(&bkpb, blk, sizeof(BkpBlock));
-
-		if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("incorrect hole size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
-		if (remaining < blen)
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("invalid backup block size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		remaining -= blen;
-		COMP_CRC32(crc, blk, blen);
-		blk += blen;
-	}
-
-	/* Check that xl_tot_len agrees with our calculation */
-	if (remaining != 0)
-	{
-		ereport(emode_for_corrupt_record(emode, recptr),
-				(errmsg("incorrect total length in record at %X/%X",
-						(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-
-	/* Finally include the record header */
-	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
-	FIN_CRC32(crc);
-
-	if (!EQ_CRC32(record->xl_crc, crc))
-	{
-		ereport(emode_for_corrupt_record(emode, recptr),
-		(errmsg("incorrect resource manager data checksum in record at %X/%X",
-				(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-
-	return true;
-}
-
-/*
  * Attempt to read an XLOG record.
  *
  * If RecPtr is not NULL, try to read a record at that position.  Otherwise
@@ -3811,290 +3708,35 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
  * the returned record pointer always points there.
  */
 static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt)
 {
 	XLogRecord *record;
-	XLogRecPtr	tmpRecPtr = EndRecPtr;
-	bool		randAccess = false;
-	uint32		len,
-				total_len;
-	uint32		targetRecOff;
-	uint32		pageHeaderSize;
-	bool		gotheader;
-
-	if (readBuf == NULL)
-	{
-		/*
-		 * First time through, permanently allocate readBuf.  We do it this
-		 * way, rather than just making a static array, for two reasons: (1)
-		 * no need to waste the storage in most instantiations of the backend;
-		 * (2) a static char array isn't guaranteed to have any particular
-		 * alignment, whereas malloc() will provide MAXALIGN'd storage.
-		 */
-		readBuf = (char *) malloc(XLOG_BLCKSZ);
-		Assert(readBuf != NULL);
-	}
-
-	if (RecPtr == NULL)
-	{
-		RecPtr = &tmpRecPtr;
 
-		/*
-		 * RecPtr is pointing to end+1 of the previous WAL record.  If
-		 * we're at a page boundary, no more records can fit on the current
-		 * page. We must skip over the page header, but we can't do that
-		 * until we've read in the page, since the header size is variable.
-		 */
-	}
-	else
-	{
-		/*
-		 * In this case, the passed-in record pointer should already be
-		 * pointing to a valid record starting position.
-		 */
-		if (!XRecOffIsValid(*RecPtr))
-			ereport(PANIC,
-					(errmsg("invalid record offset at %X/%X",
-							(uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
-
-		/*
-		 * Since we are going to a random position in WAL, forget any prior
-		 * state about what timeline we were in, and allow it to be any
-		 * timeline in expectedTLIs.  We also set a flag to allow curFileTLI
-		 * to go backwards (but we can't reset that variable right here, since
-		 * we might not change files at all).
-		 */
+	if (!XLogRecPtrIsInvalid(RecPtr))
 		lastPageTLI = 0;		/* see comment in ValidXLogPageHeader */
-		randAccess = true;		/* allow curFileTLI to go backwards too */
-	}
+
+	fetching_ckpt_global = fetching_ckpt;
 
 	/* This is the first try to read this page. */
 	failedSources = 0;
-retry:
-	/* Read the page containing the record */
-	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
-		return NULL;
-
-	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
-	if (targetRecOff == 0)
+	do
 	{
-		/*
-		 * At page start, so skip over page header.  The Assert checks that
-		 * we're not scribbling on caller's record pointer; it's OK because we
-		 * can only get here in the continuing-from-prev-record case, since
-		 * XRecOffIsValid rejected the zero-page-offset case otherwise.
-		 */
-		Assert(RecPtr == &tmpRecPtr);
-		(*RecPtr) += pageHeaderSize;
-		targetRecOff = pageHeaderSize;
-	}
-	else if (targetRecOff < pageHeaderSize)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid record offset at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		goto next_record_is_invalid;
-	}
-	if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
-		targetRecOff == pageHeaderSize)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("contrecord is requested by %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		goto next_record_is_invalid;
-	}
-
-	/*
-	 * Read the record length.
-	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of
-	 * the struct, so it must be on this page (the records are MAXALIGNed),
-	 * but we cannot access any other fields until we've verified that we
-	 * got the whole header.
-	 */
-	record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
-	 * check is necessary here to ensure that we enter the "Need to reassemble
-	 * record" code path below; otherwise we might fail to apply
-	 * ValidXLogRecordHeader at all.
-	 */
-	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
-	{
-		if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
-			goto next_record_is_invalid;
-		gotheader = true;
-	}
-	else
-	{
-		if (total_len < SizeOfXLogRecord)
+		record = XLogReadRecord(xlogreader, RecPtr, emode);
+		ReadRecPtr = xlogreader->ReadRecPtr;
+		EndRecPtr = xlogreader->EndRecPtr;
+		if (record == NULL)
 		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("invalid record length at %X/%X",
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			goto next_record_is_invalid;
-		}
-		gotheader = false;
-	}
-
-	/*
-	 * Allocate or enlarge readRecordBuf as needed.  To avoid useless small
-	 * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
-	 * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with.  (That is
-	 * enough for all "normal" records, but very large commit or abort records
-	 * might need more space.)
-	 */
-	if (total_len > readRecordBufSize)
-	{
-		uint32		newSize = total_len;
+			failedSources |= readSource;
 
-		newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
-		newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
-		if (readRecordBuf)
-			free(readRecordBuf);
-		readRecordBuf = (char *) malloc(newSize);
-		if (!readRecordBuf)
-		{
-			readRecordBufSize = 0;
-			/* We treat this as a "bogus data" condition */
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record length %u at %X/%X too long",
-							total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			goto next_record_is_invalid;
-		}
-		readRecordBufSize = newSize;
-	}
-
-	len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
-	if (total_len > len)
-	{
-		/* Need to reassemble record */
-		char	   *contrecord;
-		XLogPageHeader pageHeader;
-		XLogRecPtr	pagelsn;
-		char	   *buffer;
-		uint32		gotlen;
-
-		/* Initialize pagelsn to the beginning of the page this record is on */
-		pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-
-		/* Copy the first fragment of the record from the first page. */
-		memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
-		buffer = readRecordBuf + len;
-		gotlen = len;
-
-		do
-		{
-			/* Calculate pointer to beginning of next page */
-			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
-			/* Wait for the next page to become available */
-			if (!XLogPageRead(&pagelsn, emode, false, false))
-				return NULL;
-
-			/* Check that the continuation on next page looks valid */
-			pageHeader = (XLogPageHeader) readBuf;
-			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
-			{
-				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("there is no contrecord flag in log segment %s, offset %u",
-								XLogFileNameP(curFileTLI, readSegNo),
-								readOff)));
-				goto next_record_is_invalid;
-			}
-			/*
-			 * Cross-check that xlp_rem_len agrees with how much of the record
-			 * we expect there to be left.
-			 */
-			if (pageHeader->xlp_rem_len == 0 ||
-				total_len != (pageHeader->xlp_rem_len + gotlen))
+			if (readFile >= 0)
 			{
-				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("invalid contrecord length %u in log segment %s, offset %u",
-								pageHeader->xlp_rem_len,
-								XLogFileNameP(curFileTLI, readSegNo),
-								readOff)));
-				goto next_record_is_invalid;
+				close(readFile);
+				readFile = -1;
 			}
+		}
+	} while (StandbyMode && record == NULL);
 
-			/* Append the continuation from this page to the buffer */
-			pageHeaderSize = XLogPageHeaderSize(pageHeader);
-			contrecord = (char *) readBuf + pageHeaderSize;
-			len = XLOG_BLCKSZ - pageHeaderSize;
-			if (pageHeader->xlp_rem_len < len)
-				len = pageHeader->xlp_rem_len;
-			memcpy(buffer, (char *) contrecord, len);
-			buffer += len;
-			gotlen += len;
-
-			/* If we just reassembled the record header, validate it. */
-			if (!gotheader)
-			{
-				record = (XLogRecord *) readRecordBuf;
-				if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
-					goto next_record_is_invalid;
-				gotheader = true;
-			}
-		} while (pageHeader->xlp_rem_len > len);
-
-		record = (XLogRecord *) readRecordBuf;
-		if (!RecordIsValid(record, *RecPtr, emode))
-			goto next_record_is_invalid;
-		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-		XLogSegNoOffsetToRecPtr(
-			readSegNo,
-			readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
-			EndRecPtr);
-		ReadRecPtr = *RecPtr;
-	}
-	else
-	{
-		/* Record does not cross a page boundary */
-		if (!RecordIsValid(record, *RecPtr, emode))
-			goto next_record_is_invalid;
-		EndRecPtr = *RecPtr + MAXALIGN(total_len);
-
-		ReadRecPtr = *RecPtr;
-		memcpy(readRecordBuf, record, total_len);
-	}
-
-	/*
-	 * Special processing if it's an XLOG SWITCH record
-	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
-	{
-		/* Pretend it extends to end of segment */
-		EndRecPtr += XLogSegSize - 1;
-		EndRecPtr -= EndRecPtr % XLogSegSize;
-
-		/*
-		 * Pretend that readBuf contains the last page of the segment. This is
-		 * just to avoid Assert failure in StartupXLOG if XLOG ends with this
-		 * segment.
-		 */
-		readOff = XLogSegSize - XLOG_BLCKSZ;
-	}
 	return record;
-
-next_record_is_invalid:
-	failedSources |= readSource;
-
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-
-	/* In standby-mode, keep trying */
-	if (StandbyMode)
-		goto retry;
-	else
-		return NULL;
 }
 
 /*
@@ -4223,88 +3865,6 @@ ValidXLogPageHeader(XLogPageHeader hdr, int emode)
 }
 
 /*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord.	It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
-					  bool randAccess)
-{
-	/*
-	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
-	 * required.
-	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
-	{
-		if (record->xl_len != 0)
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("invalid xlog switch record at %X/%X",
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-	else if (record->xl_len == 0)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("record with zero length at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
-		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
-		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid record length at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (record->xl_rmid > RM_MAX_ID)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid resource manager ID %u at %X/%X",
-						record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (randAccess)
-	{
-		/*
-		 * We can't exactly verify the prev-link, but surely it should be less
-		 * than the record's own address.
-		 */
-		if (!XLByteLT(record->xl_prev, *RecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-	else
-	{
-		/*
-		 * Record's prev-link should exactly match our previous location. This
-		 * check guards against torn WAL pages where a stale but valid-looking
-		 * WAL record starts on a sector boundary.
-		 */
-		if (!XLByteEQ(record->xl_prev, ReadRecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-
-	return true;
-}
-
-/*
  * Try to read a timeline's history file.
  *
  * If successful, return the list of component TLIs (the given TLI followed by
@@ -6089,6 +5649,7 @@ StartupXLOG(void)
 	bool		backupEndRequired = false;
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
+	XLogReaderState *xlogreader;
 
 	/*
 	 * Read control file and check XLOG status looks valid.
@@ -6222,6 +5783,8 @@ StartupXLOG(void)
 	if (StandbyMode)
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+	xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, NULL);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6229,7 +5792,7 @@ StartupXLOG(void)
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(checkPointLoc, 0);
+		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
@@ -6247,7 +5810,7 @@ StartupXLOG(void)
 			 */
 			if (XLByteLT(checkPoint.redo, checkPointLoc))
 			{
-				if (!ReadRecord(&(checkPoint.redo), LOG, false))
+				if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
 							 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
@@ -6271,7 +5834,7 @@ StartupXLOG(void)
 		 */
 		checkPointLoc = ControlFile->checkPoint;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
-		record = ReadCheckpointRecord(checkPointLoc, 1);
+		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
 		if (record != NULL)
 		{
 			ereport(DEBUG1,
@@ -6290,7 +5853,7 @@ StartupXLOG(void)
 		else
 		{
 			checkPointLoc = ControlFile->prevCheckPoint;
-			record = ReadCheckpointRecord(checkPointLoc, 2);
+			record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
 			if (record != NULL)
 			{
 				ereport(LOG,
@@ -6591,7 +6154,7 @@ StartupXLOG(void)
 		 * Allow read-only connections immediately if we're consistent
 		 * already.
 		 */
-		CheckRecoveryConsistency();
+		CheckRecoveryConsistency(EndRecPtr);
 
 		/*
 		 * Find the first record that logically follows the checkpoint --- it
@@ -6600,12 +6163,12 @@ StartupXLOG(void)
 		if (XLByteLT(checkPoint.redo, RecPtr))
 		{
 			/* back up to find the record */
-			record = ReadRecord(&(checkPoint.redo), PANIC, false);
+			record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
 		}
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(NULL, LOG, false);
+			record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 		}
 
 		if (record != NULL)
@@ -6652,7 +6215,7 @@ StartupXLOG(void)
 				HandleStartupProcInterrupts();
 
 				/* Allow read-only connections if we're consistent now */
-				CheckRecoveryConsistency();
+				CheckRecoveryConsistency(EndRecPtr);
 
 				/*
 				 * Have we reached our recovery target?
@@ -6756,7 +6319,7 @@ StartupXLOG(void)
 
 				LastRec = ReadRecPtr;
 
-				record = ReadRecord(NULL, LOG, false);
+				record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 			} while (record != NULL && recoveryContinue);
 
 			/*
@@ -6806,7 +6369,7 @@ StartupXLOG(void)
 	 * Re-fetch the last valid or last applied record, so we can identify the
 	 * exact endpoint of what we consider the valid portion of WAL.
 	 */
-	record = ReadRecord(&LastRec, PANIC, false);
+	record = ReadRecord(xlogreader, LastRec, PANIC, false);
 	EndOfLog = EndRecPtr;
 	XLByteToPrevSeg(EndOfLog, endLogSegNo);
 
@@ -6905,8 +6468,15 @@ StartupXLOG(void)
 	 * record spans, not the one it starts in.	The last block is indeed the
 	 * one we want to use.
 	 */
-	Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
-	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
+	if (EndOfLog % XLOG_BLCKSZ == 0)
+	{
+		memset(Insert->currpage, 0, XLOG_BLCKSZ);
+	}
+	else
+	{
+		Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
+		memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
+	}
 	Insert->currpos = (char *) Insert->currpage +
 		(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
 
@@ -7063,17 +6633,7 @@ StartupXLOG(void)
 		close(readFile);
 		readFile = -1;
 	}
-	if (readBuf)
-	{
-		free(readBuf);
-		readBuf = NULL;
-	}
-	if (readRecordBuf)
-	{
-		free(readRecordBuf);
-		readRecordBuf = NULL;
-		readRecordBufSize = 0;
-	}
+	XLogReaderFree(xlogreader);
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -7104,7 +6664,7 @@ StartupXLOG(void)
  * that it can start accepting read-only connections.
  */
 static void
-CheckRecoveryConsistency(void)
+CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
 {
 	/*
 	 * During crash recovery, we don't reach a consistent state until we've
@@ -7284,7 +6844,7 @@ LocalSetXLogInsertAllowed(void)
  * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
+ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
 {
 	XLogRecord *record;
 
@@ -7308,7 +6868,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
 		return NULL;
 	}
 
-	record = ReadRecord(&RecPtr, LOG, true);
+	record = ReadRecord(xlogreader, RecPtr, LOG, true);
 
 	if (record == NULL)
 	{
@@ -10100,19 +9660,21 @@ CancelBackup(void)
  * sleep and retry.
  */
 static bool
-XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
-			 bool randAccess)
+XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
+			 bool randAccess, char *readBuf, void *private_data)
 {
+	/* TODO: these, and fetching_ckpt, would be better in private_data */
 	static XLogRecPtr receivedUpto = 0;
+	static pg_time_t last_fail_time = 0;
+	bool		fetching_ckpt = fetching_ckpt_global;
 	bool		switched_segment = false;
 	uint32		targetPageOff;
 	uint32		targetRecOff;
 	XLogSegNo	targetSegNo;
-	static pg_time_t last_fail_time = 0;
 
-	XLByteToSeg(*RecPtr, targetSegNo);
-	targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
+	XLByteToSeg(RecPtr, targetSegNo);
+	targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+	targetRecOff = RecPtr % XLOG_BLCKSZ;
 
 	/* Fast exit if we have read the record in the current buffer already */
 	if (failedSources == 0 && targetSegNo == readSegNo &&
@@ -10123,7 +9685,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
+	if (readFile >= 0 && !XLByteInSeg(RecPtr, readSegNo))
 	{
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
@@ -10144,12 +9706,12 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 		readSource = 0;
 	}
 
-	XLByteToSeg(*RecPtr, readSegNo);
+	XLByteToSeg(RecPtr, readSegNo);
 
 retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
-		(readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
+		(readSource == XLOG_FROM_STREAM && !XLByteLT(RecPtr, receivedUpto)))
 	{
 		if (StandbyMode)
 		{
@@ -10192,17 +9754,17 @@ retry:
 					 * XLogReceiptTime will not advance, so the grace time
 					 * alloted to conflicting queries will decrease.
 					 */
-					if (XLByteLT(*RecPtr, receivedUpto))
+					if (XLByteLT(RecPtr, receivedUpto))
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
 						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
-						if (XLByteLT(*RecPtr, receivedUpto))
+						if (XLByteLT(RecPtr, receivedUpto))
 						{
 							havedata = true;
-							if (!XLByteLT(*RecPtr, latestChunkStart))
+							if (!XLByteLT(RecPtr, latestChunkStart))
 							{
 								XLogReceiptTime = GetCurrentTimestamp();
 								SetCurrentChunkStartTime(XLogReceiptTime);
@@ -10321,7 +9883,7 @@ retry:
 						if (PrimaryConnInfo)
 						{
 							RequestXLogStreaming(
-									  fetching_ckpt ? RedoStartLSN : *RecPtr,
+									  fetching_ckpt ? RedoStartLSN : RecPtr,
 												 PrimaryConnInfo);
 							continue;
 						}
@@ -10393,7 +9955,7 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
 		{
 			readLen = XLOG_BLCKSZ;
 		}
@@ -10417,7 +9979,7 @@ retry:
 		{
 			char fname[MAXFNAMELEN];
 			XLogFileName(fname, curFileTLI, readSegNo);
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
+			ereport(emode_for_corrupt_record(emode, RecPtr),
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u: %m",
 							fname, readOff)));
@@ -10433,7 +9995,7 @@ retry:
 	{
 		char fname[MAXFNAMELEN];
 		XLogFileName(fname, curFileTLI, readSegNo);
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
+		ereport(emode_for_corrupt_record(emode, RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not seek in log segment %s to offset %u: %m",
 				fname, readOff)));
@@ -10443,7 +10005,7 @@ retry:
 	{
 		char fname[MAXFNAMELEN];
 		XLogFileName(fname, curFileTLI, readSegNo);
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
+		ereport(emode_for_corrupt_record(emode, RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not read from log segment %s, offset %u: %m",
 				fname, readOff)));
@@ -10501,7 +10063,7 @@ triggered:
  * you are about to ereport(), or you might cause a later message to be
  * erroneously suppressed.
  */
-static int
+int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..8ba05b1
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,496 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ *		Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ *		Documentation about how to use this interface can be found in
+ *		xlogreader.h, more specifically in the definition of the
+ *		XLogReaderState struct, where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "catalog/pg_control.h"
+
+static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+					  XLogRecord *record, int emode, bool randAccess);
+static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+/*
+ * Initialize a new xlog reader
+ */
+XLogReaderState *
+XLogReaderAllocate(XLogRecPtr startpoint,
+				   XLogPageReadCB pagereadfunc, void *private_data)
+{
+	XLogReaderState *state;
+
+	state = (XLogReaderState *) palloc0(sizeof(XLogReaderState));
+
+	/*
+	 * First time through, permanently allocate readBuf.  We do it this
+	 * way, rather than just making a static array, for two reasons: (1)
+	 * no need to waste the storage in most instantiations of the backend;
+	 * (2) a static char array isn't guaranteed to have any particular
+	 * alignment, whereas malloc() will provide MAXALIGN'd storage.
+	 */
+	state->readBuf = (char *) palloc(XLOG_BLCKSZ);
+
+	state->read_page = pagereadfunc;
+	state->private_data = private_data;
+	state->EndRecPtr = startpoint;
+
+	return state;
+}
+
+void
+XLogReaderFree(XLogReaderState *state)
+{
+	if (state->readRecordBuf)
+		pfree(state->readRecordBuf);
+	pfree(state->readBuf);
+	pfree(state);
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not NULL, try to read a record at that position.  Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC, LOG)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+XLogRecord *
+XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+{
+	XLogRecord *record;
+	XLogRecPtr	tmpRecPtr = state->EndRecPtr;
+	bool		randAccess = false;
+	uint32		len,
+				total_len;
+	uint32		targetRecOff;
+	uint32		pageHeaderSize;
+	bool		gotheader;
+
+	if (RecPtr == InvalidXLogRecPtr)
+	{
+		RecPtr = tmpRecPtr;
+
+		/*
+		 * RecPtr is pointing to end+1 of the previous WAL record.  If
+		 * we're at a page boundary, no more records can fit on the current
+		 * page. We must skip over the page header, but we can't do that
+		 * until we've read in the page, since the header size is variable.
+		 */
+	}
+	else
+	{
+		/*
+		 * In this case, the passed-in record pointer should already be
+		 * pointing to a valid record starting position.
+		 */
+		if (!XRecOffIsValid(RecPtr))
+			ereport(PANIC,
+					(errmsg("invalid record offset at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		randAccess = true;		/* allow curFileTLI to go backwards too */
+	}
+
+	/* Read the page containing the record */
+	if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf, state->private_data))
+		return NULL;
+
+	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+	targetRecOff = RecPtr % XLOG_BLCKSZ;
+	if (targetRecOff == 0)
+	{
+		/*
+		 * At page start, so skip over page header.  The Assert checks that
+		 * we're not scribbling on caller's record pointer; it's OK because we
+		 * can only get here in the continuing-from-prev-record case, since
+		 * XRecOffIsValid rejected the zero-page-offset case otherwise.
+		 * XXX: does this assert make sense now that RecPtr is not a pointer?
+		 */
+		Assert(RecPtr == tmpRecPtr);
+		RecPtr += pageHeaderSize;
+		targetRecOff = pageHeaderSize;
+	}
+	else if (targetRecOff < pageHeaderSize)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid record offset at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		goto next_record_is_invalid;
+	}
+	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+		targetRecOff == pageHeaderSize)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("contrecord is requested by %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		goto next_record_is_invalid;
+	}
+
+	/*
+	 * Read the record length.
+	 *
+	 * NB: Even though we use an XLogRecord pointer here, the whole record
+	 * header might not fit on this page. xl_tot_len is the first field of
+	 * the struct, so it must be on this page (the records are MAXALIGNed),
+	 * but we cannot access any other fields until we've verified that we
+	 * got the whole header.
+	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+	total_len = record->xl_tot_len;
+
+	/*
+	 * If the whole record header is on this page, validate it immediately.
+	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * check is necessary here to ensure that we enter the "Need to reassemble
+	 * record" code path below; otherwise we might fail to apply
+	 * ValidXLogRecordHeader at all.
+	 */
+	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+	{
+		if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+			goto next_record_is_invalid;
+		gotheader = true;
+	}
+	else
+	{
+		if (total_len < SizeOfXLogRecord)
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("invalid record length at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			goto next_record_is_invalid;
+		}
+		gotheader = false;
+	}
+
+	/*
+	 * Allocate or enlarge readRecordBuf as needed.  To avoid useless small
+	 * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
+	 * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with.  (That is
+	 * enough for all "normal" records, but very large commit or abort records
+	 * might need more space.)
+	 */
+	if (total_len > state->readRecordBufSize)
+	{
+		uint32		newSize = total_len;
+
+		newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+		newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+		if (state->readRecordBuf)
+			pfree(state->readRecordBuf);
+		state->readRecordBuf = (char *) palloc(newSize);
+		if (!state->readRecordBuf)
+		{
+			state->readRecordBufSize = 0;
+			/* We treat this as a "bogus data" condition */
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record length %u at %X/%X too long",
+							total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			goto next_record_is_invalid;
+		}
+		state->readRecordBufSize = newSize;
+	}
+
+	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+	if (total_len > len)
+	{
+		/* Need to reassemble record */
+		char	   *contrecord;
+		XLogPageHeader pageHeader;
+		XLogRecPtr	pagelsn;
+		char	   *buffer;
+		uint32		gotlen;
+
+		/* Initialize pagelsn to the beginning of the page this record is on */
+		pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+		/* Copy the first fragment of the record from the first page. */
+		memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+		buffer = state->readRecordBuf + len;
+		gotlen = len;
+
+		do
+		{
+			/* Calculate pointer to beginning of next page */
+			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+			/* Wait for the next page to become available */
+			if (!state->read_page(state, pagelsn, emode, false, state->readBuf, NULL))
+				return NULL;
+
+			/* Check that the continuation on next page looks valid */
+			pageHeader = (XLogPageHeader) state->readBuf;
+			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+			{
+				ereport(emode_for_corrupt_record(emode, RecPtr),
+						(errmsg("there is no contrecord flag at %X/%X",
+								(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+				goto next_record_is_invalid;
+			}
+			/*
+			 * Cross-check that xlp_rem_len agrees with how much of the record
+			 * we expect there to be left.
+			 */
+			if (pageHeader->xlp_rem_len == 0 ||
+				total_len != (pageHeader->xlp_rem_len + gotlen))
+			{
+				ereport(emode_for_corrupt_record(emode, RecPtr),
+						(errmsg("invalid contrecord length %u at %X/%X",
+								pageHeader->xlp_rem_len,
+								(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+				goto next_record_is_invalid;
+			}
+
+			/* Append the continuation from this page to the buffer */
+			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+			contrecord = (char *) state->readBuf + pageHeaderSize;
+			len = XLOG_BLCKSZ - pageHeaderSize;
+			if (pageHeader->xlp_rem_len < len)
+				len = pageHeader->xlp_rem_len;
+			memcpy(buffer, (char *) contrecord, len);
+			buffer += len;
+			gotlen += len;
+
+			/* If we just reassembled the record header, validate it. */
+			if (!gotheader)
+			{
+				record = (XLogRecord *) state->readRecordBuf;
+				if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+					goto next_record_is_invalid;
+				gotheader = true;
+			}
+		} while (pageHeader->xlp_rem_len > len);
+
+		record = (XLogRecord *) state->readRecordBuf;
+		if (!RecordIsValid(record, RecPtr, emode))
+			goto next_record_is_invalid;
+		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+		state->ReadRecPtr = RecPtr;
+		state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+	}
+	else
+	{
+		/* Record does not cross a page boundary */
+		if (!RecordIsValid(record, RecPtr, emode))
+			goto next_record_is_invalid;
+		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+		state->ReadRecPtr = RecPtr;
+		memcpy(state->readRecordBuf, record, total_len);
+	}
+
+	/*
+	 * Special processing if it's an XLOG SWITCH record
+	 */
+	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	{
+		/* Pretend it extends to end of segment */
+		state->EndRecPtr += XLogSegSize - 1;
+		state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+	}
+	return record;
+
+next_record_is_invalid:
+	return NULL;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * ReadRecord.	It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, int emode,
+					  bool randAccess)
+{
+	/*
+	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+	 * required.
+	 */
+	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	{
+		if (record->xl_len != 0)
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("invalid xlog switch record at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+	else if (record->xl_len == 0)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("record with zero length at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid record length at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (record->xl_rmid > RM_MAX_ID)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid resource manager ID %u at %X/%X",
+						record->xl_rmid, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (randAccess)
+	{
+		/*
+		 * We can't exactly verify the prev-link, but surely it should be less
+		 * than the record's own address.
+		 */
+		if (!XLByteLT(record->xl_prev, RecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+	else
+	{
+		/*
+		 * Record's prev-link should exactly match our previous location. This
+		 * check guards against torn WAL pages where a stale but valid-looking
+		 * WAL record starts on a sector boundary.
+		 */
+		if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+
+	return true;
+}
+
+
+/*
+ * CRC-check an XLOG record.  We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXLogRecord, so it is safe to fetch xl_len.
+ */
+static bool
+RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+{
+	pg_crc32	crc;
+	int			i;
+	uint32		len = record->xl_len;
+	BkpBlock	bkpb;
+	char	   *blk;
+	size_t		remaining = record->xl_tot_len;
+
+	/* First the rmgr data */
+	if (remaining < SizeOfXLogRecord + len)
+	{
+		/* ValidXLogRecordHeader() should've caught this already... */
+		ereport(emode_for_corrupt_record(emode, recptr),
+				(errmsg("invalid record length at %X/%X",
+						(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+	remaining -= SizeOfXLogRecord + len;
+	INIT_CRC32(crc);
+	COMP_CRC32(crc, XLogRecGetData(record), len);
+
+	/* Add in the backup blocks, if any */
+	blk = (char *) XLogRecGetData(record) + len;
+	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+	{
+		uint32		blen;
+
+		if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+			continue;
+
+		if (remaining < sizeof(BkpBlock))
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("invalid backup block size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+		if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("incorrect hole size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+		if (remaining < blen)
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("invalid backup block size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		remaining -= blen;
+		COMP_CRC32(crc, blk, blen);
+		blk += blen;
+	}
+
+	/* Check that xl_tot_len agrees with our calculation */
+	if (remaining != 0)
+	{
+		ereport(emode_for_corrupt_record(emode, recptr),
+				(errmsg("incorrect total length in record at %X/%X",
+						(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+
+	/* Finally include the record header */
+	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+	FIN_CRC32(crc);
+
+	if (!EQ_CRC32(record->xl_crc, crc))
+	{
+		ereport(emode_for_corrupt_record(emode, recptr),
+		(errmsg("incorrect resource manager data checksum in record at %X/%X",
+				(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index b5bfb7b..1ada664 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -229,6 +229,14 @@ extern const RmgrData RmgrTable[];
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
 
+
+/*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+extern int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+
 /*
  * These aren't in xlog.h because I'd rather not include fmgr.h there.
  */
#18Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#17)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Hi Heikki,

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that
wants to do so is not tightly integrated into xlog.c is rather hard and
would require changes to rather integral parts of the recovery code, which
doesn't seem to be a good idea.

My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog.

Yes, I agree. And I am willing to provide an implementation of this, should
my xlogreader variant get a bit more buy-in.

2. We should focus on reading WAL; I don't see the point of mixing WAL
writing into this.

If you write something that filters/analyzes and then forwards WAL, and you
want to do that without a big overhead (i.e. completely reassembling
everything and then disassembling it again for writeout), it's hard to do
that without integrating both sides.

Also, I want to read records incrementally/partially as the data comes in,
which again is hard to combine with writing out the data again.

3. I don't like the callback-style API.

I tried to accommodate that by providing:

extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);

which does exactly that.
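
For illustration, a pull-style consumer of that call might look roughly like
this (a sketch: process_record() is a placeholder, XLogRecordBuffer's
contents are left opaque, and the loop assumes XLogReaderReadOne() returns
NULL once no further complete record can be assembled from the input
supplied so far):

XLogRecordBuffer *buf;

/* Pull reassembled records one at a time until the input runs dry. */
while ((buf = XLogReaderReadOne(state)) != NULL)
	process_record(buf);	/* placeholder for the filter/apply logic */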

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before, I didn't
like the callback-style API; I think the consumer of the API should rather
just call ReadRecord repeatedly to get each record. So that's what I did.

The problem with that kind of API is that, at least as far as I can see, it
can never operate on incomplete/partial input. You need to buffer larger
amounts of xlog somewhere, and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.

There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.

Makes sense.

So the usage is like this:

/*
 * Callback to read the page starting at 'RecPtr' into *readBuf. It's
 * up to you to do this any way you like. Typically you'd read from a
 * file. The WAL recovery implementation of this in xlog.c is more
 * complicated. It checks the archive, waits for streaming replication
 * etc.
 */
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
			   char *readBuf, void *private_data)
{
	...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
	/* do something with the record */
}
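
To make the callback side concrete, here is a minimal file-backed sketch.
The six-argument signature follows the read_page callback actually used in
the patches in this thread; MyReadState and the pre-opened, page-aligned
file descriptor are assumptions invented for this example, and real code
would also have to handle segment switches, timelines and short reads:

#include <unistd.h>			/* for pread() */

typedef struct
{
	int			fd;			/* pre-opened file covering the WAL range */
	XLogRecPtr	file_start;	/* LSN of the file's first byte, page-aligned */
} MyReadState;

static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
			   bool randAccess, char *readBuf, void *private_data)
{
	MyReadState *rs = (MyReadState *) private_data;
	off_t		pageoff;

	/* Round the requested LSN down to the start of its xlog page. */
	pageoff = (off_t) ((RecPtr - rs->file_start) - (RecPtr % XLOG_BLCKSZ));

	/* Read exactly one page; on failure let the reader complain. */
	if (pread(rs->fd, readBuf, XLOG_BLCKSZ, pageoff) != XLOG_BLCKSZ)
		return false;

	return true;
}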

If you don't want the capability to forward/filter the data and to read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It's missing xlogreader.h though...

Do my aims make any sense to you?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#18)
1 attachment(s)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL; I don't see the point of mixing WAL
writing into this.

If you write something that filters/analyzes and then forwards WAL, and you
want to do that without a big overhead (i.e. completely reassembling
everything and then disassembling it again for writeout), it's hard to do
that without integrating both sides.

It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of reading
the physical data, so you can still access the raw data for forwarding
purposes, in addition to the reassembled records.

Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, i.e. if a record is split across pages,
copying the parts into a contiguous temporary buffer.

Also, I want to read records incrementally/partially as the data comes in,
which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record before the
2nd half is available? That seems complicated. I'd suggest keeping it simple
for now and optimizing later if necessary. Note that before you have the
whole WAL record, you cannot CRC-check it, so you don't know whether it's in
fact a valid WAL record.

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before, I didn't
like the callback-style API; I think the consumer of the API should rather
just call ReadRecord repeatedly to get each record. So that's what I did.

The problem with that kind of API is that, at least as far as I can see, it
can never operate on incomplete/partial input. You need to buffer larger
amounts of xlog somewhere, and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.

I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most for an index page split record. That
doesn't feel like much to me. In extreme cases, a WAL record can be much
larger (e.g. a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.
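
As a back-of-the-envelope check on those numbers, a tiny standalone sketch
(assuming 8 kB xlog pages and ignoring the per-page headers, which only make
records span slightly more bytes):

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ 8192

/* Number of xlog pages touched by a record of xl_tot_len bytes
 * starting at LSN recptr. */
static uint32_t
pages_spanned(uint64_t recptr, uint32_t xl_tot_len)
{
	uint64_t	first = recptr / XLOG_BLCKSZ;
	uint64_t	last = (recptr + xl_tot_len - 1) / XLOG_BLCKSZ;

	return (uint32_t) (last - first + 1);
}

int
main(void)
{
	/* a 6 kB record starting 5 kB into a page crosses one boundary */
	printf("%u\n", pages_spanned(5 * 1024, 6 * 1024));	/* prints 2 */
	/* a 1 MB commit record spans many pages */
	printf("%u\n", pages_spanned(0, 1024 * 1024));		/* prints 128 */
	return 0;
}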

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

If you don't want the capability to forward/filter the data and to read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It's missing xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

- Heikki

Attachments:

xlogreader-heikki-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..769ddea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -541,6 +542,8 @@ static uint32 readOff = 0;
 static uint32 readLen = 0;
 static int	readSource = 0;		/* XLOG_FROM_* code */
 
+static bool fetching_ckpt_global;
+
 /*
  * Keeps track of which sources we've tried to read the current WAL
  * record from and failed.
@@ -556,13 +559,6 @@ static int	failedSources = 0;	/* OR of XLOG_FROM_* codes */
 static TimestampTz XLogReceiptTime = 0;
 static int	XLogReceiptSource = 0;		/* XLOG_FROM_* code */
 
-/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
-static char *readBuf = NULL;
-
-/* Buffer for current ReadRecord result (expandable) */
-static char *readRecordBuf = NULL;
-static uint32 readRecordBufSize = 0;
-
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
@@ -632,9 +628,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
-static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
-			 bool randAccess);
-static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+			 int emode, bool randAccess, char *readBuf, void *private_data);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
 					const char *recovername, off_t expectedSize);
@@ -646,12 +641,10 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
-static void CheckRecoveryConsistency(void);
+static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt);
+static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
 static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
-static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
-					  int emode, bool randAccess);
-static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
+static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
 static bool rescanLatestTimeLine(void);
@@ -3703,102 +3696,6 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 }
 
 /*
- * CRC-check an XLOG record.  We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
-static bool
-RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
-{
-	pg_crc32	crc;
-	int			i;
-	uint32		len = record->xl_len;
-	BkpBlock	bkpb;
-	char	   *blk;
-	size_t		remaining = record->xl_tot_len;
-
-	/* First the rmgr data */
-	if (remaining < SizeOfXLogRecord + len)
-	{
-		/* ValidXLogRecordHeader() should've caught this already... */
-		ereport(emode_for_corrupt_record(emode, recptr),
-				(errmsg("invalid record length at %X/%X",
-						(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-	remaining -= SizeOfXLogRecord + len;
-	INIT_CRC32(crc);
-	COMP_CRC32(crc, XLogRecGetData(record), len);
-
-	/* Add in the backup blocks, if any */
-	blk = (char *) XLogRecGetData(record) + len;
-	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
-	{
-		uint32		blen;
-
-		if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
-			continue;
-
-		if (remaining < sizeof(BkpBlock))
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("invalid backup block size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		memcpy(&bkpb, blk, sizeof(BkpBlock));
-
-		if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("incorrect hole size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
-		if (remaining < blen)
-		{
-			ereport(emode_for_corrupt_record(emode, recptr),
-					(errmsg("invalid backup block size in record at %X/%X",
-							(uint32) (recptr >> 32), (uint32) recptr)));
-			return false;
-		}
-		remaining -= blen;
-		COMP_CRC32(crc, blk, blen);
-		blk += blen;
-	}
-
-	/* Check that xl_tot_len agrees with our calculation */
-	if (remaining != 0)
-	{
-		ereport(emode_for_corrupt_record(emode, recptr),
-				(errmsg("incorrect total length in record at %X/%X",
-						(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-
-	/* Finally include the record header */
-	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
-	FIN_CRC32(crc);
-
-	if (!EQ_CRC32(record->xl_crc, crc))
-	{
-		ereport(emode_for_corrupt_record(emode, recptr),
-		(errmsg("incorrect resource manager data checksum in record at %X/%X",
-				(uint32) (recptr >> 32), (uint32) recptr)));
-		return false;
-	}
-
-	return true;
-}
-
-/*
  * Attempt to read an XLOG record.
  *
  * If RecPtr is not NULL, try to read a record at that position.  Otherwise
@@ -3811,290 +3708,35 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
  * the returned record pointer always points there.
  */
 static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt)
 {
 	XLogRecord *record;
-	XLogRecPtr	tmpRecPtr = EndRecPtr;
-	bool		randAccess = false;
-	uint32		len,
-				total_len;
-	uint32		targetRecOff;
-	uint32		pageHeaderSize;
-	bool		gotheader;
-
-	if (readBuf == NULL)
-	{
-		/*
-		 * First time through, permanently allocate readBuf.  We do it this
-		 * way, rather than just making a static array, for two reasons: (1)
-		 * no need to waste the storage in most instantiations of the backend;
-		 * (2) a static char array isn't guaranteed to have any particular
-		 * alignment, whereas malloc() will provide MAXALIGN'd storage.
-		 */
-		readBuf = (char *) malloc(XLOG_BLCKSZ);
-		Assert(readBuf != NULL);
-	}
-
-	if (RecPtr == NULL)
-	{
-		RecPtr = &tmpRecPtr;
 
-		/*
-		 * RecPtr is pointing to end+1 of the previous WAL record.  If
-		 * we're at a page boundary, no more records can fit on the current
-		 * page. We must skip over the page header, but we can't do that
-		 * until we've read in the page, since the header size is variable.
-		 */
-	}
-	else
-	{
-		/*
-		 * In this case, the passed-in record pointer should already be
-		 * pointing to a valid record starting position.
-		 */
-		if (!XRecOffIsValid(*RecPtr))
-			ereport(PANIC,
-					(errmsg("invalid record offset at %X/%X",
-							(uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
-
-		/*
-		 * Since we are going to a random position in WAL, forget any prior
-		 * state about what timeline we were in, and allow it to be any
-		 * timeline in expectedTLIs.  We also set a flag to allow curFileTLI
-		 * to go backwards (but we can't reset that variable right here, since
-		 * we might not change files at all).
-		 */
+	if (!XLogRecPtrIsInvalid(RecPtr))
 		lastPageTLI = 0;		/* see comment in ValidXLogPageHeader */
-		randAccess = true;		/* allow curFileTLI to go backwards too */
-	}
+
+	fetching_ckpt_global = fetching_ckpt;
 
 	/* This is the first try to read this page. */
 	failedSources = 0;
-retry:
-	/* Read the page containing the record */
-	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
-		return NULL;
-
-	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
-	if (targetRecOff == 0)
+	do
 	{
-		/*
-		 * At page start, so skip over page header.  The Assert checks that
-		 * we're not scribbling on caller's record pointer; it's OK because we
-		 * can only get here in the continuing-from-prev-record case, since
-		 * XRecOffIsValid rejected the zero-page-offset case otherwise.
-		 */
-		Assert(RecPtr == &tmpRecPtr);
-		(*RecPtr) += pageHeaderSize;
-		targetRecOff = pageHeaderSize;
-	}
-	else if (targetRecOff < pageHeaderSize)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid record offset at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		goto next_record_is_invalid;
-	}
-	if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
-		targetRecOff == pageHeaderSize)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("contrecord is requested by %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		goto next_record_is_invalid;
-	}
-
-	/*
-	 * Read the record length.
-	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of
-	 * the struct, so it must be on this page (the records are MAXALIGNed),
-	 * but we cannot access any other fields until we've verified that we
-	 * got the whole header.
-	 */
-	record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
-	 * check is necessary here to ensure that we enter the "Need to reassemble
-	 * record" code path below; otherwise we might fail to apply
-	 * ValidXLogRecordHeader at all.
-	 */
-	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
-	{
-		if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
-			goto next_record_is_invalid;
-		gotheader = true;
-	}
-	else
-	{
-		if (total_len < SizeOfXLogRecord)
+		record = XLogReadRecord(xlogreader, RecPtr, emode);
+		ReadRecPtr = xlogreader->ReadRecPtr;
+		EndRecPtr = xlogreader->EndRecPtr;
+		if (record == NULL)
 		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("invalid record length at %X/%X",
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			goto next_record_is_invalid;
-		}
-		gotheader = false;
-	}
-
-	/*
-	 * Allocate or enlarge readRecordBuf as needed.  To avoid useless small
-	 * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
-	 * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with.  (That is
-	 * enough for all "normal" records, but very large commit or abort records
-	 * might need more space.)
-	 */
-	if (total_len > readRecordBufSize)
-	{
-		uint32		newSize = total_len;
+			failedSources |= readSource;
 
-		newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
-		newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
-		if (readRecordBuf)
-			free(readRecordBuf);
-		readRecordBuf = (char *) malloc(newSize);
-		if (!readRecordBuf)
-		{
-			readRecordBufSize = 0;
-			/* We treat this as a "bogus data" condition */
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record length %u at %X/%X too long",
-							total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			goto next_record_is_invalid;
-		}
-		readRecordBufSize = newSize;
-	}
-
-	len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
-	if (total_len > len)
-	{
-		/* Need to reassemble record */
-		char	   *contrecord;
-		XLogPageHeader pageHeader;
-		XLogRecPtr	pagelsn;
-		char	   *buffer;
-		uint32		gotlen;
-
-		/* Initialize pagelsn to the beginning of the page this record is on */
-		pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-
-		/* Copy the first fragment of the record from the first page. */
-		memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
-		buffer = readRecordBuf + len;
-		gotlen = len;
-
-		do
-		{
-			/* Calculate pointer to beginning of next page */
-			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
-			/* Wait for the next page to become available */
-			if (!XLogPageRead(&pagelsn, emode, false, false))
-				return NULL;
-
-			/* Check that the continuation on next page looks valid */
-			pageHeader = (XLogPageHeader) readBuf;
-			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
-			{
-				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("there is no contrecord flag in log segment %s, offset %u",
-								XLogFileNameP(curFileTLI, readSegNo),
-								readOff)));
-				goto next_record_is_invalid;
-			}
-			/*
-			 * Cross-check that xlp_rem_len agrees with how much of the record
-			 * we expect there to be left.
-			 */
-			if (pageHeader->xlp_rem_len == 0 ||
-				total_len != (pageHeader->xlp_rem_len + gotlen))
+			if (readFile >= 0)
 			{
-				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("invalid contrecord length %u in log segment %s, offset %u",
-								pageHeader->xlp_rem_len,
-								XLogFileNameP(curFileTLI, readSegNo),
-								readOff)));
-				goto next_record_is_invalid;
+				close(readFile);
+				readFile = -1;
 			}
+		}
+	} while (StandbyMode && record == NULL);
 
-			/* Append the continuation from this page to the buffer */
-			pageHeaderSize = XLogPageHeaderSize(pageHeader);
-			contrecord = (char *) readBuf + pageHeaderSize;
-			len = XLOG_BLCKSZ - pageHeaderSize;
-			if (pageHeader->xlp_rem_len < len)
-				len = pageHeader->xlp_rem_len;
-			memcpy(buffer, (char *) contrecord, len);
-			buffer += len;
-			gotlen += len;
-
-			/* If we just reassembled the record header, validate it. */
-			if (!gotheader)
-			{
-				record = (XLogRecord *) readRecordBuf;
-				if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
-					goto next_record_is_invalid;
-				gotheader = true;
-			}
-		} while (pageHeader->xlp_rem_len > len);
-
-		record = (XLogRecord *) readRecordBuf;
-		if (!RecordIsValid(record, *RecPtr, emode))
-			goto next_record_is_invalid;
-		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-		XLogSegNoOffsetToRecPtr(
-			readSegNo,
-			readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
-			EndRecPtr);
-		ReadRecPtr = *RecPtr;
-	}
-	else
-	{
-		/* Record does not cross a page boundary */
-		if (!RecordIsValid(record, *RecPtr, emode))
-			goto next_record_is_invalid;
-		EndRecPtr = *RecPtr + MAXALIGN(total_len);
-
-		ReadRecPtr = *RecPtr;
-		memcpy(readRecordBuf, record, total_len);
-	}
-
-	/*
-	 * Special processing if it's an XLOG SWITCH record
-	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
-	{
-		/* Pretend it extends to end of segment */
-		EndRecPtr += XLogSegSize - 1;
-		EndRecPtr -= EndRecPtr % XLogSegSize;
-
-		/*
-		 * Pretend that readBuf contains the last page of the segment. This is
-		 * just to avoid Assert failure in StartupXLOG if XLOG ends with this
-		 * segment.
-		 */
-		readOff = XLogSegSize - XLOG_BLCKSZ;
-	}
 	return record;
-
-next_record_is_invalid:
-	failedSources |= readSource;
-
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-
-	/* In standby-mode, keep trying */
-	if (StandbyMode)
-		goto retry;
-	else
-		return NULL;
 }
 
 /*
@@ -4223,88 +3865,6 @@ ValidXLogPageHeader(XLogPageHeader hdr, int emode)
 }
 
 /*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord.	It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
-					  bool randAccess)
-{
-	/*
-	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
-	 * required.
-	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
-	{
-		if (record->xl_len != 0)
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("invalid xlog switch record at %X/%X",
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-	else if (record->xl_len == 0)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("record with zero length at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
-		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
-		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid record length at %X/%X",
-						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (record->xl_rmid > RM_MAX_ID)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid resource manager ID %u at %X/%X",
-						record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-		return false;
-	}
-	if (randAccess)
-	{
-		/*
-		 * We can't exactly verify the prev-link, but surely it should be less
-		 * than the record's own address.
-		 */
-		if (!XLByteLT(record->xl_prev, *RecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-	else
-	{
-		/*
-		 * Record's prev-link should exactly match our previous location. This
-		 * check guards against torn WAL pages where a stale but valid-looking
-		 * WAL record starts on a sector boundary.
-		 */
-		if (!XLByteEQ(record->xl_prev, ReadRecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
-							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
-			return false;
-		}
-	}
-
-	return true;
-}
-
-/*
  * Try to read a timeline's history file.
  *
  * If successful, return the list of component TLIs (the given TLI followed by
@@ -6089,6 +5649,7 @@ StartupXLOG(void)
 	bool		backupEndRequired = false;
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
+	XLogReaderState *xlogreader;
 
 	/*
 	 * Read control file and check XLOG status looks valid.
@@ -6222,6 +5783,8 @@ StartupXLOG(void)
 	if (StandbyMode)
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+	xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, NULL);
+
 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
 						  &backupFromStandby))
 	{
@@ -6229,7 +5792,7 @@ StartupXLOG(void)
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(checkPointLoc, 0);
+		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
@@ -6247,7 +5810,7 @@ StartupXLOG(void)
 			 */
 			if (XLByteLT(checkPoint.redo, checkPointLoc))
 			{
-				if (!ReadRecord(&(checkPoint.redo), LOG, false))
+				if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
 							 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
@@ -6271,7 +5834,7 @@ StartupXLOG(void)
 		 */
 		checkPointLoc = ControlFile->checkPoint;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
-		record = ReadCheckpointRecord(checkPointLoc, 1);
+		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
 		if (record != NULL)
 		{
 			ereport(DEBUG1,
@@ -6290,7 +5853,7 @@ StartupXLOG(void)
 		else
 		{
 			checkPointLoc = ControlFile->prevCheckPoint;
-			record = ReadCheckpointRecord(checkPointLoc, 2);
+			record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
 			if (record != NULL)
 			{
 				ereport(LOG,
@@ -6591,7 +6154,7 @@ StartupXLOG(void)
 		 * Allow read-only connections immediately if we're consistent
 		 * already.
 		 */
-		CheckRecoveryConsistency();
+		CheckRecoveryConsistency(EndRecPtr);
 
 		/*
 		 * Find the first record that logically follows the checkpoint --- it
@@ -6600,12 +6163,12 @@ StartupXLOG(void)
 		if (XLByteLT(checkPoint.redo, RecPtr))
 		{
 			/* back up to find the record */
-			record = ReadRecord(&(checkPoint.redo), PANIC, false);
+			record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
 		}
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(NULL, LOG, false);
+			record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 		}
 
 		if (record != NULL)
@@ -6652,7 +6215,7 @@ StartupXLOG(void)
 				HandleStartupProcInterrupts();
 
 				/* Allow read-only connections if we're consistent now */
-				CheckRecoveryConsistency();
+				CheckRecoveryConsistency(EndRecPtr);
 
 				/*
 				 * Have we reached our recovery target?
@@ -6756,7 +6319,7 @@ StartupXLOG(void)
 
 				LastRec = ReadRecPtr;
 
-				record = ReadRecord(NULL, LOG, false);
+				record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 			} while (record != NULL && recoveryContinue);
 
 			/*
@@ -6806,7 +6369,7 @@ StartupXLOG(void)
 	 * Re-fetch the last valid or last applied record, so we can identify the
 	 * exact endpoint of what we consider the valid portion of WAL.
 	 */
-	record = ReadRecord(&LastRec, PANIC, false);
+	record = ReadRecord(xlogreader, LastRec, PANIC, false);
 	EndOfLog = EndRecPtr;
 	XLByteToPrevSeg(EndOfLog, endLogSegNo);
 
@@ -6905,8 +6468,15 @@ StartupXLOG(void)
 	 * record spans, not the one it starts in.	The last block is indeed the
 	 * one we want to use.
 	 */
-	Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
-	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
+	if (EndOfLog % XLOG_BLCKSZ == 0)
+	{
+		memset(Insert->currpage, 0, XLOG_BLCKSZ);
+	}
+	else
+	{
+		Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
+		memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
+	}
 	Insert->currpos = (char *) Insert->currpage +
 		(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
 
@@ -7063,17 +6633,7 @@ StartupXLOG(void)
 		close(readFile);
 		readFile = -1;
 	}
-	if (readBuf)
-	{
-		free(readBuf);
-		readBuf = NULL;
-	}
-	if (readRecordBuf)
-	{
-		free(readRecordBuf);
-		readRecordBuf = NULL;
-		readRecordBufSize = 0;
-	}
+	XLogReaderFree(xlogreader);
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -7104,7 +6664,7 @@ StartupXLOG(void)
  * that it can start accepting read-only connections.
  */
 static void
-CheckRecoveryConsistency(void)
+CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
 {
 	/*
 	 * During crash recovery, we don't reach a consistent state until we've
@@ -7284,7 +6844,7 @@ LocalSetXLogInsertAllowed(void)
  * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
+ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
 {
 	XLogRecord *record;
 
@@ -7308,7 +6868,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
 		return NULL;
 	}
 
-	record = ReadRecord(&RecPtr, LOG, true);
+	record = ReadRecord(xlogreader, RecPtr, LOG, true);
 
 	if (record == NULL)
 	{
@@ -10100,19 +9660,21 @@ CancelBackup(void)
  * sleep and retry.
  */
 static bool
-XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
-			 bool randAccess)
+XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
+			 bool randAccess, char *readBuf, void *private_data)
 {
+	/* TODO: these, and fetching_ckpt, would be better in private_data */
 	static XLogRecPtr receivedUpto = 0;
+	static pg_time_t last_fail_time = 0;
+	bool		fetching_ckpt = fetching_ckpt_global;
 	bool		switched_segment = false;
 	uint32		targetPageOff;
 	uint32		targetRecOff;
 	XLogSegNo	targetSegNo;
-	static pg_time_t last_fail_time = 0;
 
-	XLByteToSeg(*RecPtr, targetSegNo);
-	targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
+	XLByteToSeg(RecPtr, targetSegNo);
+	targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+	targetRecOff = RecPtr % XLOG_BLCKSZ;
 
 	/* Fast exit if we have read the record in the current buffer already */
 	if (failedSources == 0 && targetSegNo == readSegNo &&
@@ -10123,7 +9685,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
+	if (readFile >= 0 && !XLByteInSeg(RecPtr, readSegNo))
 	{
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
@@ -10144,12 +9706,12 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 		readSource = 0;
 	}
 
-	XLByteToSeg(*RecPtr, readSegNo);
+	XLByteToSeg(RecPtr, readSegNo);
 
 retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
-		(readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
+		(readSource == XLOG_FROM_STREAM && !XLByteLT(RecPtr, receivedUpto)))
 	{
 		if (StandbyMode)
 		{
@@ -10192,17 +9754,17 @@ retry:
 					 * XLogReceiptTime will not advance, so the grace time
 					 * alloted to conflicting queries will decrease.
 					 */
-					if (XLByteLT(*RecPtr, receivedUpto))
+					if (XLByteLT(RecPtr, receivedUpto))
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
 						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
-						if (XLByteLT(*RecPtr, receivedUpto))
+						if (XLByteLT(RecPtr, receivedUpto))
 						{
 							havedata = true;
-							if (!XLByteLT(*RecPtr, latestChunkStart))
+							if (!XLByteLT(RecPtr, latestChunkStart))
 							{
 								XLogReceiptTime = GetCurrentTimestamp();
 								SetCurrentChunkStartTime(XLogReceiptTime);
@@ -10321,7 +9883,7 @@ retry:
 						if (PrimaryConnInfo)
 						{
 							RequestXLogStreaming(
-									  fetching_ckpt ? RedoStartLSN : *RecPtr,
+									  fetching_ckpt ? RedoStartLSN : RecPtr,
 												 PrimaryConnInfo);
 							continue;
 						}
@@ -10393,7 +9955,7 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
 		{
 			readLen = XLOG_BLCKSZ;
 		}
@@ -10417,7 +9979,7 @@ retry:
 		{
 			char fname[MAXFNAMELEN];
 			XLogFileName(fname, curFileTLI, readSegNo);
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
+			ereport(emode_for_corrupt_record(emode, RecPtr),
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u: %m",
 							fname, readOff)));
@@ -10433,7 +9995,7 @@ retry:
 	{
 		char fname[MAXFNAMELEN];
 		XLogFileName(fname, curFileTLI, readSegNo);
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
+		ereport(emode_for_corrupt_record(emode, RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not seek in log segment %s to offset %u: %m",
 				fname, readOff)));
@@ -10443,7 +10005,7 @@ retry:
 	{
 		char fname[MAXFNAMELEN];
 		XLogFileName(fname, curFileTLI, readSegNo);
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
+		ereport(emode_for_corrupt_record(emode, RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not read from log segment %s, offset %u: %m",
 				fname, readOff)));
@@ -10501,7 +10063,7 @@ triggered:
  * you are about to ereport(), or you might cause a later message to be
  * erroneously suppressed.
  */
-static int
+int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..8ba05b1
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,496 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ *		Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ *		Documentation about how to use this interface can be found in
+ *		xlogreader.h, more specifically in the definition of the
+ *		XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "catalog/pg_control.h"
+
+static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+					  XLogRecord *record, int emode, bool randAccess);
+static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+/*
+ * Initialize a new xlog reader
+ */
+XLogReaderState *
+XLogReaderAllocate(XLogRecPtr startpoint,
+				   XLogPageReadCB pagereadfunc, void *private_data)
+{
+	XLogReaderState *state;
+
+	state = (XLogReaderState *) palloc0(sizeof(XLogReaderState));
+
+	/*
+	 * First time through, permanently allocate readBuf.  We do it this
+	 * way, rather than just making a static array, for two reasons: (1)
+	 * no need to waste the storage in most instantiations of the backend;
+	 * (2) a static char array isn't guaranteed to have any particular
+	 * alignment, whereas malloc() will provide MAXALIGN'd storage.
+	 */
+	state->readBuf = (char *) palloc(XLOG_BLCKSZ);
+
+	state->read_page = pagereadfunc;
+	state->private_data = private_data;
+	state->EndRecPtr = startpoint;
+
+	return state;
+}
+
+void
+XLogReaderFree(XLogReaderState *state)
+{
+	if (state->readRecordBuf)
+		pfree(state->readRecordBuf);
+	pfree(state->readBuf);
+	pfree(state);
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not NULL, try to read a record at that position.  Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC, LOG)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+XLogRecord *
+XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+{
+	XLogRecord *record;
+	XLogRecPtr	tmpRecPtr = state->EndRecPtr;
+	bool		randAccess = false;
+	uint32		len,
+				total_len;
+	uint32		targetRecOff;
+	uint32		pageHeaderSize;
+	bool		gotheader;
+
+	if (RecPtr == InvalidXLogRecPtr)
+	{
+		RecPtr = tmpRecPtr;
+
+		/*
+		 * RecPtr is pointing to end+1 of the previous WAL record.  If
+		 * we're at a page boundary, no more records can fit on the current
+		 * page. We must skip over the page header, but we can't do that
+		 * until we've read in the page, since the header size is variable.
+		 */
+	}
+	else
+	{
+		/*
+		 * In this case, the passed-in record pointer should already be
+		 * pointing to a valid record starting position.
+		 */
+		if (!XRecOffIsValid(RecPtr))
+			ereport(PANIC,
+					(errmsg("invalid record offset at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		randAccess = true;		/* allow curFileTLI to go backwards too */
+	}
+
+	/* Read the page containing the record */
+	if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf, state->private_data))
+		return NULL;
+
+	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+	targetRecOff = RecPtr % XLOG_BLCKSZ;
+	if (targetRecOff == 0)
+	{
+		/*
+		 * At page start, so skip over page header.  The Assert checks that
+		 * we're not scribbling on caller's record pointer; it's OK because we
+		 * can only get here in the continuing-from-prev-record case, since
+		 * XRecOffIsValid rejected the zero-page-offset case otherwise.
+		 * XXX: does this assert make sense now that RecPtr is not a pointer?
+		 */
+		Assert(RecPtr == tmpRecPtr);
+		RecPtr += pageHeaderSize;
+		targetRecOff = pageHeaderSize;
+	}
+	else if (targetRecOff < pageHeaderSize)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid record offset at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		goto next_record_is_invalid;
+	}
+	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+		targetRecOff == pageHeaderSize)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("contrecord is requested by %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		goto next_record_is_invalid;
+	}
+
+	/*
+	 * Read the record length.
+	 *
+	 * NB: Even though we use an XLogRecord pointer here, the whole record
+	 * header might not fit on this page. xl_tot_len is the first field of
+	 * the struct, so it must be on this page (the records are MAXALIGNed),
+	 * but we cannot access any other fields until we've verified that we
+	 * got the whole header.
+	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+	total_len = record->xl_tot_len;
+
+	/*
+	 * If the whole record header is on this page, validate it immediately.
+	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * check is necessary here to ensure that we enter the "Need to reassemble
+	 * record" code path below; otherwise we might fail to apply
+	 * ValidXLogRecordHeader at all.
+	 */
+	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+	{
+		if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+			goto next_record_is_invalid;
+		gotheader = true;
+	}
+	else
+	{
+		if (total_len < SizeOfXLogRecord)
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("invalid record length at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			goto next_record_is_invalid;
+		}
+		gotheader = false;
+	}
+
+	/*
+	 * Allocate or enlarge readRecordBuf as needed.  To avoid useless small
+	 * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
+	 * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with.  (That is
+	 * enough for all "normal" records, but very large commit or abort records
+	 * might need more space.)
+	 */
+	if (total_len > state->readRecordBufSize)
+	{
+		uint32		newSize = total_len;
+
+		newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+		newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+		if (state->readRecordBuf)
+			pfree(state->readRecordBuf);
+		state->readRecordBuf = (char *) palloc(newSize);
+		if (!state->readRecordBuf)
+		{
+			state->readRecordBufSize = 0;
+			/* We treat this as a "bogus data" condition */
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record length %u at %X/%X too long",
+							total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			goto next_record_is_invalid;
+		}
+		state->readRecordBufSize = newSize;
+	}
+
+	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+	if (total_len > len)
+	{
+		/* Need to reassemble record */
+		char	   *contrecord;
+		XLogPageHeader pageHeader;
+		XLogRecPtr	pagelsn;
+		char	   *buffer;
+		uint32		gotlen;
+
+		/* Initialize pagelsn to the beginning of the page this record is on */
+		pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+		/* Copy the first fragment of the record from the first page. */
+		memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+		buffer = state->readRecordBuf + len;
+		gotlen = len;
+
+		do
+		{
+			/* Calculate pointer to beginning of next page */
+			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+			/* Wait for the next page to become available */
+			if (!state->read_page(state, pagelsn, emode, false, state->readBuf, NULL))
+				return NULL;
+
+			/* Check that the continuation on next page looks valid */
+			pageHeader = (XLogPageHeader) state->readBuf;
+			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+			{
+				ereport(emode_for_corrupt_record(emode, RecPtr),
+						(errmsg("there is no contrecord flag at %X/%X",
+								(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+				goto next_record_is_invalid;
+			}
+			/*
+			 * Cross-check that xlp_rem_len agrees with how much of the record
+			 * we expect there to be left.
+			 */
+			if (pageHeader->xlp_rem_len == 0 ||
+				total_len != (pageHeader->xlp_rem_len + gotlen))
+			{
+				ereport(emode_for_corrupt_record(emode, RecPtr),
+						(errmsg("invalid contrecord length %u at %X/%X",
+								pageHeader->xlp_rem_len,
+								(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+				goto next_record_is_invalid;
+			}
+
+			/* Append the continuation from this page to the buffer */
+			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+			contrecord = (char *) state->readBuf + pageHeaderSize;
+			len = XLOG_BLCKSZ - pageHeaderSize;
+			if (pageHeader->xlp_rem_len < len)
+				len = pageHeader->xlp_rem_len;
+			memcpy(buffer, (char *) contrecord, len);
+			buffer += len;
+			gotlen += len;
+
+			/* If we just reassembled the record header, validate it. */
+			if (!gotheader)
+			{
+				record = (XLogRecord *) state->readRecordBuf;
+				if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+					goto next_record_is_invalid;
+				gotheader = true;
+			}
+		} while (pageHeader->xlp_rem_len > len);
+
+		record = (XLogRecord *) state->readRecordBuf;
+		if (!RecordIsValid(record, RecPtr, emode))
+			goto next_record_is_invalid;
+		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+		state->ReadRecPtr = RecPtr;
+		state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+	}
+	else
+	{
+		/* Record does not cross a page boundary */
+		if (!RecordIsValid(record, RecPtr, emode))
+			goto next_record_is_invalid;
+		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+		state->ReadRecPtr = RecPtr;
+		memcpy(state->readRecordBuf, record, total_len);
+	}
+
+	/*
+	 * Special processing if it's an XLOG SWITCH record
+	 */
+	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	{
+		/* Pretend it extends to end of segment */
+		state->EndRecPtr += XLogSegSize - 1;
+		state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+	}
+	return record;
+
+next_record_is_invalid:
+	return NULL;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * ReadRecord.	It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, int emode,
+					  bool randAccess)
+{
+	/*
+	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+	 * required.
+	 */
+	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	{
+		if (record->xl_len != 0)
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("invalid xlog switch record at %X/%X",
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+	else if (record->xl_len == 0)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("record with zero length at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid record length at %X/%X",
+						(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (record->xl_rmid > RM_MAX_ID)
+	{
+		ereport(emode_for_corrupt_record(emode, RecPtr),
+				(errmsg("invalid resource manager ID %u at %X/%X",
+						record->xl_rmid, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+		return false;
+	}
+	if (randAccess)
+	{
+		/*
+		 * We can't exactly verify the prev-link, but surely it should be less
+		 * than the record's own address.
+		 */
+		if (!XLByteLT(record->xl_prev, RecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+	else
+	{
+		/*
+		 * Record's prev-link should exactly match our previous location. This
+		 * check guards against torn WAL pages where a stale but valid-looking
+		 * WAL record starts on a sector boundary.
+		 */
+		if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) (RecPtr >> 32), (uint32) RecPtr)));
+			return false;
+		}
+	}
+
+	return true;
+}
+
+
+/*
+ * CRC-check an XLOG record.  We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXLogRecord, so it is safe to fetch xl_len.
+ */
+static bool
+RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+{
+	pg_crc32	crc;
+	int			i;
+	uint32		len = record->xl_len;
+	BkpBlock	bkpb;
+	char	   *blk;
+	size_t		remaining = record->xl_tot_len;
+
+	/* First the rmgr data */
+	if (remaining < SizeOfXLogRecord + len)
+	{
+		/* ValidXLogRecordHeader() should've caught this already... */
+		ereport(emode_for_corrupt_record(emode, recptr),
+				(errmsg("invalid record length at %X/%X",
+						(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+	remaining -= SizeOfXLogRecord + len;
+	INIT_CRC32(crc);
+	COMP_CRC32(crc, XLogRecGetData(record), len);
+
+	/* Add in the backup blocks, if any */
+	blk = (char *) XLogRecGetData(record) + len;
+	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+	{
+		uint32		blen;
+
+		if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+			continue;
+
+		if (remaining < sizeof(BkpBlock))
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("invalid backup block size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+		if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("incorrect hole size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+		if (remaining < blen)
+		{
+			ereport(emode_for_corrupt_record(emode, recptr),
+					(errmsg("invalid backup block size in record at %X/%X",
+							(uint32) (recptr >> 32), (uint32) recptr)));
+			return false;
+		}
+		remaining -= blen;
+		COMP_CRC32(crc, blk, blen);
+		blk += blen;
+	}
+
+	/* Check that xl_tot_len agrees with our calculation */
+	if (remaining != 0)
+	{
+		ereport(emode_for_corrupt_record(emode, recptr),
+				(errmsg("incorrect total length in record at %X/%X",
+						(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+
+	/* Finally include the record header */
+	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+	FIN_CRC32(crc);
+
+	if (!EQ_CRC32(record->xl_crc, crc))
+	{
+		ereport(emode_for_corrupt_record(emode, recptr),
+		(errmsg("incorrect resource manager data checksum in record at %X/%X",
+				(uint32) (recptr >> 32), (uint32) recptr)));
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index b5bfb7b..1ada664 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -229,6 +229,14 @@ extern const RmgrData RmgrTable[];
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
 
+
+/*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+extern int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+
 /*
  * These aren't in xlog.h because I'd rather not include fmgr.h there.
  */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..d475a9b
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ *		Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogreader.h
+ *
+ * NOTES
+ *		Check the definition of the XLogReaderState struct for instructions on
+ *		how to use the XLogReader infrastructure.
+ *
+ *		The basic idea is to allocate an XLogReaderState via
+ *		XLogReaderAllocate, and call XLogReadRecord() until it returns NULL.
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+typedef bool (*XLogPageReadCB)(struct XLogReaderState *state,
+							   XLogRecPtr RecPtr, int emode,
+							   bool randAccess,
+							   char *readBuf,
+							   void *private_data);
+
+typedef struct XLogReaderState
+{
+	/* ----------------------------------------
+	 * Public parameters
+	 * ----------------------------------------
+	 */
+
+	/* callbacks */
+
+	/*
+	 * Data input function.
+	 *
+	 * This callback *has* to be implemented.
+	 *
+	 * Has to read the XLOG_BLCKSZ bytes at the location 'RecPtr' into the
+	 * memory pointed to by the 'readBuf' parameter. Returns true on success,
+	 * false if the page could not be read.
+	 */
+	XLogPageReadCB read_page;
+
+	/*
+	 * This can be used by the caller to pass state to the callbacks without
+	 * resorting to global variables or similar ugliness. It will neither be
+	 * read nor set by anything but your code.
+	 */
+	void *private_data;
+
+	/* from where to where are we reading */
+
+	XLogRecPtr ReadRecPtr;	/* start of last record read */
+	XLogRecPtr EndRecPtr;	/* end+1 of last record read */
+
+	/* ----------------------------------------
+	 * private/internal state
+	 * ----------------------------------------
+	 */
+
+	/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+	char	   *readBuf;
+
+	/* Buffer for current ReadRecord result (expandable) */
+	char	   *readRecordBuf;
+	uint32		readRecordBufSize;
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * The read_page callback and the starting position are supplied here and
+ * have to be valid before the reader can be used.
+ */
+extern XLogReaderState *XLogReaderAllocate(XLogRecPtr startpoint,
+				   XLogPageReadCB pagereadfunc, void *private_data);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState *state);
+
+/*
+ * Read the next record from xlog. Returns NULL on end-of-WAL or on failure.
+ */
+extern XLogRecord *XLogReadRecord(XLogReaderState *state, XLogRecPtr ptr, int emode);
+
+#endif /* XLOGREADER_H */
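
To make the intended call pattern concrete, here is a minimal usage sketch of
the header above; my_read_page and walk_wal are hypothetical names, and the
convention of continuing from state->EndRecPtr is inferred from the struct
comments rather than spelled out by the patch:

#include "postgres.h"
#include "access/xlogreader.h"

/*
 * Hypothetical callback: copy the XLOG_BLCKSZ-byte page containing
 * 'RecPtr' into 'readBuf'; return false if the page cannot be read.
 */
static bool
my_read_page(struct XLogReaderState *state, XLogRecPtr RecPtr, int emode,
			 bool randAccess, char *readBuf, void *private_data)
{
	/* ... fetch the page from disk, archive, or network here ... */
	return true;
}

static void
walk_wal(XLogRecPtr startpoint)
{
	XLogReaderState *state = XLogReaderAllocate(startpoint, my_read_page, NULL);
	XLogRecPtr	ptr = startpoint;
	XLogRecord *record;

	/* NULL means end-of-WAL or an invalid record */
	while ((record = XLogReadRecord(state, ptr, LOG)) != NULL)
	{
		/* ... decode or forward the reassembled record ... */
		ptr = state->EndRecPtr;		/* continue with the next record */
	}

	XLogReaderFree(state);
}
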
#20Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#19)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WAL

writing into this.
If you write something that filters/analyzes and then forwards WAL and
you want to do that without a big overhead (i.e. completely reassembling
everything, and then disassembling it again for writeout) it's hard to do
that without integrating both sides.

It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.

It works ;)

Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, i.e. if a record is split across
pages, copying the parts into a contiguous temporary buffer.

Well, if you want to fully split reading and writing of records - which is a
nice goal! - you basically need the full logic of XLogInsert again to take
them apart for writeout. Alternatively you need to store record
boundaries somewhere and copy that way, but in the end, if you filter, you need
to correct CRCs...

Also, I want to read records incrementally/partially just as the data comes
in, which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated.

Well, I can just say again: it works ;). It makes it easy to follow something
like XLogwrtResult without having to care about record boundaries.

I'd suggest keeping it simple for now, and optimize later if necessary.

Well, yes. The API should be able to comfortably support those cases, though,
which I don't think is necessarily the case in a simple, one-call API as
proposed.

Note that before you have the whole WAL record, you cannot CRC check it, so
you don't know if it's in fact a valid WAL record.

Sure. But you can start the CRC computation without any problems and finish it
when the last part of the data comes in.
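
For illustration, that incremental computation could be wrapped like this,
reusing the CRC macros that RecordIsValid() above already uses on complete
records; the names are hypothetical and the chunks are assumed to arrive in
record order:

#include "postgres.h"
#include "access/xlog_internal.h"
#include "utils/pg_crc.h"

typedef struct IncrementalCrc
{
	pg_crc32	crc;			/* running CRC, kept in the reader state */
} IncrementalCrc;

static void
crc_start(IncrementalCrc *ic)
{
	INIT_CRC32(ic->crc);
}

/* feed each chunk of rmgr data / backup blocks as it arrives */
static void
crc_add_chunk(IncrementalCrc *ic, const char *data, Size len)
{
	COMP_CRC32(ic->crc, data, len);
}

/* finish once the record header is complete; true if the CRC matches */
static bool
crc_finish(IncrementalCrc *ic, XLogRecord *record)
{
	COMP_CRC32(ic->crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32(ic->crc);
	return EQ_CRC32(record->xl_crc, ic->crc);
}
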

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see,
it can never operate on incomplete/partial input. You need to buffer
larger amounts of xlog somewhere and you need to be aware
of record boundaries. Both are things I dislike in a more generic user
than xlog.c.

I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.

Well, imagine something like the walsender that essentially follows the flush
position, ideally without regard for record boundaries. It is nice to be able to
send/analyze/filter as soon as possible without waiting till a page is full.
And it sure would be nice to be able to read the data on the other side
directly from the network, decompress it again, and only then store it to disk.

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the
callback. That's not really nice if your process has other things to wait for as
well.

In my proposal you can simply do something like:

XLogReaderRead(state);

DoSomeOtherWork();

if (CheckForMessagesFromWalreceiver())
    ProcessMessages();
else if (state->needs_input)
    UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
    UseSelectOnOutputSocket();

but you can also do something like waiting on a Latch but *also* on other fds.

If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#20)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.
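
Spelled out, such a callback might look like this; the name and exact
signature are illustrative only, not from any posted patch:

/*
 * Hypothetical shape of the suggested ->error() callback.  A standalone
 * tool could implement it with vfprintf(stderr, ...), while the backend
 * would wrap ereport()/elog().
 */
typedef void (*XLogReaderErrorCB) (int severity, const char *fmt, ...);
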

Otherwise it looks sensible, although it takes a more minimal approach (which
might or might not be a good thing). The one thing I definitely like is that
nearly all of it is tried and true code...

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#22Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#20)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 12:07, Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the
callback.

Or you can return false from the XLogPageRead() callback if the
requested page is not available. That will cause ReadRecord() to return
NULL, and you can retry when more WAL is available.

- Heikki
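
In code, the retry pattern described above might look roughly like this;
retrying_read_page and wal_available_up_to are hypothetical, only the
callback signature comes from xlogreader.h:

#include "postgres.h"
#include "access/xlogreader.h"

/* hypothetical: how far WAL has been received so far */
static XLogRecPtr wal_available_up_to;

static bool
retrying_read_page(struct XLogReaderState *state, XLogRecPtr RecPtr, int emode,
				   bool randAccess, char *readBuf, void *private_data)
{
	/* page not fully received yet: make XLogReadRecord() return NULL */
	if (RecPtr + XLOG_BLCKSZ > wal_available_up_to)
		return false;

	/* ... otherwise copy the XLOG_BLCKSZ-byte page at RecPtr into readBuf ... */
	return true;
}

On NULL, the caller waits until more WAL has arrived and simply calls
XLogReadRecord() again with the same start pointer.
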

#23Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#21)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 13:01, Andres Freund wrote:

On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

- Heikki

#24Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#23)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:

On 17.09.2012 13:01, Andres Freund wrote:

On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your
patch seems to be quite a good start. It misses xlogreader.h
though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

That seems a bit more complex from a memory management perspective as you
probably would have to sprintf() into some buffer. We cannot rely on a backend
environment with memory contexts and the like...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#25Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#24)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 14:42, Andres Freund wrote:

On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:

On 17.09.2012 13:01, Andres Freund wrote:

On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your
patch seems to be quite a good start. It misses xlogreader.h
though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

That seems a bit more complex from a memory management perspective as you
probably would have to sprintf() into some buffer. We cannot rely on a backend
environment with memory contexts around et al...

Hmm. I was thinking that making this work in a non-backend context would
be too hard, so I didn't give it much thought, but I guess there aren't
many dependencies on backend functions after all. palloc/pfree are
straightforward to replace with malloc/free. That's what we could easily
do with the error messages too, just malloc a suitably sized buffer.
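
For what it's worth, the malloc'ed error-buffer idea might look roughly like
this in standalone code; every name here is illustrative, not from any posted
patch:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ERRORMSG_LEN 1000		/* illustrative size */

/*
 * The reader mallocs the buffer once; validation routines print into it
 * instead of calling ereport(), and the caller decides whether to log
 * the message, PANIC, or ignore it.
 */
typedef struct SketchReaderState
{
	char	   *errormsg;		/* last error, empty string if none */
} SketchReaderState;

static SketchReaderState *
sketch_reader_allocate(void)
{
	SketchReaderState *state = malloc(sizeof(SketchReaderState));

	if (state == NULL)
		return NULL;
	state->errormsg = malloc(MAX_ERRORMSG_LEN);
	if (state->errormsg == NULL)
	{
		free(state);
		return NULL;
	}
	state->errormsg[0] = '\0';
	return state;
}

static void
report_invalid_record(SketchReaderState *state, const char *fmt, ...)
{
	va_list		args;

	va_start(args, fmt);
	vsnprintf(state->errormsg, MAX_ERRORMSG_LEN, fmt, args);
	va_end(args);
}
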

How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?

- Heikki

#26Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#25)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 01:50:33 PM Heikki Linnakangas wrote:

On 17.09.2012 14:42, Andres Freund wrote:

On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:

On 17.09.2012 13:01, Andres Freund wrote:

On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your
patch seems to be quite a good start. It misses xlogreader.h
though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

It seems we would need one additional callback for both approaches
like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

That seems a bit more complex from a memory management perspective as you
probably would have to sprintf() into some buffer. We cannot rely on a
backend environment with memory contexts around et al...

Hmm. I was thinking that making this work in a non-backend context would
be too hard, so I didn't give that much thought, but I guess there isn't
many dependencies to backend functions after all. palloc/pfree are
straightforward to replace with malloc/free.

Hm. I thought that it was pretty much a design requirement that this be usable
outside of the backend environment?

That's what we could easily do with the error messages too, just malloc a
suitably sized buffer.

Not very comfortable though... Especially if you need to return an error from
the read_page callback...

How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?

Not really sure. I thought about just putting it in pgport or such, but that
seemed ugly as well.
The bin/xlogdump hack, which I find really helpful, at first simply had a
dependency on ../../backend/access/transam/xlogreader.o, which worked fine
until it needed more because of the *_desc routines... But Alvaro started to
work on this, although I don't know when he will be able to finish it.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#27Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#22)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 12:52:32 PM Heikki Linnakangas wrote:

On 17.09.2012 12:07, Andres Freund wrote:

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside
the callback.

Or you can return false from the XLogPageRead() callback if the
requested page is not available. That will cause ReadRecord() to return
NULL, and you can retry when more WAL is available.

That requires building quite a bit of knowledge on the outside (sketched
below):
* you need to transport the information that you need more input via some
external variable/->private_data
* you need to transport at which RecPtr you needed more data
* you need to signal that you're not dealing with an invalid record after
returning, given both conditions return NULL
* you need to buffer all incoming data somewhere if it comes from the network
or similar, because at the next call XLogReadRecord will restart reading from
the beginning
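
One possible shape of that caller-side state, with hypothetical names
throughout:

#include "postgres.h"
#include "access/xlogdefs.h"

typedef struct CallerReadState
{
	bool		need_more_input;	/* read_page ran out of received WAL */
	XLogRecPtr	missing_at;			/* the RecPtr for which data was missing */
	char	   *pending_buf;		/* buffered network data, because the next
									 * XLogReadRecord() call re-reads the whole
									 * record from its start */
	Size		pending_len;
} CallerReadState;
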

Sorry if I sound sceptical! If I had had your patch in my hands half a year ago
I would have been very happy, but after building the more generic version that
can do all of the above (including a compatible XLogReaderReadOne(state)) it's
a bit hard to do that. Not sure if it's just the feeling of possibly having
wasted the time...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#23)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 17.09.2012 13:01, Andres Freund wrote:

It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.

Furthermore, returning enough state for useful error messages back out
of multiple layers of function call is going to be notationally messy,
and will end up requiring complicated infrastructure barely simpler than
elog anyway.

It'd be a lot better for the wal-dumping program to supply a cut-down
version of elog than to try to promise that all errors will be returned
back from ReadRecord.

regards, tom lane

#29Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#28)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 17:08, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

On 17.09.2012 13:01, Andres Freund wrote:

It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.

I wasn't thinking that we'd completely eliminate all elog() calls from
ReadRecord and everything it calls, but only the "expected" ones that
mean we've reached the end of valid WAL. The ones that use
emode_for_corrupt_record(). Any unexpected errors like running out of
file descriptors would still use ereport() like usual.

That said, Andres' suggestion of making this facility completely
independent of any backend functions, making it usable in external
programs, doesn't actually seem that hard. ReadRecord() itself is fairly
small, as are the subroutines that validate the records. XLogReadPage(),
which goes out to fetch the right xlog page from archive or whatever, is
way more complicated. But that would live in the callback, so it would
be free to use all the normal backend facilities. However, it means that
external programs would need to supply their own (hopefully much
simpler) version of XLogReadPage(); I'm not sure how that goes with
Andres' plans on using xlogreader.

- Heikki

#30Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#28)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 04:08:01 PM Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 17.09.2012 13:01, Andres Freund wrote:

It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.

If you start relying on palloc all hope is gone anyway. I "only" want a
standalone XLogReader because that's just too damn annoying/hard to duplicate in
standalone code. There are several very useful utilities out there that are
incomplete and/or unreliable for that reason. And loads of others that haven't
been written because of that.

That is one of the reasons - besides finding the respective xlog.c code very
hard to read/modify/extend - why I wrote a completely standalone xlogreader.
One other factor was just learning how the hell all that works ;)

I still think the interface that something as plain as the proposed
XLogReadRecord() provides is too restrictive for many use cases. I agree that
a wrapper with exactly such an interface for xlog.c is useful, though.

Furthermore, returning enough state for useful error messages back out
of multiple layers of function call is going to be notationally messy,
and will end up requiring complicated infrastructure barely simpler than
elog anyway.

Hm. You mean because of file/function/location?

It'd be a lot better for the wal-dumping program to supply a cut-down
version of elog than to try to promise that all errors will be returned
back from ReadRecord.

Well, I suggested a ->error() callback for exactly that reason; that seems
relatively
easy to wrap.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#31Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#29)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 04:18:28 PM Heikki Linnakangas wrote:

On 17.09.2012 17:08, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

On 17.09.2012 13:01, Andres Freund wrote:

It seems we would need one additional callback for both approaches
like: ->error(severity, format, ...)
For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.

I wasn't thinking that we'd completely eliminate all elog() calls from
ReadRecord and everything it calls, but only the "expected" ones that
mean we've reached the end of valid WAL. The ones that use
emode_for_corrupt_record(). Any unexpected errors like running out of
file descriptors would still use ereport() like usual.

That said, Andres' suggestion of making this facility completely
independent of any backend functions, making it usable in external
programs, doesn't actually seem that hard. ReadRecord() itself is fairly
small, as are the subroutines that validate the records. XLogReadPage(),
which goes out to fetch the right xlog page from archive or whatever, is
way more complicated. But that would live in the callback, so it would
be free to use all the normal backend facilities. However, it means that
external programs would need to supply their own (hopefully much
simpler) version of XLogReadPage(); I'm not sure how that goes with
Andres' plans on using xlogreader.

XLogRead() from walsender.c is pretty easy to translate to backend-independent
code, so I don't think that's a problem. I don't see how the backend's version
is useful outside of the startup process anyway.

We could provide a default backend-independent variant that hits files in
xlogreader.c - it's not much code - to avoid others copying it multiple times.

I used a variant of that in the places that read from disk without any
problems. Obviously not in the places that read from the network, but that's
shelved due to the different decoding approach atm anyway.
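
A file-hitting callback along those lines could be as simple as the sketch
below, under the XLogPageReadCB signature from xlogreader.h;
open_wal_segment() is a hypothetical helper, and real code would additionally
have to handle timelines and segment switches:

#include "postgres.h"

#include <unistd.h>

#include "access/xlogreader.h"

/* hypothetical helper: open the segment file containing 'recptr' */
extern int	open_wal_segment(XLogRecPtr recptr);

static bool
file_read_page(struct XLogReaderState *state, XLogRecPtr RecPtr, int emode,
			   bool randAccess, char *readBuf, void *private_data)
{
	int			fd = open_wal_segment(RecPtr);
	off_t		off = RecPtr % XLogSegSize;		/* byte offset in the segment */

	off -= off % XLOG_BLCKSZ;					/* round down to the page start */

	if (fd < 0)
		return false;
	if (pread(fd, readBuf, XLOG_BLCKSZ, off) != XLOG_BLCKSZ)
	{
		close(fd);
		return false;
	}
	close(fd);
	return true;
}
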

Regards,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#32Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
1 attachment(s)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents

Hi all,

Attached are the .txt and .pdf (both are IMO readable and contain the same
content) with design documentation for the proposed feature.

Christian Kruse, Marko Tiikkaja and Hannu Krosing read the document and told me
about my most egregious mistakes. Thanks!

I would appreciate some feedback!

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

DESIGN.txt (text/plain; charset=UTF-8)
#33Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#32)
2 attachment(s)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

This time I really attached both...
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

DESIGN.txt (text/plain; charset=UTF-8)
DESIGN.pdf (application/pdf)
[Binary PDF attachment content omitted. Its embedded metadata identifies it as
"High Level Design for Logical Replication in Postgres" by Andres Freund,
2ndQuadrant Ltd., produced on 2012-09-22 with DocBook XSL Stylesheets and
Apache FOP.]
B;`�o
��PS�'�[M�It�-[f��}��J@)�k�\�$C;�����	�T��R�R���oQ���~���BM!���R'��
�k���S����$$$m���	�#,,�����������n��&g_�GTb��o��QRR����������q������F�F����������o+�x���F�en�)>Bii���g333��������=�X���P�o�6�����K�)�<���v�j��PQQQTT��������~�=�X���������M�)p����7����0����Z�^����������(�+�.������2g�g�����J�CM������J(..�H�`,aDa\��w��ZS:u��LS�b�u&���S\��UB�@Q����;	���$��5�(�'	!�AM!J�IB�{PS�z������$��5�(�'	!�AM!J�IB�{x_S���J�����u$����p����Xz���zv{�S!��5�B�^PS!��5�B�^PS!��5�B�^PS!��5��������\�|Y������[��_����F�5k�,**�����h����c��
�:�$��&��"����y���mz��h��_�x�buu���'���gO�f�z�������ufB|j�yA�u��M�6������h��6KJJ�4�^�:��;�.B�.PS���_|��:�._J���g��!���aaaqqq7n�NEGG����/���"�>�S(��i�����'''���/�l�ph�]��K�2	!>5��Ha���/��;v������?���?q�D�Z�t�8;u�T$�}�]�l�Z�F���;8��NLL����m�>���h��y��
�(�}
!f��b^�`����=HDDDQQ�o��P�������������H����F��G���+��8@r��18Fg�c�2p�@h��,��E�)������[pv��LB�oBM1/�`����������������]L�v�p<n�����:�Il=p
\�PQQ�w����>$$5w����yR5�kCM1/�`{�������;v(5����*���C����������o3f�h��eff&2qJ�16l�%�����r6l(%����)�Xj�y��?��#Bz�.]����N���Kg���3�~�i����������?��O��S��)--�%�;w�r4h`�#AAA���F$8@M!�bPS��2��_�����D�:@/�/���Y�dI�>}�K��������QQQ�jO���S�����g����]�|���Ph����2m��A������rg��y�~�������b1�)�Elkjj���N��C�%$$�j�
J�8?~��]�vIg�\���E����D���>�$2q�Y��F�������h��I��
[�l9p�@�����~��uk�1YYY��������7���Cyj
!��B!D/�)�B���B�$$$m!��PS�z������$��5�(�'	!�AM!J�IB�{PS�z������$���'4�k��6#����n<���BqOh
#��aB���B��	!�AM!J������%�AB�{PS�� !�=�)D	{�������}B�W�^F[G1
�f���LS�{�=��#�C�;����:u��?�YSS�������R_����)���9������B��#�C
S��]���1b��o6��PS���B��wWj
���v{��Tff&�;v4�$���R'�"HLLTj��������c�����vV�.G'%%����q�����������?��$����v��m���.�k~~��I�������Z�n=n����wKg���3d������������7��*G��}��g��W�Z%��?]�t�={vuu��Zxo��i
6Ty��zqp�=�����?�N�[�9w�y�C��j
����M������J��"����'O"��}{y�
8�D������]�t���^B������)����"	�9]K��=]h
������
���`�����8�c�����~���������������m�������I4-�<s����7����������8;k�,�������W�\Qs���*-����~�����K�=9����C��j
�����\S&L�`�E�R��v�Zvv�x������4j���%��w�}'����6�6G�Y5%�dJJ�Hb��BS�5k�S(�;�;�X�S���"y��y$�}���w�^���h�a+�/_����h�����;��7G���B`y�����$�;l��Gsa�������5e��mF[D`S�&���M��� ��������8���$62.4EZ�4m�466v��ig������_��D���V�Z���&O��}�t���K$�>/����i����t���|��'8<x�C��j
�(,,��I����H,����8@
SUUU����{�\�~���<"�;�e����K�KSrss��v��I1*#N���P��v��i�:t��������x����:/qx����S�4hp�w�x��A����O^ej�)DNBB����O7����VPP��wo)I����x_S��/����g_%%%�y�f�D������/]�������P������NMq}	�(�]��B)9|�p$��]�o��	D��C�5��IJJ����CF�B��#G����U�'����Cl������.�k�xG���A�|G?x�����
K����M�6���!C����Hd� O�p���,Y��O�>��D���^��E��7�,//��v���NMq}	h��
����r�X(����m�O��w����av�)DNeeeHHH�N��6�8E��E�D$����������7o�%��5���%�j<&&���>�)��H���������@pn���O<q��9�,�7�># ��?~��]���B`�>p�[�n��1�Cpppll��+�����V�����[�e�i�T��������6lp�
�CM!vL�2%11�h+�����@X����������gy��W`3���+�)��m��effm1=�F�JMM�������K�&BL�)���pl������SPS!� ))���c```XX4������	si�xYs�-�|��F��A�)�B���B!D/�)�B���B!D/�)�B���B!D/|MS�v�j#�
z_��s��9s���vy�gy��Y7�5M��J���`��������������W�j���$!D=�y��Nj
q�~~~������,v�ma��$��G9���IM!�a�������������\LM!�;(���:�)�=�����OOO���?���L�h�>&	!�Q�k�uRS�{�����!+�����+))�>��<��=�����Wo��555���S������HQ���r^k5�HHH0�]������7�:t������e�)�:����t)��'�@������~��V�;�555j������Fb��B���"�2e
.|������o�9����rp9>����FM1�]j
!BM|���p���"YTT��A���=+2q��n�C�����=!j�;zAaa!D$22R$7l��z����w��5"g������B�a<88�K�.�g�������3g&M��
[�n=n��={�H����2$444,,,..n��M�)���w��������������}�Y���u��l�2;M������i��EDD4l��
���z������Z{������������8�9s&�'L�����_��|���_�b�������.]�?>
��5K
n�%�������k���Y�r����/N���3  �_�~�N�*..�8q"JB����#G����&&&"����m��!��O��Zp oTM�o��j�z��{s5���y��NC4���+�����q�P�����o�^����'�H���+W��@tt��$K�5k�c�M��
edd������G^��D���L��Dr������}���US3n�-W�
5���y��NC4e����v��I�/_n��I��m�	�

���8���T���p���QQQ��n =D����}��d��Mccc�M�VPP N!���Q�F�J��i���j���q�Ild�g���;u�WuBM!���r^k��M)//�|t����������B����q�w�^������Cm��.^��$N�C��8//���o�����2�����'g��v��)�k��i
������#���`���G�����c�,�7>>^^844T�w���LS$.]�$^�C�D�x~����,B�����������/��B�������X��D�_|��HIIAB������������z���[���;;��\�v-��/_O��"Na$6G������]�ti�>}��(s�l�bs����5�5����"GZ�VVV"*��qc�y��iy��g��3&<<<88866v����4e���?�pDD�|�v��x�����K����&$$�j�*  �M�6�����{���9K�,���n��ILL$��l�j�j
!�	5�xj
!�	5�x�i
�����'�es����9M���"��A9���IM13f�p�)��������bv��Z{���q���� ��;PS1;�y��Nj�����RSF��M�)�����^'5������JMY�b�7m�;zB�����^����	JHHHUU�7m���B��}G����ZEM1��2&L���B���Bt���?�k�����l��iJ��]��
`\�=&m��/^���&22���+^6��g���N<�UiiiyyyUUUMM���W�5&5�j�=+�hg��$$$M�>}��[�����;z�v�	0����/\�P\\e����|GO�e��5BSRSS����4E���p�O�qu���������Jee�7G���uRS�6�!!!�:u2�uj
!����0Szz:d����eee��J{��IM15S�Ly���
i��B�v0����!+�����+))���V���Nj����};��!MSS�������n����z��)/�Ei��:�)�=,��~���/_VY���x-��������g�q���*��@� ]jvQ���+;u�������E�D�<�g�������������j
��@Sk�������=b�$!+[�O�)v�z���`M2$//�ua���3H����Hv���[�Y
j������#XLS@�-���\�������eK��}�6m��I�>}� )?�b�
�����n��-[�L^�]�G�;vldd�(�j�*�zP�������kjj�4��Tuu��A^&%%2�i�&��a3�	

�],_����,5EF�����]���\p����;���	������.���4l��w��999���8@R��;v�Zq���S^���Ij������+**N�81~�x���3g&''���_�ti����j��Yu�gw*33����K9;w������S����'�jED~�0{C�%����:�h�G�{GJJJ0�322�,��q�J���5����z5��G�8ea���Ni���'�{��E��8p ���={�8�����x��uu�s��������<��r��i�>�����
@M��� �.\@� �����I�Q�Z��#5(��^���[ZZz��Y,T���0�7��4E�o		Y�d��*���Nl.p���L$1��D�H���+�:���M���e����'O�����B�w���4O�1���s_�Q;5j$����V�Z��2+b�����������!+����O�{<��\�2&�u�CA�^ ;;�aa��5M���_�M:�V�������o�qX�3�q������"���_KD��,���;r�
kCM��_��e���O��FMqRX��s��u�_���P4�����/�����V"�����A�p�~�ze���6Yl��k��J\[~��������{K9@���$�;{���-��s�H
J{���q�z��J�;Xo�����D�>���mM�)--�ua���N����/��������S���Cw7n|�m�}������X`L�4I��Z�z������v���=M#G���~n���P�:��������Kq����&����~[��_u��@�`�b
YTTd.M��{���3�����%R�Y�&���������g�q*&&f��E6��dzz����o��fl���%��3fLxxxpppll��,�{��n�:����O�IMMMHHh��m������w��-�*�a?�G�H��v-��y�%z��h�P�U~����&�]j���QQQ���*��C�8z�(������6D������]��. *��~������~���P}���P�F�:|�puu�O?�$�_�W��5E5��Z�wQ�M���]j���ZX�f���������0h���k��H-�9��k��D%6��2�KS�����5��CB����Z�wQ�|��KS4�I��P���	1v���G{�~�)��������8�7o�_y��������Z�J���_�g��:5E_L7�)���]��.p���~��X������0�z�2q<u�T��<P��#�Lj����)��5��Z�w�{�]��N�6���2�����������#���������>l�0 )��)���M��pOFM�_�{���.]��#..n��-��_��y��>}� ���D�����r�J�Y�`���o���`3����;���-��3`,aDa\ata���|Go,�_��1i����+W��m��s�5k�B#/RQ��W�����>K����x�%�(�+�.�1o�v���X�F��fap������g�}��y��D����
���Hb�p��RW�~�)�n�	0�0�0�0�0��9���Zc��2W�=x����o�����w�����E��s�"9n�8�����NM��~�~7����X�����&������]9�5V��*�����k����#""�^�*����-Z�I��{w�����8@�G��}��Qw�5E��u�!!���z=w�fPYYYMM�7G���j�����1����;z�v�	�h
��ob^M�8&�2CY��>$�&������:sA_��h���bU�)���N<�M�������V�C�!J{��IM�*��:�1c��_
vX��>$�&������z��v��h��,�b��qy�fy5/AMQS8666##Cea�����}vv�<Id�`v%�5VM7G|����O����6��#���.�!"�DM�F�e]�w���+��qE\��n���6�l�lj��������IVc�Do�PAV�hH���I,d9z_���rj23g�szf����@M�t?��<����9=--�;z+[�4�-�WS,�}TTTQQ����-�x����������z��4%���G�F�������5�4��|�r�����������#"".]��"���D����-|�&~O?����p�������'�L�6-<<|���{����kmm-�	&���eee�?*a�={v��5III,�t����J����������lu�<.--�7o�rrrr�4����Qyk M��XJ|||II�Fb�F����X���{at�9W���Q��fg�;vL�:�5{�l�a��m��������������DFF>��[�lf� �6lHII��I�&�]����F�&�5�*K�t:7m��������	i�,�I������8�|�(8�:juuuXX���[ZZ�����_/H>��K��2eJUU�����������)(�B����8@P��B��g��`��,Z���=~�8��0w�\����FAA�J51�1#���p�������s����p������u�ST������8p����p��|�W��_~�e�����?�������H��+��Spr��� ���[�`�FM��*K��;���?=6HSdyMj����]SN�<)uW�(����Y}A�P� �<��C��f�L����	��"��<���g���Aa����,Y� �� n�HS,�_S@jj*n���[�a�K/���?���p�q��������|��78�z��4822�cL:\��%?Z7~�x�����#eM��*K��iW���)�'�`���k�8�����Ys��b������Y�)���Pi�z
U�n������o5�g�Q��u[\\,�������ys����`�E����u��_��������g�III����TJ\�A/�lk��M�/_���vW��$8h���yMjg=��W�������S�j���4�4��>�Vb%6[[[U���h���>b�]�v)�J�q��
���Fw�O<��0�����A��)M�p8�}��)S�H�K�&�]�+���0��AY~���)�����P�!UII�j��K��lyy�����3+v
�B�k
{�%>�g_��[����|����~i����08��"""p|��U��V��
�
r78a�ArGt��	U����gk1������vW%M�M���kR;��1�Z.���C��:u*�m��!�������EdTT������1�����n��
�Y��1�����t�^�P������P(�����I�����766j'�o1
#��l��7������4g�
�KNNF�����
r+��#�}����!h��3�	p����H!{Wm���5����)�ISy�k����������������rss'N��{Kxr~~�_|!�E7\�z��������G�����K��P��r�������;��q M����q��1���RSZ[[srr���.x)��5�N;i�$�n;::�~�i����o�>i����U�V%&&B)0�z������4jr[�����&MQ����g��EC=�������b���SRRl6�����!A��#Z����N�>�t:���[�HY|'�2�WSxp��q��������mH<��g�f(��i��V��\QZZ����������AS���,�@WWWdd$��S�N���^Dh���
oC� M����zhnn�8q"|����f����^i
A�C�"�k�% �P^^�����~hJEE����
/�4� �!M����zX�~�����wv�����O^D���k��6$Z��������qc������[�l)))�����y3�
/(�%v���mH�.����$��=���_|��]�v���������~���~����v����HS�\�r�n�777���VTT�#|x�g�}v��������L�%5�y;a�%x�
����.\hll�-�Q���l����<�9������v��K�(��>f�+�AM���mooG�455��&)/��E�;2>�)��F���0�<
~����w��[�OOO��#Hb,f��UXX���i����)<�i
y;aiiimmm��@P0Iq:�Vz��i�@o�1B�i
y;a�+����?�������a+�]YN����^���?�)���_�G M!~�W����x�����������������A�ZQ/f���\�'o'�@��lSSd���S�>xk���|���\��k��p����������j�+�L6�+��Y����G��>�z�/������������:{�V���.LLL7n����SRR~���I[��R��K���khhhii	���SG�HS<��������"=�=k�?�y�������_�p1s����$u����������$M�wHS<�+�������vb�j��������������Y^�/!#�4����K���/))�H�YC��� oNN����O�%��+++����\�������R�����7�����mmm���{����/&.//G��?�Y�	k��G�FO�A�i
<��(((���������m���}���n����jiK��� �������6���EYa)322��9300�u�V-Z�S���8^�b�hv����y���<�6a
<��g�f(��i��V�)�i�~M�����W&����JK�,7n�XDHH���;���cb���'O�`]]���HS������� ����8�v��-(������~Ld�G�.]��������Z	�)R���\)�~bccq�/����W�^�X���k�k��M[R4�c�b2� "�)���Y��Yd���B������?���e�8�L�
i�,�����A�"�-VL
Z[[U�t������"�
b���1�y���p�t�R�C��3a�)���_BF�i����1���}��_A������3��\<����x�%�
��+V x�����P($��:&���h��0���N�t+�Hzzzcc�vb�j��������'���GFFzzz������bK
?]�����v������*Hc>�����w��5�U���������I�����x�W��7��w������������������^�dIII��%�������dee��]�T���y355�EVTTxVa�JHSdy������"qJJ��f���+��.o���
	����u!��4E���K@� M�8��U�����[�J;�����E��|���� M������\������j��h�{�X�l*9y������u!�Bk����_��A���{��jx�����)�?�����]YN����^���9�*(aaa===�U�4� �!M����`��YJMY�r��u M!~HSdy���m�����{�ZY����4E���K@���v�����^�v��:�}RR��e,�����8Z����r#���?_��W@��w�����+v����������bA|	��w���-���>�H�)UUUW 5e``���������555G	��K�(��>f�+Q�k~��)~JOOOTT���$�4��244�����������TOF_�G���]�1�X��_��$M�_rss��l������MS�����N=iiimmm��@P0Iq:�&9�*�~�oS M�[JKK�[644X_�dk�������v{gg'z�������~��5z�X�]�;}�t��n���@�B�<������p���2�����v��m^)�4� ��_�;w�����/_������z��>�9kE���TUUa<�J��)�����SMMM����.LU��ve}8
r��zY0�RXX��K������D���l�������p����"��5z$���lnn���������������Hs^�b�ceP5��:�"��?�������B����>�y����n%QQQEEEz�t|��1Y���9�i�_�����>�6IS�Ma,_���ph'6�����k����7++��*��)�_@�"�k�% |��_RR����vC[�������������E�����O>�d��i���3g���g��y��_s������ZBBBNNNkk�X���{]Di����_���G}A���0


���T�n���L�����o���?����V
		���X�������l�(8���+���4%##���3[�nEp��E���������2w�\�5���GzP���&O��E;��	f���+��UIMM���LlH�����7����8�t�Rhh(��V�YY;�E/^�X}��O�<��)���,888(��3��K�,A����kjjHSL����-���O�/M�!,,��R�'66���Xf��6���<���b�O<!��5%!!��Y��SS���Y����Sa <��s�����B;����)n������?���U51�=��������b�b�"���o���:K�b�)Rh�V_&--��;�).K�g����~���N�3>>~����#q�����gd/_�z������X�Xp��E��g_!!!�����}��y2F���������(��i��V�c�����v;����ISp���������JJJ`$??_�v�Za�,/\��o��&�����3G:�WVV
������<�������F	s8�Fo*�)���_B'��~�����)���o��q������?��?a����������Y�6 '''!!�������b2$-���3���������wk�RN�4�UmM�}�]bV��;�Ba<�(���IS|A���i
�O��8%%�f��Llx�g�������z�"�w������$M	T�~��)��Z����[�e�������;}�������o�;�{���v��Y^�/�A��)i
?�������>�hxxx\\4�����5
LHSdy���N�~��)���f@�"�k�%��8��_����!M�b�
D�]��K�5�������������ey��<(�u�������������c�����~��u4��HS����{��>�9k�w�L�)�G��		�3g��y�����RC!M1���mH�)���_T5l��	�������7[_7�!M1���0���8�i^�G*SaQ�������G�����p���]�	3�/���W�.���#������������G�����p�WS|s����M�|	��w���;$��{W��j�*�WUU�o�^������j
�OU
%�7a�%x�
�3���(�5�M���
��#g�e��!�H�kg0�)�B�pf����������$��t����(�5�M!�5�=�
		ihh`1���L M1��v�/F�z���D6��UQ���f�k
{�����_��W��?#M1��v�/ M����:�_M��5z�v����}}}c~sA
��V����i
�_������0U������4�Y+�e�i����R��e��� �����~�b�%�Y��5���T��1gff677�Lll�������Vi<����[�E��7��1c���J���---�;z�/��B�������) **���HObc[�����K������,�U����7cW"�����6�)�����{�����vbc[��??""���K,���ADZ�)����dwn�2x���K�8��M�6��(44������kO�0!...++���#�\hy�122����e���0;�^�a����t�I�&�]����f�Sc�����c���S�}���H�m���������s������g���H<m��O?�Tv�=����/�*//G��?�Yc��5��'�J,%>>���D#����N���?��E����G������E��?���axx8|�)����������b�",��~�e}�mzQS�y��3g���q�EVWW���-\�������o����O?����/�l�����������W^y��Z�h�eee�����}��-X�`�Sc���������{������ug7o�<Yb������7;;����K�����+V�3�^�1o���g�����)c&VRPPgPMll�1���������}WW���s�o�po����>
sHW���#�<��S���NHSdy=nFp��ii$�vD677� z����ZF�5`����#�;4e+i��DV�o���W�^�Yps(K|��I��K�,���=�rBB��q�0 }����LM|b�)z������ebc[�d�*/�����G���@��E��/^,������!U5�CBB>��#�AB'��)�K&��-5�rj������111�
�7������<����
���G�0��M�6]�|y�S�%���|�V[�+����  8q�De���B�������e��y��A�)������XL��2�����7o��1#v�� x�o�po���!5M��@S�lc^?���>�6y����L63��j��'�xB}����� f�R#���g��2e�8,���yJ�De=�e]�Z�������=���.]�x�`Z����j�W��������������,�k�.�Ymo�po����SO=��#�w�~x���K�Q��M�KJJT�L�0A�AN�8�j�����t�?~�S�%*��'8��/��+<t��Z��������1��)<�h
�8��ow�+H�4%b]p5��-J�fK-c>����W_U����3x����P��� g�����������S�������v��������Y�����;44��W_��1Cjg���0�@�d��H��wuJ�De=�����F/~�^��������1�]�f�g��
i���"���������m=
���E��+++k�l)]@�n�)�A�"��q3��mhh���E��crrr~~�_|�Nutt<�����GFFfff���Oj���z��U������=��s]]]c��.QYO=A�RFFn&!U�W�����7SSSYdEE�G�8�)z36n�8��}^�
o�v������4x#|���)�7oFv����n�)�A�"�k�%�x�
a�C-�;�B��'qJJ��f����6��!��*����7����9�)���_��`hh(//
���T��j�ek�������o��6��Q�B����=�`kC�'O~���M*���+22����S�bcc
/�<M��I����lx��@o!M�A�;�W���&iJ@���<q�D�jee��fKMM5�����4E���K@����<%%e�����������^i
A�C�"�k�% ��~������2�����'�4�����4E���K@hs�����wGGGo��������j���86��X�w���mH�F/%))I |����_|q��]eee555���_~��O��L��v�����]b)W�\�����������^��g�;v���(33�pA	BM�E�N�|	��w������@GG��qK|��l6������b�b�$%85���0�<
~���Y��>�)CCC������h����z�7�$��^��:t�A	NM!o'����_���cVz��iJzz�7@c3k�������N��TM��5z�v�������0C��`��t:��d����&L#����
��n���=hpppxx�$VEY~��)�g��?�)���_�G M!~�W����x�������������(��o�4�����G�gSSd���S�Xe}�m���;���^)�<M�l��������U-{����~e�� +_����p��)�FO�����y���[���������:��


���S�L�������:u���+�����������|���������?~|JJ��~�3�y���2�������l_>�CKKK������f`@�"�]MQQQEEEz�z�����������)�J]]]%%%K�.��l�0��������/������.\@���suf'M��G�)���H�@S��/w8���j��������\�x��x��>}:�����,;���)���_Bi��5E�{���F5�'���g�i���
��K�6l����1i���k����H����+++�����������;�N�
��g�>r���m�������;���H��999����>�4�K���_��k��M����t��y��&$$�����c6������Z�����)���j`lTMlT��{������������b�
,Z��eeeN����c��},PM�����%��v �osss{{{8��yyy����8FWU��H[����[�n]uu���e��_��k�=edd�9sf```����%�4����#Z�'� �4��'�J�Jjj�������j���0X�����������?~<�q#}KmLP��cG���'Y���A����|��78�z��4822�c��k�;��%K���bBv����t��h�����4FOA���,888�`dd��&5�������I�����Ha�5'�������Z�t����X����$���dffn�������)�����(���KG��,�i������1�x��k����~��izJ7PS�4<<�����Q��v���@�����4;�����y}��'�J��Kkk�jb��N�zJhh��H(���>;e�i��y=���1���$S�4E��V�����*�v��
i�,�5�Nh@��q^��`��}�vW��4���{_s��u��WDD���?~��W�{?[�?~�	�����B8��!NMa��f����t��1�k��&M�� ��K�IM�4E���f'4<M��}zzzcc�vb[�o�����_�;w7�����0m��p�B���ob|njj�3g�t\�lYYYYww7�������������d��?^�a��r�v;n�q��F�3(c��������'P�����������������k�M�@A��&5�~Dk��������l��q���og�[~~�}���o��o���/��������������+++���D:VWW�Z�*117�S�Ly������D�����&M�
��%��Q`WMv)=�������T�1��}��K�,A��i4J����U�5�����h7����#������A�Z����)R��������3�W��lHSdy����)n�3nM����[�J;�
i�,��������������U#4�]���-<����Z	���BW���{�YV
�4�7�GO�N�O?�5z���9�*(aaa===�U�<M��I����lY���w������$M�Gf����C�������4E���K@�Vl����w��u M!~HSdy��hoo�	Jll��k���C�iJRR��e,�����8�i^+1B��������
A�F�\�b�����kkk+**�����Q�+x}���q�������RUUeq��L��������.466���%#�/���W�.��������9kE��Ezzz�����$%%��[Fj���Pooo{{;�SSS=A|	��w����ve��4�Y+������4e�����l����n��t"�IKKkkk���I���������� g���RZZ��������� ��v��Wv����=hpppxx�JoW��� g���y�k�����N�>�+���)��FO�N����=a8���o�m������N�4�J�^ig"��_�;w�����/_���������M�����
��W�&M!~�W�N�jjj��tuua�b�����M��3@S
]})X�����@�W6�
����_;�~����>�6IS�MA������f��=k%��������?������Z��hN�Un��F>p��?��������������`��
������$���X�����������S��u�]����Z��k��G�FO�.��g��/w8��=k%YF4��i�"##��=��"���G�.1H�������������2��}��n�:1����6a��������#G��,���g�����O?�T��,f������0�������t��y��%$$�������,���4E
���/����T_0�CS}}}��=k%e���.DN�<�������.\�������_/�������6�����<~��x6$$�U�H&�s���Vy���q�������[�"�h�"=vmx��[����>��kE;��ft�@�����qR���VRf�~�:"qW��P��X����G���Z8y�$��%K��g����D$�q]]����?�<���g���A###��!�!M�A;����)<+���w�����-..�Y6PS:;;�<%&&FYQ)X�8��!8q�D����$$$(��<�<f�[w�X�B�~xk��h�V?��4��'���*(���ommUM�Y+)3~�������gA�)���d#6�q�lhh����y����"M���~�e}�m��_����IKK��i���R0���}��_Az����4F����������D���g_^�z����+i���T��gV���\�uu����Chc^?���>�69������	����v;����ISp��������Vbo��q��%������A�a
\���wgggK- �xq�^�[�y��71Mhjj�3g�t<g���-BFWk�����m;�6�)���_B'��~����)�36n�8��}<���	�������}CCCnn���1WJNN������/��2������0����������c�#��[����;v�~���qV���Y
�i�,������ ����4��������l6��
oC�9{�,��|���F�B���G��F�,�$��������?�Z����[�e������N�>�t:���[�p#��h�5�B�~�3z3���4�Y+?�eA� ���4��������>������qqq������5�B�Y^�/�P�r�������0��qee�[_C��EIA��)i�!�
oC� M������OF���������1�������^{M�� ��WIS�lxi�,����~�`�Y#dee!^�l�4�X�~�F��e��� h�^������/~�B�����:���+W������O�Fz����A�������FUCi��6$��������_�r�=��?������!>11���Z�A1���_U
z�c�]�	3���#M��5|�����t:e���]F��`AAsc=eP��l�WU�^G M1���0�<
~����>Ri�����x����A�C�b8�7a�%x�
�3���(�5�M����r��aY|yy������Zo�������:��)��F�v�&����_���c�����9r�� ����l��f����ZAHH�[��<0�u�WSx0OShn���������|��u:�Vz��_s�����Q���^B������_�~��[��?��Ob��ZArr2����cX�vU5�u��-��!�>����D��2<<l��+��i��V��e��������{�����G��&s�����I�&I#�����*
z��-��A�"�k�%0	��C M1���mH�d�����1��`��+��i��V���HSx��}���#������0U����F��������w�]})X���mH�d�p��A�O����^�HLL\�`��C�d	�*�������I�������31gs	��1DEE�>�&S��G=F8���>}����
U�� �� �?����~Go�%��/��y���Y�f!��>P&0� ?�4EOb�A���HOb��bFFF���U?L��g������\�A���	~��.������<�M��*APA��'������vb���Y@��W###��=���:��I��HSdy��"��544�&�������{��s!���3����*/;.--�7o�orrrp�g�����j�����_RR�����������u���1hR�Z��l��exxXjADO�:`v�rI����B^'L����u���Y���5k���XO����_�Tee����cbb�������O�s���)�����DVa�s��M�������=�b���/�a{h<����������h��X�����]u�"hn,|��q����`��������q�������[�
������)<�h
������O51gs)-tuu!r���b��/�l��~��������~g_y�m#�Y4�p��SSS���������������_/H�,�
��2eJUU����?����NAA0��d��q� ����{O���~��_"����,�U��w�Aw�q�����!f������W#�����j}�����e}8
r��w4w2�N�b�)����2���	��%Kp\WW��555��R__������;=�4EObU0��6C������_�.Hf�2FFF���;��hgQ�����_l=E��T�_Dss3~���b�����-CP���wB��]	AL[p��?`F���"����ch��,�U����bA��]KHH@_���@2
���K�.i4�/x���U���k��M���{4����x���@RCS�g���K�����GSp}���N���0�e�=V�2���Y�����~�Xog�e�`�<�����g�IIIa�t`H7/U3��R0zK�����QmR�Y�&�����!���I��_q�I�0:�S"��L+,,�LL>��cC���NSx��e��vRW	��������-������r4F��[[[Us6�����~�H�	�O<!�>�����}g�C�Kid�,�u@�����{�����~5����)���,�_������j�������q�=%��//^		y���p�t�R$�
�jv��4E�i��������Dk�g_nU�[�����<�D�&����}��_A�)���&L� �O�8!����z�,u���@��7o����%%%������������=��=:LM�>1;v��_��3�i���Y�b�0�7�P.H�jv�����=~�)��5�E�9=k�nU�[�����}�/H$==���Q;1gs17n��t����S������;44���3f��K����YT� ��SO	�������2���0�����w��^�����������;������n��
�[��n�-��3�x�
at���8��Kt��>|X�d��Y�f���i��GzP�����������0{�>������du]J���q��1��3DS0���������_������~^�4����Mci���������4��������������6HX~~�_|!����������^�U}�8z�wpiA�O�R�����DW����7SSS����
��_ M���c)?g�������z�"\���)(%%�f��Llx�&>���###���)R������>}��t~�������)�i���Z����[g�@�vghhV�l�����H	�^VZZ����������AS����]#.���"##���:u*66��"������E y�_���&O�,�����G�FOXFss��������6�M�[i��������f����g���p��z��R^^�����~hJEE����
/�4� �!M�B��gY�~�����wv�����O^i
A�C�"�z�r�����wGGGo��������j���86� ����4E��^a�������w��UVVVSS������_�T_���������=�r���nonn�������G����>����cEEE�������z��-HS30��AY+KWe``����������%>J�6�
.Z]]}��A�Y�����|SShVN�A\\i����������v�JSSS=�`���/@PT�vA��4+'����_���cV���iJzz��E�Pg��Y������&	Jpj
��	3�/���W�.���.�k�"�� �<M��5�[4+'����_���c�2���IS��4�gq:���D���d[[�wa�%x�
�����3ze��4�	iJ0��222�.�;��������0�<
~����wH�"�BMa���
�0���uE�"�ZM!�4�$�p�� |
Z�'� |�� e��n}HS����w�����!M	fHS��Q�k���4%�!M!G���[��`���	���=H����]�]b"� M!�B�B�)�]HS�@�4� �i
H�=AxZ�'	z�� �e��n}HS����w�����!M	fHS��Q�k���4%�!M!G���[��`���	���=H����]�]b"� M!�B�B�)�]HS�@�4� �K�iJRR�@+qqq���K�-D�Y:<g�	�5zp�������\[[[QQ��&p�q�q���o;#A����200���q����������D0�+�����'x�	�p����������v,MMM�D0�+�����'x�	�p����T)�Smkk��&p�q�q���o;#A����222�������}}}=D0�+�����'x�	�p���;� �	��{�	����� �� M!� ��4� �0
�� �(HS� � M!� ��4� �0
�� �(HS� � M!� ��4� �0
�� �(HS� � M!� ��4� �0
�� �(HS� � M!� ��4� �0
�� �(HS� � M!� ��4� �0
�� �(HS� �Pj
AAAAA��t��
endstream
endobj
34 0 obj
39644
endobj
35 0 obj
<< /Length 36 0 R /Filter /FlateDecode >>
stream
x��SMs�0��W��=t�		$�v���2��������T����"�����	y����y�������h��(��`4x~��2D(��_�AVn/8�h��lf���!�<�O��b<��/ 4�'�L�]��hh6��5� H�c(����p�?2I��������D���o^�d��7^@/^��fy�<I���v�����5Zs���5z��t�&���S���%���2)�uM���y>�k���%�}8�������A�U�Yj���V�:�������7��T�N�[�_Lq�]k%*)�����[�����b�[EUu�4 5�����A�$����O��"��PD��������y�Mw�X	���	�*�@J�V;n	g��$�����^8�
endstream
endobj
32 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Contents 35 0 R
>>

endobj
36 0 obj
385
endobj
38 0 obj
<< /URI (http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-andres@2ndquadrant.com)
/S /URI >>
endobj
39 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 225.996 418.096 512.28 428.896 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 38 0 R
/H /I

>>
endobj
41 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 403.696 99.0 414.496 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 38 0 R
/H /I

>>
endobj
42 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 105.996 403.696 487.956 414.496 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 38 0 R
/H /I

>>
endobj
43 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 389.296 192.024 400.096 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 38 0 R
/H /I

>>
endobj
44 0 obj
<< /URI (http:http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-andres@2ndquadrant.com)
/S /URI >>
endobj
45 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 168.012 202.873 490.932 213.673 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 44 0 R
/H /I

>>
endobj
46 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 497.928 202.873 548.616 213.673 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 44 0 R
/H /I

>>
endobj
47 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 188.473 545.304 199.273 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 44 0 R
/H /I

>>
endobj
48 0 obj
<< /Length 49 0 R /Filter /FlateDecode >>
stream
x��X�n��}�W�)�6G���2�g1@vv
L��E���-���%����I�M�������[�����NU��N�_�#��gY�����?,�,J��$�2�� ba�������0��[����_��2���4p��f����������x�b���.xc��?�}���/���F?���������L��q{>2�0��2�X��<	� ��c����o��f��j���iv=���}���,��~f����m,%^N���QN6s����-����g?h3�e��??^:���Z�����>�c���#���	�:��+\�i������~�
��_����L�_��Y�	
��88b���m��~1�-���� W�S����X��ao��y�\�����L�CS��S�����!�	�	%j���l�X�i5�T�+�$���������t�+F6��������o:���W���K��2\�z��7����gI��^���9e���Qo�a*Gv��]]|���wj��}��H��i
���*�#v&*� ["Hc�<���5L���� QLo��ea��KS"%�E;[��+�=�����������{k��S]3��z�U�
r���hp���{���8���f4+D������������).g.��U<;(wd;G�kS�6��hz��=�Tkm�,��ADV�D�0�!��[dl�I����1%��[ie�o��S[����W�	�:s9��r$C��T��.o����L�<����v�q���Ibo�3a�{m��Ou���5#q����������O�z)���fEU5��tX��B�"�m��sp{uE�~t5�8���d��"���L
T��1��wk+��@�����d��L�S���8�xQ��#���aP/��9��}�(��g��a<.��`XW4-%S�����@��
����Q�D�I#���0����e�[���dd"���*��oL�����)�iM��$T�z^u�L/�|E�q, i��i��H��C_h�2�h�PA[)<��3�XW<Q�Q${A;3q�[/zd��^�=�)�}RNw'�sT����VR>�rD��"����0���sv@\
�H�<�iq�
���Y�C+'[�����I8�T�a��b�V�����IWZHr��2��BC��R(-�}l�*V�������RCGBG�C�(��4PX���+ ��_1��El�h��0���vK��H�HV�%�%��#�0}`��m/h���/6���h+ J��o�<*^(�FT}�)�VZ����F�����W��$�dE���]"�e�dP�S3��������%�~C�m�C�+�t�
�0�a �r�4�j%l����H�#���0@{t��_RO]�5�
7	�k�0��
oS���J���K�L��I�c������ah���2^Y�����	wL�\������w�����t4�<�\�w�}����@�L�<N�!�T<$�f|0 ��"~X�$M96D�x���a��a_=OE5�1��}3�8��U����o���c)jO��,��������CK!x��%�x�������_��k������D�C�g��uPO�Q�3�R��w����Q��zpE�j����%�����Sjo	��dxvbN�Y����YMNQWK�O>�q�3�����[�"��UGTF�V����gP��)a%�d��Py��G_*�6�wf�9��n�i�S������a�
�
�E�g����}��q�����X
�A��S.n�9������j}h�!c��xq���l�vd4���!���W�0��4������	�e�s��N��& �+7,�4G�����U~;���)���pCDci
M��y�9��Y��;K�+��[2����
*��������:u!�s?���3��tE��6��O(��vR�I��y���le��ZJ�<D�c2_,�j�%s%)�"�"
�:���jt+���d�\���%��bl���4�T��r�1�\0��S�R
Q*'c��b��V]0�Tm�6�ZB���P��Bf�;�����Z������]�=�����{����Q����
�-�����K���;��[5[�FM!:���K�H�%��d�
N�e���:`���j%��D���S
�D�Xn�tL7<�����cm<
������{�)�v�SJ�:K?�BF�c~N�8��&�~�vyt�����P�����M���v�zs������C	��������M�;4��e(�]�]|v�h��0��M�O��RRM�;���1����<g�y��<B��R����h�o@��U�����'t����������^_�h���t��TW����pG�`
endstream
endobj
40 0 obj
[
39 0 R
41 0 R
42 0 R
43 0 R
45 0 R
46 0 R
47 0 R
]
endobj
37 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Annots 40 0 R
  /Contents 48 0 R
>>

endobj
49 0 obj
2353
endobj
51 0 obj
<< /Length 52 0 R /Filter /FlateDecode >>
stream
x��W�R�H}�W�&�s�4�>A���Y��l�(!
�*�8�����%���5v���.�4��>�O�X=�{~�(/RPKJ~��^�S�p��r�%�M�,"~ ��i�����F�I5���|��`mz���G��ww���y����]��P/���1����w���`��J%c6�����Ca
�1���
����:��:�	<�\������{~^�A��h� ���*}*�pt���a�@���i^�3�/\�Bs�X�w��j#�{W�i��q����*�m�j�e���kq5R��o��i�]��/C�\��Z�R�0qe���U����	n�TX��X���Sc& ���5&p
�L��J�2�������j��|��7��!�_����F|�/^<��������{�]2�G�����?����@G�6v;����\>a:~�gY����%J�a���x������
%��v��t���@�|�0�rx�:�y^������fY��q���0���hD�v�����@��T��+�C58p%ZiS�4\��\_�_����/[LX3�����_��[lD3��a�f�+�b�y�0?�0��B�SXV��D�_��g/�'^���Y;!�>����5�%��b���	�Q��&�I0��5s�^V?-p��xLt�A��m/mn�@Tb_TlOT.�g���{������kp-h�i����,�F��k9�Ff��rf��!����6��h&��N����|��w~B0�]0��@��HZ��T�jo���|���,�A,!��al��&��`�U5��hg�	�'D��E�Iw�l�Y4�um����D�W�`�S��/;[�;�.�q-WRBl�,��`���^���1z]��>�"%,�pB�[��	�Y������of�7��>U���U /�}�
s�O��(
�x������C'Quz^��a1���%
_�L��)�)��~�.� \��$��`�EPd^�{~�T��mo���c�C`H�A6�	d�2�0������j�-� x�]�xSPo���8�����7�A0b����pe��H��R�<EW��D1�i�"`���|�;�Q.s�\�&��T�����q���Q�<��x�a��%(>+t�O���W9���P���r��/�T��b;�m-G.��.�uVi�R}��P�����@>{�H�������BX�<�e�<a��qX���Rj	����L;`�v����"���
endstream
endobj
50 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Contents 51 0 R
>>

endobj
52 0 obj
1240
endobj
54 0 obj
<< /URI (http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-andres@2ndquadrant.com)
/S /URI >>
endobj
55 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 185.64 572.582 386.964 583.382 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 54 0 R
/H /I

>>
endobj
57 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 393.96 572.582 537.288 583.382 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 54 0 R
/H /I

>>
endobj
58 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 558.182 430.656 568.982 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 54 0 R
/H /I

>>
endobj
59 0 obj
<< /Length 60 0 R /Filter /FlateDecode >>
stream
x��X�v����+:������cVQ�8�e�x���Eh��A������hP$i=c{F"X]u�q��������K��E>������=�Y�0q���GA�y!r�E9���]>r�����'w�~�x:~w�����}Y4��s��;N���1;,�W���/>���������m_-n>���j��2��AC2Y��<�</���f_���;v'�d��K��
���k�����l��g=@G'��0'��?(�]�+�1�m�j�Zm������.�	���q�|�
�>�c�]����=�����W��?������5�|��?Uuw,�?���6�x��y��QGe�{����i���������	�G���jJ���^T���c�e�(����ak�t-E'K�>A��+���d�,�������O���oo��Z��<����>��w���PW�e=����f��7L4%�s}d�^U�j���6��f��d\�����J���T%�0���1`4�a����L�p�����4� ������jUU������@�T����Ad+dg�����1S I���zC��OjY��hJ|X��)���
Pz�Y���P�d��7�� �mD!�+��$�i�^���`�V3a��s!F��5~/�HPG�����G)7�Q�s/)b6W��f�������3��{9�	H������X-�Fvom
����}�r{GA�_��#����
��#�E�z�����~��U��sW�
St�������U��������!H���]�k�mZ���B!��H�W���A�o��1��P�/C��A^� ���$����d����I��)�������UK�R��FZ�g}+�FU<����D|���e:[ ��K�o�,~���L�i����J��l�S��J��n�n�`E[���U�����rj� ��O��B���s�T
S���x�k41a�������� �T�m��*���^�x�34�0�� ����N���i���9��1�_nnDW�Z0C�X����L�	(%I�T�Rk��KU��i��q/�(L�e��*����R������M������E[_cN���Qq���~-�*I���2oN���N�]�\�5/���N��F��
 ��<�|�����PsZ�z
���=j��T��I���_�<�����;Qid!z�,���)]?�;-��D��0r�A��l�.��C�������^��$��%\or��h?
*�B���!B�y(�4v
�WC�������"�;AnG������b�.��C����V�����rro��bHm������w@?�/��n 	�C������z��}����Khd�{E)u��5E�F���i��!MN��W�e�x�����E(G��>�N��8���i�A1=�U6�-�[�����}a�N���V�"q?���6b�q:2y��e�S�R7�� 06 v�i(���P^bM�B6<��)�O�i�Qx�3�Bc(���-m�5�v�V�/3�].���w3�m������1&�m,���l����P�F���i��	f�����2��1p#1x�D|u��T[;Iqw�7�c�C����V@8z�����Dd9���	w�g��/^�L���O��a�|�8�C`����	�L����|1���Z�[�"%��^��z��%Y4^Sp91�`��lXu,!�.���4�g"�<H����D|���z�������%�{�
}Q`�1gU�A������bc'�ql��������FX��4� 
yA<��WK�8\�?��`Q�O���z�/�Py�����p�r|�hF��]�V|Y��!����,��;��<�#��I�����[��9�h�SgXO��U�DN�9'��)]���������-{��;;���6���^u���E�d�I�n������]����Q����x���j��h
�R|���X<��G��[�$�y�;m���^�}�$��)��j�e��"0}|wvB�.�n_��9K}�D����x�{�r��
������nW�^�|^�����
endstream
endobj
56 0 obj
[
55 0 R
57 0 R
58 0 R
]
endobj
53 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Annots 56 0 R
  /Contents 59 0 R
>>

endobj
60 0 obj
2049
endobj
62 0 obj
<< /Type /Action
/S /GoTo
/D [50 0 R /XYZ 54.0 229.889 null]
>>
endobj
63 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 362.652 757.289 506.292 768.089 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 62 0 R
/H /I

>>
endobj
65 0 obj
<< /Type /Action
/S /GoTo
/D [53 0 R /XYZ 54.0 455.675 null]
>>
endobj
66 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 270.96 716.43 428.628 727.23 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 65 0 R
/H /I

>>
endobj
67 0 obj
<< /URI (http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com)
/S /URI >>
endobj
68 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 239.304 661.171 459.588 671.971 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 67 0 R
/H /I

>>
endobj
69 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 466.584 661.171 495.264 671.971 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 67 0 R
/H /I

>>
endobj
70 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 646.771 545.304 657.571 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 67 0 R
/H /I

>>
endobj
71 0 obj
<< /Length 72 0 R /Filter /FlateDecode >>
stream
x��XYS�H~����@�:F��+�c[l�
)j�[F3��_�=��X�(�!��q___�����wO�x��<��(�����,�>r<����[`��N|����M�<NP��T��O��>������%|�d�v>A�9��
w����:���P���c:,:Ga���
�	���d��������������a
�����%��-K`�x���:/`�/��&�_ �kc��������f���oT��l�������r.��� �J�e9������V(���okzs����3<������-\�G)�8�������k�mS����e���Z��L��},��`��������	��*����I5[.�<Y��%r�|\@����o�3������g�^g��9��[��9��������*% �������P0�9K��< 
k��Nl�@z�@fQ��aU�W	K%J�'E#%Bi��- �!�b]l�Y�5g�������D�I�����rB1���hI��L��e��8���\p4=��Xe��p����p\�mw��J�,�+��\��dgE�,�Mlb�U@�-����e#�S�����P���*��L�8�Z[�74����^]��l�(���4��
bSU�X�V(@�"q&X1g+��Y&d� Z>$m���h�t�W��}i$]a���A9M�.�e�=
��,]n9�-��L�|1���ShpS��C�I@�)���0Mt[�=O7LWa>7�
��n�W�/�X(�`"9HQ�'�Y���-�����7�����qN���������3��=���E�c�{,�j��J�����o�t��Z��M���n���(p|Q�%�����������4@�c��d�xJ��4����JHqYR����u�_�c�W�eX����a����`>��4b!19^�w�X^vOei�D�8H�G,�T�z.^IqDl����.�1���m���ul���$R��Sp��Y#�b�w�2�������ep�bl������_+�o"�n�NF����	�M���4���W(4�b��G���X%i�
��i��G���Ll����}�:!
����x�?���.�N3��trV;����p�0�O����|*c~3�Y0
�CX-.k�������3�
q����q!q4��.8>9���J�h�~+��.�b��`:��t6�����$��-�{�$)��������dZ\���^��C"���b�8SI��\�5~��u��b����� �.j<N3�/�b����� ���P�J�����`�����x2:?�-d�Tf�&����+�?���v�(�%��N>q_Xe.s'.���Vy�vzkQ����ta|>)�-�>���U\a��O��3^� �q�`�"�{��W6T�P����k��r���S��������{7��n�0���i������[��2<jqZ��e^,����ISN���W�i0�������K��:Uyx	3�f�GMv����S������Wy�+�D�5�� &O"���f�z J�1VC�W��V��	�#���Y 9���l8~���Z(U�P��P*>%������M�l�
endstream
endobj
64 0 obj
[
63 0 R
66 0 R
68 0 R
69 0 R
70 0 R
]
endobj
61 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Annots 64 0 R
  /Contents 71 0 R
>>

endobj
72 0 obj
1573
endobj
74 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 226.956 604.082 533.232 614.882 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 15 0 R
/H /I

>>
endobj
76 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 66.0 589.682 93.324 600.482 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 15 0 R
/H /I

>>
endobj
77 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 100.32 589.682 534.996 600.482 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 15 0 R
/H /I

>>
endobj
78 0 obj
<< /Type /Action
/S /GoTo
/D [32 0 R /XYZ 54.0 769.889 null]
>>
endobj
79 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 128.328 355.054 286.284 365.854 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 78 0 R
/H /I

>>
endobj
80 0 obj
<< /Length 81 0 R /Filter /FlateDecode >>
stream
x��Z]s��}��@��&��<5�4�������t&�,�k~�e�����)�KJ�83����-B����{ ��D0?����q����u;.X�3/l����n�8s��	���
�����yr��b�bu��
�����Iy���N��)sx�nO�#�w����[C;	���u8lu�������	��_��L8�u��=��$	�qv^��|������Q9{�t�*�eU���*[��O���{;@�>n���^B6a
��69�i��dY��T�Y�J[=��B�q�u��>J����=��a��V�yk�k|I�F#�Y��{zkj��nG�o�a�F1w�Hl1��V�c��fV7��K����;��#w���Of���*uv��Ys���m��Y�.�R�y)�j���J�F^�]b�{wB0=Fz��Y��ZRm������6k�3W(Ybp-v�����iV^UW�������I����
X�kUNq��!"����X�e�R>�a�JU�F����c�����r��J�.��V�����EI^T����j4�.�z� xL����"�0�
�C��J���v�+�����h�KY"��*Ud�`��3���O�'.O<�m��w���o��6��j���^�<������v!����m�����Yb��r��[z	$r
@8�A��uc�Q.�����M���
�Y[V�j���{U��K��&*������M.��2�6v7�&�`q�<�!�h�e�
�\������+0��f/^��@���
�}�Q(��^V-?�]�,xG���n&�� $n�f�LS�J�:KaAn����k
�5�[��b�!&�]ns �Q���i���=�:%r�u0�M���+	k7�0��-S�dl$��b/f���z>��j�Y��3�X��t[�v��<(#5bK�rYW��(�!�'a�:@(F���{.m�'f�����~�=A
Q�N_7������������9���Y��$Wj��g�#\'D�� P��s].�S�B]oeZ������������[��� 
r��m���Hd R� ��"��L�7p))���������Q�]^��H4,5;��.TpL��35��%�1��H_��(������P�2�����[S2)�������V�0e�B1������FI�����q=�����FF(�j�%���S�6C��>�d��(���BYn�)'���IE�Hr���Z�!3(���8�;-
�5\P���wQ%J3S��
����	��=H�(���q�_�Q5��c��q�������U�m�
�7���w�|�
���E`���} �!{	?�M�����RezV���Yn���8E�X�<v�vQ��g{�����\��w7�.�"��.���J{8����.8��m��<���m�����UV�
��s�q�z|�w��7�M�n�j��Clk&���v�J��
t�,����I�h3iF,%����k)f�U��U�:�,��w�������%���r��@�	Y���A��ohe@g�������GA��z�-� ;�[���"����mw���2[m�Sqj��s��
���[#$�����+1�D�3&M�������]8F��$$�Ly�
�Ig��w����NH�}�A�)�>(����
�yny!L%��UO9tuu$����z9dj�6"�R�t+�y�-
�FZ���i���	[���F,��!L�o�U�9P[jW:QFA�����	�t
��)���nD��vp/�?��m'�A��I�o�G��XZ7�@�s�yEq1��B@������k �I��@*�+����0F-�#B;P�(�����5)%�A��B�����!�!I�����Xx[���,$���������@��=F�HYIb����9����q(Y��d�B������EIw�uLkU\�][���S����N�f��b�:%�*��q����
��,F�D����T������4�.���.��yN�F?��{�1��qd���[������Qjx�KA�P5��tq�p������%�U�Amt��|!�~��]����L����s�*!��)�{��I�\f����(L���{�Y�����0v��-l �M���������G���!����N>�q�p��y�Ro����+��GE���^t����{����������:R"rp�7��8x��WGZ���x�e���a0�9m8��z��/����~o�8����%�/���<tC���x��xHe�9�7����,<����8�(�x�A��3��<b�v�$u��_���_��S��6�P�<�)Dz�2%��K/z����D���5������=0��{�8��p}x����M���s�����b��Q�����x~�E����p?@W�.���������9���}�4�-��5et�k�������{���M�
����>5�(���:�@���x�����1B���m�#pu����������}b�[�%�V�r^X���t}Y���q��C��~l1��_���#!~����D=��l�����^z�8��\����:9py������D.��`B�uf��ur������=;d����E��2������h1��"�h�������{%�}qLN��m��{����W(���9�g�Z���A����!?oL�������pz�>�������kY~��z_�,��o���OY���aV(�t�H�oe�g����}�J���?�����|$
f�q�4x
4M�{H��T���>(4)���`sf�=����1N�����{5Y�}����J����� ��x��Po���P��?��l�
endstream
endobj
75 0 obj
[
74 0 R
76 0 R
77 0 R
79 0 R
]
endobj
73 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Annots 75 0 R
  /Contents 80 0 R
>>

endobj
81 0 obj
2804
endobj
83 0 obj
<< /URI (http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION)
/S /URI >>
endobj
84 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 57.6 66.961 517.006 75.6 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 83 0 R
/H /I

>>
endobj
86 0 obj
<< /URI (http://www.postgresql.org/docs/devel/static/sql-set-transaction.html)
/S /URI >>
endobj
87 0 obj
<< /Type /Annot
/Subtype /Link
/Rect [ 57.6 55.441 318.095 64.08 ]
/C [ 0 0 0 ]
/Border [ 0 0 0 ]
/A 86 0 R
/H /I

>>
endobj
88 0 obj
<< /Length 89 0 R /Filter /FlateDecode >>
stream
x��Zks����_��~��D	����M�t������fg�a�>d����sP�K��ug6�$���s�=��7g6��wF?B��a�e~vc�m������rWx��0q������?,"5�a���,QW!�{�
:�k���8����~�
+fq������.�>tt0�������Vg]��_y�lq}p�mq!<��(�-�[���^��5{+oe�~�u�*�uY���*]��������}��s],�!��ND{F�����dX��e����/�fU�Zo�r�3���<�V����
xD����7c��U�n�p
/i�`�Sz��G����[]�����;�� �����#�=�����4���"�4�=RU
�\�l�jh��n�.m��#�K���P?F���#�pC�x(pm;���=z��D[������o��1=�
0�U�����3�
c(z��hf���9��[�u3mH������<���.�0\�&�[���:H^�Yv/�L���<<�O���~�K���O��C��YO�V�`����>#�>{�q��A��G�Z�c��=���m����1��c���~�*�`����Z&�}\��ld5�
���w~������A�������MC��)
��(�$��@A���'9y�j-������^�����5{G�����s�Z��v�gt����~�y��9n��u�8���w_�#����5�a
���eG�����w���������u��f������{*;���:}be��������9��v������a���3�#~B�(_���8�q�39�H��G���J��I���	�Ok���v�����{�=���`fM�
�wZ��7I��XN�"�K�'���!y��a��S�9���n]��������?�����4�������_	+�Y�*I�/�+�A,&�������:^���N��x##Xp�QF����m�+&���4���li���`�fS��*���4�z+3ZsY��z]6����w�����v�����@���)�m�f��[�����r�A2F2+��w�8+W,^.%�S��w��2y��b�m����b�]=�d6��%4�|���<as�*+V^����V~(��)����&����djO�I�["/p����f����F*�q��V��f
�>+�/�
��$��Z�8�,6��d%��r�%l
`�,& ���y�PC��e�v[��b%��U�J(�B�fZ����X�/�6�I���Bj%����b��?^��k��5%�|�Y�-X}�����RI6m�I@X��iW�:��=�:�����w4�����7e������i�c�(������\[=/sY4�(�Y/���4��-�-���2�h�bM��9#����H��s'�8������Z���6Q����;���LB�bs���9��6L��
��}O���E:{�	q�w��z� 7�5�
d�VM���c�t��n%��A�	J�0���#�4@S�ve�Y���,���m��)BQ�e2�����D��$P�"I�Nk��n"�8��.��^h�$�T,,�h��d��is�����}3��N��Af�o������m5�6�Zf{����!����\�������+%%��bJZ;�.��(��r[�j�HT eeL��Ui�H�>�����\��`�F�9T�����;&��<��R�l`�:����>^�t��;�@�i��'�5`r�����%Q����|	nlsb�:8"��������|UI����hLx���
�n��"�7�Q��V�b[���N+35�X%�<�B�cF.cT�H�bX�	�r��3�g�4��yU�1�wQsR���)T����@A�,h�D�f�B�Y	��������'P�
%tU����A��
Q��'�o���;1oq���BD�����%��"��D�6t��i���&N��@s/"#���-�8J�c�<t���!+��k8k1r��\�
�4QTQ����,I�U�%���[���T������SO3����9���9�Q�:xW�`|7jI?�E�����n[�96e��B�&F<e[��R�&U�5��a��=~�T��vRwa��H���"���D�y�G�,)P�T'�OM??�C�+�p�}����d���������?|~��(��!!�cy$.�8�#[S���	LD��G�p�S3�tin���f��B_���t�z���������|�8a*��?�_\��X�yw�./��_�~�8��]���~���B����>�T���;�l�����G��[�!��	Q����d�N�F�Vu�J%�n���|�-n���T��������4���\���B���l�xm��`/�3�/�F�����s�?|�t?����:~��F������>'��Z���D?��'<j4�\�x��;��t�D������2���i6?�����o��q7/��<)��<����5���z[h��$O�n��/�~�Pd���l�]�������.�����p��,��������K
��������=pg�6d�m�H��1���u��
endstream
endobj
85 0 obj
[
84 0 R
87 0 R
]
endobj
82 0 obj
<<
  /Resources 3 0 R
  /Type /Page
  /MediaBox [0 0 595.275 841.889]
  /CropBox [0 0 595.275 841.889]
  /BleedBox [0 0 595.275 841.889]
  /TrimBox [0 0 595.275 841.889]
  /Parent 1 0 R
  /Annots 85 0 R
  /Contents 88 0 R
>>

endobj
89 0 obj
2544
endobj
91 0 obj
<<
  /Name /Im2
  /Type /XObject
  /Length 92 0 R
  /Filter /FlateDecode
  /Subtype /Image
  /Width 469
  /Height 351
  /BitsPerComponent 8
  /ColorSpace [/ICCBased 5 0 R]
>>
stream
x���}PG����
�C$��C�����D�L4�#M��D��@BL����rW�U��*Z1Q�JU��h�UiE=��p(�<O��%��K���0�������v1��gv��������c�g�az�w^���=}�6�����C�~��_0��C�~��_0��C�~��_0��C�~��_0��C�E�����������|������[�n
����_�KrPf������/�tuu������$tFy������^x��~���C�y���>	�Q^����~u�w����3g���.]�����'�3���������5��������.\������������~^��c��������~����'O~��������I���zw?���X,&)��qB~�����o��������@$����y��MBB����sS�L�<y2h�v�&�~�_DEy����;�dgg��l�������E�TC��/��������m��}�aZZ�4�����M�6�X������+W������h���,�6w����{�I�����7�|�����IMM

r�E[3C�`���W��R^�����^����c�{�w�����,v�������'$$��L�������_|�����is��1,cXXKV[[�����'$$��l6[UU+D�������������|���I)�������J����V�����,622���G�a���{��9�!����;�6���HeJ����������]l��,�������<)ev���i��~?:���)S����R��4, �*q���������]����T��@MMMJJ�|��Y
�~�p���h��2;���@�`�	px4OJ�g�SF�e����t�
�~���8<�'���3�)����~+++�*�f����Uf������Z�{���4,�f�	\�~���$!!!""������b1��������>��3
�~���B�������.]�|�rQQ����~�G ���jX2���=z4�W����f��	����J��j���W�?;�����q��%%%�1�����.���@��#�?�Y�F���_`f4��|�{��r"7�JU���bY���s,����@������;::�-�f�	������//^d��&IQ�W�/K�l������=\J.��|+9��!�����`���g������(>>��+O<�����2�������������R6y����a������������3	p@�"��5[[[�����f�H��	�����*++�{7������g ���h��������������g�V�Z�Uk���m��5..N���oW�
R�

���_�b�������k��3r�%�3C���DVVc�����V�5::z��I���4���6���Y�7�x_^��r��������v�niiihh�N�/)MZ���|���J�h������mmm.$s��&}���R
�w~��w��v�z��9e5bbb���m6���',X���K����fll���W;;;���>��w�_��{��
�_�x�|���f����k��o��6uG9��~���m�<�������=z����ORo<==���~�yKKK)��?�A�S�N����7����������i���j����i��������+^�����B���YZ���n��:#�),,<}����g�.]�UkR?3--M��/p��S�����g�G�_�c����j5�]��}���k����M���^������S�N%''�����D��w�C�w���L�������[�5X�l���([���{���|�������O���Gb��y���6�������iTk�[�����~M��{/\���K/y]�����o{�����s��

���+((p����a�V����V��:��m/���i����Gn	p@�&a��9�|��������u!.Z��N����@# ��$��=����g���v�AT�:�
�$�����3gRC�8q��	p�������;wn��)�'O�m��$���V�����������{������I�>�v��1�/������&�n��%222""���?��E���O?����JKN��@�@`�_������[��Ms0~�x���i��>}:���3����r���c��1���766R�O?����y�P���5��{��q��
��!�~m6{��N�Cn��IQ6l��z�����6T��g�����~��@�b���6mI�G�2e
�?q�D�<���<��m��z�{������/��Cx�
����
�$������G�B� p������E�����_�
,�,���9s����C� ���(**j��u7o�T�B���ms�733��������E��$99y���\,���jR���~�������r'��O�O��r�~kjjRSS����)�Wm\\\LL�{��G���4i��K?a����pz��j	�������o:;��0�644,Y���BT[��5�X�;������7**j���~���~��6������p��1�����h311������_��=�AY�\����mmm�k);%�����~����s���+�������kmm��B�������u�*�|8v��#G�H	�~�������� �/�WWW�I����'�~����_��1��&���������>.��P��ESRR���v�JJJ�6>������no
>P�x�����W��*k�w��}�/h����SWW��r�~
app������G�L���P�@dgg;[�.���u�Z		7n\ee�?�������K��?���4�5	S�Ne�_�?x]��$���~�*/�������K=���f�����3�%t�Lo�z�w���Aos�'��1�^+**jkk����iMt���M����������---


UUU�����!�=d�_y�>{=��C�q�[
�O�����~�e�}Z��~9t���7:;;/^�H>s�L����s���!=�����sd�z���s�b�jj���}Z��~9t�oqq�+�����a���K�C������5+??���C��@g�_��&�/03��+�/03��+�/03��k��o��L��H��5
���H��5
�7 ��(�_0��8�_��~�p��~���
H��5
�7 ��(�_0��8�_��~�p��~uf��Q�b����Z�3	p@�:�����W����_`f �Wgjkk��MKK��p��H������~���g�}�a��/03���?��y���W���_`f �W���5k�h[2��$����qi!%�/03��k�����/n@��!����>}Z�b�_`f �W$\�fkkkvv�7�	p@�"���]]]eee������+��8t���E�rrr���233322�NX�j�V�����u����8�o��]=n!�����r'{��������Q(�>�W�^}��WG�=c�.cOOOIIIBBBDD��O?M{�\�bu����z`�

JNN��g�je�u
UO��@����������n�[ZZ���KJ��,//�<y�k���a����v�~�a�y~q���Ka��B&}�T����_��XXXH);::(KQQ����O����O]}��������b�.�����c�yU�������~9t���7��q��E2��3g��/^�Ik��������~��d��m�]^�Cn�{����}�/�����,L:�>0����,#�9)�<j����(;�r���>��c������������p_��k�zr�f��M�[VT?K���;�M����F"L
u�N�>}����K�j����JKK3��;88�����
�����L��n�]�����E8�m:;zIIIHHE����Z�J*�
><���������W
K�vQC��#<���;�Oo&a����~�-�,[���B������b�
��K�MMM)))>>����6A;sss_~�������,����\�t���+W�|��w��A6lpq��W�R�|@=ag����7�,J�]�v�?��t]C��#<��k��������^z��B����}��s~��l<A�jzO�+��`?��:�~z#����9����_�`W }�������?��'N�8�IP��GQ�J�^�C�Z�~��zry���&L�@�z����9�������H��5	s�����o���{�5y]���$�gddxs�?��u��~M�����=z��Ym�G��Q@��I�9s&5��'��~�0������%K��)�����s�UUU555i����4���e�iY���+!!��811�����V�A�=p���p	��a�����S��*��~��.�1<��{�n��={�X%9�f�"�?~��b���0�����G���;f��+W�[1����+++�_ 0�K�lC����Px�
���
a�=v��D�p	����_�C�����~��@��I`�=z�(������]�jx����Ou��h��9s����_��5�n���7o*cN��u������$�666�~U���L������0BCC���W�X����<��;RSS�Vktt��I��	5v'�Z���III[�nUfq]%�E+$�x������+W�~���6o����cw,�4q�D��"��MNN��s'p�7��3������Q��M��)��Wu-���������������inn��l'O�T>F��|U�[�n�����,���\m�
�#F:t�tJ�5k�P#R�������x��/�?��Oe	D�o������Q�
8	�������/V�R�jz����(���Nm�9RJ�m��
		7nu���(++��Qw����C�e��g�y�^�L�����`�mhh���������K!x����T}4�3��X����:���M�:5++����w��~�:���|U������j��)�������A������?�g���U�l��e���(���>�Y���/]�tJ�z.�~�$��[WW��c��a�F��W��{CfQ�$	~�����Wnry-��o����g������E��!�=��}��I���?���7%1w��������+�;{���U]��r-g��v�������]K��6m�����QYm��}o�J��Vptvv�������
,�������	�C����o���������c�=���*��~9K�Rx������6###O�8A{XR(�����)7��R�\�^����V�~�
}����7{OO��oJb��������:����2��b'ka�|�7o^cc#}'�:uJ9��lU*�����8f���zg�UVIu�
���H��w���|^�
��R9eq����BCC�����r�~�B��kk�,--�Z��&�{��%%%�M.��Z�t1����
����>�#9�_�W�~UW�P]�"\������8:��?�eddP�������P�����v��\O>��e�W-�����h���>����L5�+��-����?�Q�;�;Y�9�&����W��~9H�&���t%��������2���K���{����F�B���]�n(����������m�����?�w�^�o`��������a$���I�:u*��R���B��/	H�C7�����������f!������9�~ummmVV�>�	��@������k/^���o���jjjv{��)//����~�e�}Z��~9t��������������������466�������?�A���@��������W^a�}Y;^z��e�a��%��d�F�(��g���������[k�3�/~z	��H��	��H��	��H��5����k^&��e9V����(�qB�_`,�v�	��Q�_���_�dA;KJJ"""�~�iGT����"�r'���,_�E�]pHg	�]��k�_���W�(,,�b;::����"GT�\�E��j���2�]pp_UX��5��Q��_�W���|�^��vJK?0�QY�rQ�js�_e�v'�5�j���u18�b��@�F�������W�p_�����`�������,_�LV�����]���(N��^���� 77���K.\`C��GT-\�
��(����,_�LV��`o�]�	�k�_�d���]TTO�=�����Z�rQ��
>(�W�
	�v!���.<�5
��hV�`@����5��y"==]��K�\��rL���Wg���9�VVVjUx����*8��u33L���Wgjkk��MKK��p�W+�����o��.������~���g�}�a���V@�����a��k�_�������������/�����������i����2V��y�!!!���KJJ1b����XEEE#G���I�s�=o����L�/fiSu����?�Ja�G��=��X8�b��k�_�����Y�m���~�]8p��D�511��$$$�=J;Qluu5K�B�����Y����#��K"66������?�\R	?���H�^�:::�Y]���� �9�����!w/�]��W��]�>.-�����l6�J�����=UUU��e�����������[�L���X���7����;V����������zgE����t��~����Y�����in�B~~�?N�!�ew�)))���]>p����t�/O�B����j�{��%%%�%�Q�V+7� /��%����?~<�iOzz::�L����jkkO�>�y���Wu���]�<���2�( �.<�	���2�����W*w�exQ@�rL���W$����n����T��.<A7�.Z�('''+++333##���U�V������~����_�(�U'��a����={��^�a�t��i���~���������������
8����iMm�T��wV=����R��������A���	�]�F7���q�������d�3g�4',^�X���}��c��������w�}�.�"J�������I�&�5HY��n4��w�4�A�%�� ��(�Z���gg�N������0�~M>k��/�]�F7���05����O�>{����K�i��B�����Yx��1��
��M[P������K�TN|��T�6p�$�%\��=*W���Y|�A>x����vqW���I�?���~K�e��y]���N�@�%���i��+���M|��Y������bH��D	��Gp�N����"�
&^L�p��$���.\x����.�����/��(���(�-p�jkk}�Q6����� ������6��/+J9Q����U'X����6��`��'@�&a��9�|����O�9����78
��i�����g=z�����|p����~M���3�!N�8�� 1�K�;yyy�P�6U�Q���b=M��=�\UUUSS�y����@�b�7��e���VMfr�8p��r������+W��0����e�d����N����C��5���~��@��G�����R�n��S�������5��{��q��
��!�~�����+�@�kkkc+��d���7n�`a�Io���[����:���={vee%���C����X�Oo�eX2��{�O`������c��A�@T�_1��z,������c6	�k�~�=j~�?~�{�
�`6	����H���x��9����������b?��F��Xnrr�_���������L�occ�������bj?�@�"��^���Ibc���m�z�w�����oCC����a�?��r�������JN��@MMMjj*u���9p��`�+���L�$�%K�x]��---e�(..N�@W\�r��_��"���2�����\�������mkk��k�V�c���_}�������(���_���CQ�~9D���=�������u���Z�_g������~���k3�Wy`L�<Y�C��~9��~��)jjjRRR�
>8+���~���gr�J#��5@	��!�~�`pp�����0�"������#f��|����/h~ *&���[�ZK��7����������~����_���AjM�D�l0�a�_�1u�T���������H�C7��U^|����;�zz�����3g;g�K�@��B����yw���BOx��g��TTT���fee�����/�~9t�owww{{{kkkKKKCCCUU��!*=�b�r�?�W����^���{�V��S������n��E����x����~o�����y��E2��3g���c���9
���#C�{H���C#WSs��������/�~9t�oqq�+���_��\��t��%�!G����g���������[k������Oo"�3	p@�"�3	p@�"�3	p@�"�3	p@�"����b1�
>a��C��H@��cf}I�����	p@�"�����%���+��w��%�l��M��H�U���"H���f

JMM=x� ���"_��K��uv8������#F����K���h���R�;v�������y��wU3�=QQQ+W�T�})J�_�Xz}���(��3����)S|l�l����$����n��R������2p���~�!<<\�����������H{����
�~)�����w�_:��?���}�;��X��������9���1c�0m����f\�zutt4�<���Q�Q������{%K��g^�~��8�_�O�R��B�%���Q��tj���U�oMM
����R�/�e���y8I���$H�7n���;v,m�k}}��JFFF677S�+@|��u�,�^z�!�a�������H��	��+_���U�/��OII����p����N>, ?W�{������6������$n<A��j�Z�	�o_���.�*�e�

(����{p��$����`���`�'���{��t8��2��O?�t����!N�<���x
$�����P5�e��36�+��c�5�;�����5U��	(�~E�f��~E�f��~E�f��~E�f��~E�f��~E�f��~E�f��~E�f��~E�Ek���fgg{s����+�����UVV��iV�H��	�5����n����;v���Z�����I�&Y��>�������$:��GJp������NV&�/..f�7�x��|�������+V�����\Q��(����~EB������'OVJ�bbb���m6���',X�v��e�[�nQ5(�]�d�O>�����O?�����^�z�$y�}�9����`[[����������$�����5�~�m�1r�U�J��rS�N�����y�������*�K�������d���;}�t�1c��%//o��M�����W���:::F��������&p���
�k���;@�"��fKKKZZ����k��m��}��������Mc;9�JP2.����AAA(@�,��S������Ta7��Z�<���@ �W$�����Mw�Z�w��y���6������
����[3fLWW�<��w���y��
R��:�;V$7��Z�<���@ �W$����}�F���~322bcc�V+uV��K��C��O>I��K��8q�Dz�t@�Gy����3����������T��gw�$�������p����k��8�_�f������<@��H�_ $��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/0�����]�}���W$��/f�g�g#**j��u7o�T�B��H�_8�e�����w����B��H@��X,w2}�t�4J���+����:.V��bSSS��������i�\��{<�<��O��N��`K�A�@�"���oBBBuuuoo���fW�^M������_.
��}���Z[[����+~�oMM��f#	3�FFF677�������`i�7.
�xCCC�������XH��	����MII�>���Z�Vi3==]�-������uuu����+��/0������.g���+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+�/03��+��;v���Z�����I�&���J�RX��r'lOhhh||��+zzzXbz�yyy�P�6(�w�}7e����'S�6i�T�k���<(�H�����oLLLss��f;y����n�W5<88����p�B20������������}�QdddDD��-[hs��ER!qqq---v���@��H���M�6-��m���s��S�feem��y�����_��uk�n��:::F�������/_fa20���X�2--m����������B�n����nwO�>�����&v�rH�N�_�����k��o��v�Z2���������W
+������G������B���������_�	p@�"��7o^cc��f;u�TLL���Q��G>�����k�W�	����v���B���J���V��7�655���@�~��~E" ����AT`rr2��U�o]]��	���~��#G�(��~z���+((�~z�u�~z������m(0e�e���~�
$���D@�[ �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$\�fkkkvv�7�	p@�"���]]]eee���XG$�����}}}[�n���������p���IIITi�T�;v���Z�����I�&Y��\�r�����S�Le��M���(���W�^}��WG�Mu�1c�������$!!!""����vV%g@yB�����������@v]C
���k�j��oP�W�\I������zj����5�
H��	yk���O�<����;��[��>111v�hgss��f;y������|�	����~��Le�\$��� 77��v�g��ER����_�u.caa!�����,EEE.���q����X���}!����������u]C
�wkKK�jMF�q��!20���Y���988�Z��8�_�����o���w�K�m�����������j�e����S�feeQl�����_��bL�>�%�z}�2�n��|x��e&�R��Jfs�����R��JQQQ�1E���r�R/����}���!�. ==]�&+V� S�������{Q��G6m���g	��~E�kM�������X�v�]��k��m�������M�r�������8�hS�Le��
�7U�K;����T�B����<tIIIHHE����Z�J~B(��>�:�<00@z��Tm5�uh�q*��y���LJJ:x����+�X��&�	p@�"�l���n��_2�[o�E�1c�tuuqG�7o^cc��f;u��r$��w���v��
�e���o^^}/;`�377���_����j(�VPP@	.]�t�����*�������POX�>�=���#R�y��]���g;]�����RRR���n1���(@�ecz~8 �W$����}�����~��;�'�|��K�.�:����Z�������T��'R��y��2�.�+?��-�����8''g������r��������������x�	gUR~����R�������_����uuu&L�c=���G�ai\�P
���z>��#�������������$�������.��kx���3-��+�y��jG�
H�����_`~ �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`,t������5�1��H��_L+���FTT��u�n�����8�_�0P�p2���~'''������8�_��~��X�d����i�b!�W$T[Su1OW��bSSS����g���iJO��?Q�R-��*����/}�J��*��H������P]]���{������WGGGS8$$�~g�����{{����Vy,$����_�[SSc��H������������l388XZ����&������2��w9��+��/���"|(--�Z��fzz�$[.
�����Tc!�W$��_`,���l
>U �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f �W$�_`f ���h��������������g��;wncc���g���_���k��8<�`}�owww{{{kkkKKKCCC�tb���I�SM'��k��8<�`}���7:;;/^�H�:s�L3�tb���I�SM'��k��8<�`}���f#��Q������?�������L��N�o�
	pxt������~*�������+�����K'�N5�p��4�����E���!��N���~�����]���C�~��_0��C�~��_0��C�~��_0��CP���?�x
�
endstream
endobj
92 0 obj
13245
endobj
93 0 obj
[Attachment: "High Level Design for Logical Replication in Postgres" (PDF, 14 pages). Table of contents: 1. Introduction; 1.1. Previous discussions; 1.2. Changes from v1; 2. Existing approaches to replication in Postgres; 2.1. Trigger based Replication; 2.2. Recovery based Replication; 3. Goals; 4. New Architecture; 4.1. Overview; 4.2. Schematics; 4.3. WAL enrichement; 4.4. WAL parsing & decoding; 4.5. TX reassembly; 4.6. Snapshot building; 4.7. Output Plugin; 4.8. Setup of replication nodes; 4.9. Disadvantages of the approach; 4.10. Unfinished/Undecided issues.]

#34md@rpzdesign.com
md@rpzdesign.com
In reply to: Andres Freund (#33)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

Andres, nice job on the writeup.

I think one aspect you are missing is that there must be some way for
the multi-masters to re-stabilize their data sets and quantify any
data loss. You cannot do this without some replication intelligence in
each row of each table, so that no matter how disastrous the
hardware/internet failure in the cloud, the system can HEAL itself and
keep going, with no human beings involved.

I am laying down a standard design pattern of columns for each row
(see the sketch below):

MKEY - primary key guaranteed unique across ALL nodes in the CLOUD,
with NODE information IN THE KEY (A876543 vs B876543 or whatever), so
keys stay unique whether the network link is UP or DOWN
CSTP - create time stamp, based on unix time stamp
USTP - last update time stamp, based on unix time stamp
UNODE - node that updated this record
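
Written out as a C struct just for concreteness (the struct is only an
illustration of the pattern; in a real schema these are plain table
columns):

#include <stdint.h>

/* illustration only: per-row replication metadata for self-healing */
typedef struct RowReplicationMeta
{
    char    mkey[32];   /* unique across ALL nodes, node id embedded in
                         * the key, e.g. "A876543" vs "B876543" */
    int64_t cstp;       /* create time stamp (unix time) */
    int64_t ustp;       /* last update time stamp (unix time) */
    char    unode[8];   /* node that last updated this record */
} RowReplicationMeta;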

Many applications already need the above information; we might as well
standardize it so external replication logic can self-heal.

PostgreSQL tables have optional 32-bit int OIDs; you may want to
consider a replication version of the OID (an ROID, or replication
object ID) and then externalizing the primary key generation into a
loadable UDF.

Of course, ALL the nodes must be in contact with each other, not
allowing significant drift in their clocks while operating (NTP is a
starter).

I just do not know of any other way to add self-healing without the
above information, regardless of whether you hold up transactions for
synchronous operation or let them pass through asynchronously, and
regardless of whether you are getting your replication data from the
WAL stream or through the client libraries.

Also, your replication model does not really discuss replication
operations across a busted link; where is the intelligence for that in
the operation diagram?

Every time you package up replication into the core, someone has to
tear into that pile to add some extra functionality, so definitely
think about providing sensible hooks so that extra bit of
customization can override the base function.

Cheers,

marco

On 9/22/2012 11:00 AM, Andres Freund wrote:

This time I really attached both...

#35Peter Geoghegan
peter@2ndquadrant.com
In reply to: Andres Freund (#9)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:

(0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those that are unfamiliar with the BDR project,
a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made
available a design document that discusses aspects of this patch in
particular, in another thread [2]. That document, "High level design
for logical replication in Postgres", emphasizes source data
generation in particular: generalising how PostgreSQL's WAL stream is
generated to represent the changes it describes logically, without
pointers to physical heap tuples and so forth (data generation as a
concept is described in detail in an earlier design document [3]).
This particular patch can be thought of as a response to the earlier
discussion [4] surrounding how to solve the problem of keeping system
catalogs consistent during WAL replay on followers: "catalog time
travel" is now used, rather than maintaining a synchronized catalog at
the decoding end. Andres' git tree ("xlog-decoding-rebasing-cf2"
branch) [5] provides additional useful comments in commit messages (he
rebases things such that each commit represents a distinct piece of
functionality/patch for review).

This patch is not strictly speaking an atomic unit. It is necessary to
apply all 8 patches in order to get the code to compile. However, it
is approximately an atomic unit, one that represents a subdivision of
the entire BDR patch for which it is manageable and logical to write a
discrete review. This is idiomatic use of git-format-patch, but it is
unusual enough within our community for me to feel the need to note
these facts.

I briefly discussed this patch with Andres off-list. His feeling is
that the review process ought to focus on the design of WAL decoding,
including how it fits within the larger set of replication features
proposed for 9.3. There are a number of known omissions in this patch.
Andres has listed some of these above, and edge-cases and so on are
noted next to XXX and FIXME comments in the patch itself. I am
inclined to agree with Andres' view that we should attempt to solidify
community support for this prototype patch's design, or some variant,
before fixing the edge-cases and working towards committable code. I
will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my
understanding of the patch, which you may need to correct), and some
points of concern.

* applycache module which reassembles transactions from a stream of interspersed changes

This is what the design doc [2] refers to as "4.5. TX reassembly".

This functionality is concentrated in applycache.c. As [2] notes, the
reassembly component "was previously coined ApplyCache because it was
proposed to run on replication consumers just before applying changes.
This is not the case anymore. It is still called that way in the
source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.
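
To make sure we're on the same page, my mental model of reassembly is
roughly the following sketch (hand-written for illustration; the names
and representation here are made up, and the patch's actual data
structures differ):

#include <stdlib.h>

typedef unsigned int TransactionId;

typedef struct Change
{
    struct Change *next;
    /* decoded tuple data would live here */
} Change;

typedef struct Txn
{
    TransactionId xid;
    Change       *head;         /* changes in WAL order for this xid */
    Change       *tail;
    struct Txn   *next;
} Txn;

static Txn *txns;

/* bucket one decoded change under its originating xid */
static void
add_change(TransactionId xid, Change *c)
{
    Txn *t;

    for (t = txns; t && t->xid != xid; t = t->next)
        ;
    if (!t)
    {
        t = calloc(1, sizeof(Txn));
        t->xid = xid;
        t->next = txns;
        txns = t;
    }
    c->next = NULL;
    if (t->tail)
        t->tail->next = c;
    else
        t->head = c;
    t->tail = c;
}

/* only at commit does the consumer see t->head..t->tail as one stream */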

I meant to avoid talking about anything other than the bigger picture
for now, but I must ask: Why the frequent use of malloc(),
particularly within applycache.c? The obvious answer is that it's
rough code and that that will change, but that still doesn't comport
with my idea about how rough Postgres code should look, so I have to
wonder if there's a better reason.

applycache.c has an acute paucity of comments, which makes it really
hard to review well. [2] doesn't have all that much to say about it
either. I'm going to not comment much on this here, except to say that
I think that the file should be renamed to reassembly.c or something
like that, to reflect its well-specified purpose, and not how it might
be used. Any cache really belongs in src/backend/utils/cache/ anyway.

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

By the way, I see that you're doing this here:

+ /* most callers don't need snapshot.h */
+ typedef struct SnapshotData *Snapshot;

Tom, Alvaro and I had a discussion about whether or not this was an
acceptable way to reduce build dependencies back in July [8] – I lost
that one. You're relying on a Gnu extension/C11 feature here (that is,
typedef redefinition). If you find it intensely irritating that you
cannot do this while following the standard to the letter, you are not
alone.
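
For anyone following along at home, the problem is just this (a
stand-alone illustration, not the patch's actual headers):

/* two headers each want Snapshot without including snapshot.h */
typedef struct SnapshotData *Snapshot;  /* first declaration: always fine */
typedef struct SnapshotData *Snapshot;  /* identical redefinition: OK in C11
                                         * and as a GNU extension, an error
                                         * under C89/C99 */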

* snapbuilder which builds catalog snapshots so that tuples from wal can be understood

This component analyses xlog and builds a special kind of Snapshot.
This has been compared to the KnownAssignedXids machinery for Hot
Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
meant by this). Since decoding only has to occur within a single
backend, I guess it's sufficient that it's all within local memory (in
contrast to the KnownAssignedXids array, which is in shared memory).

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail. It's
really hard to guess what you might have meant to do, and why you
might have done it, much less verify the code's correctness.

This functionality is concentrated in snapbuild.c. A comment in decode.c notes:

+ *    Its possible that the separation between decode.c and snapbuild.c is a
+ *    bit too strict, in the end they just about have the same struct.

I prefer the current separation. I think it's reasonable that decode.c
is sort of minimal glue code.

* wal decoding into an applycache

This functionality is concentrated in decode.c (not applycache.c –
decoding just calls those functions).

Decoding means breaking up individual XLogRecord structs, and storing
them in an applycache (applycache does this, and stores them as
ApplyCacheChange records), while building a snapshot (which is needed
in advance of adding tuples from records). It can be thought of as the
small piece of glue between applycache and snapbuild that is called by
XLogReader (DecodeRecordIntoApplyCache() is the only public function,
which will be called by many xlogreader_state.finished_record-hooked
plugin functions in practice, including this example one). An example
of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.

By the way, why not just use struct assignment here?:

+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
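
That is, given that both sides are complete RelFileNode lvalues, plain
assignment should be equivalent, and the compiler gets to type-check it:

change->relnode = xlrec->target.node;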

* decode_xlog(lsn, lsn) debugging function

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi). I think that such a
module could even be useful to people that were just a bit
intellectually curious about how WAL works, which is something I'd
like to encourage. Besides, simply having this code in a module will
more explicitly demarcate client code (just look at logicalfuncs.c –
it is technically client code, but that's too subtle right now).

I don't like this code in decode_xlog():

+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;

Why is it necessary to cast here? In other words, why is private_data
a void pointer at all? Are we really well-served by presuming
absolutely nothing about XLogReader's state? Wouldn't an “abstract
base class” pointer be a more appropriate type for private_data? I
don't think it's virtuous to remove type-safety any more than is
strictly necessary. Note that I'm not asserting that you shouldn't do
this – I'm merely asking the question. When developing a user-facing
API, it is particularly crucial to make interfaces easy to use
correctly and hard to use incorrectly.
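
To sketch the alternative I have in mind (names entirely hypothetical,
not a concrete proposal):

/* hypothetical: a common header every private_data must embed */
typedef struct XLogReaderPrivate
{
    int magic;                      /* tag XLogReader could sanity-check */
} XLogReaderPrivate;

typedef struct ReaderApplyState
{
    XLogReaderPrivate base;         /* must be the first member */
    ApplyCache *apply_cache;
    /* ... */
} ReaderApplyState;

XLogReader would then carry an XLogReaderPrivate * instead of a void *,
and plugin code would downcast only after checking base.magic.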

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit

These are callbacks intended to be used by third-party modules,
perhaps including a full multi-master replication implementation
(though this patch isn't directly concerned with that), or even a
speculative future version of a logical replication system like Slony.
See [2] for the hook types involved (incidentally, I don't like the
name of these types – I think you should lose the CB):

+/* XXX: were currently passing the originating subtxn. Not sure thats
necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache,
ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);

So we register these callbacks like this in the patch:

+	/*
+	 * allocate an ApplyCache that will apply data using lowlevel calls
+	 * without type conversion et al. This requires binary compatibility
+	 * between both systems.
+	 * XXX: This would be the place too hook different apply methods, like
+	 * producing sql and applying it.
+	 */
+	apply_cache = ApplyCacheAllocate();
+	apply_cache->begin = decode_begin_txn;
+	apply_cache->apply_change = decode_change;
+	apply_cache->commit = decode_commit_txn;

The decode_xlog(lsn, lsn) debugging function that Andres has played
with [6] (that this patch makes available, for now) is where this code
comes from.

Whenever ApplyCache calls an "apply_change" callback for a single
change (that is, an INSERT|UPDATE|DELETE) it locally overrides the
normal SnapshotNow semantics used for catalog access with a previously
built snapshot. Behaviour should now be consistent with a normal
SnapshotNow snapshot acquired when the tuple change was originally
written to WAL.
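
My reading is that each invocation is therefore shaped roughly like
this (a paraphrase in pseudo-C; these particular function names are
invented, not the patch's):

/* paraphrase only; the helper names here are invented */
Snapshot saved = GetCurrentCatalogSnapshot();   /* normal SnapshotNow state */
SetDecodingSnapshot(snapstate->snapshot);       /* time-travel snapshot */
apply_cache->apply_change(apply_cache, txn, subtxn, change);
RestoreCatalogSnapshot(saved);                  /* back to normal semantics */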

Having commented specifically on modules that Andres highlighted, I'd
like to highlight one myself: tqual.c. This module has had significant
new functionality added, so it would be an omission not to remark on
it in this opening e-mail, having mentioned all the other modules with
significant new pieces of functionality. The file has had new utility
functions added that pertain to snapshot visibility during decoding -
"time travel".

I draw attention to this. This code is located within the new function
HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is
done for "dirty snapshots" (dirty snapshots are used with
ri_triggers.c, for example, when even uncommitted tuples should be
visible). Both of these functions are generally accessed through
function pointers. Anyway, here's the code:

+ 	/*
+ 	 * FIXME: The not yet existing decoding infrastructure will need to force
+ 	 * the xmin to stay lower than what they are currently decoding.
+ 	 */
+ 	bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished
patch. This FIXME is rather hard for me to take at face value. It
seems to me that the need to coordinate decoding with xmin horizon
itself represents a not insignificant engineering challenge. So, with
all due respect, it would be nice if I wasn't asked to make that leap
of faith. The xmin horizon prepared transaction hack needs to go.

Within tqual.h, shouldn't you have something like this, but for time
travel snapshots during decoding?:

#define InitDirtySnapshot(snapshotdata) \
((snapshotdata).satisfies = HeapTupleSatisfiesDirty)
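
For instance (the macro name is mine, mirroring the dirty-snapshot one):

#define InitTimeTravelSnapshot(snapshotdata) \
((snapshotdata).satisfies = HeapTupleSatisfiesMVCCDuringDecoding)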

Alternatively, adding a variable to these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

In any case, assigning this hook in snapbuild.c looks like a
modularity violation to me. See also my observations on initialising
ReaderApplyState below.

My general feeling is that the code is very under-commented, and in
need of a polish, though I'm sure that you are perfectly well aware of
that. The basic way all of these components that I have described
separately fit together is: (if others want to follow this, refer to
decode_xlog())

1. Start with some client code “output plugin” (for now, a throw-away
debugging function “decode_xlog()”).

2. Client allocates an XLogReaderState. (This module is a black box to
me, though it's well encapsulated so that shouldn't matter much.
Heikki is reviewing this [7]. Like I said, this isn't quite an atomic
unit I'm reviewing.)

3. Plugin registers various callbacks (within logicalfuncs.c). These
callbacks, while appearing in this patch, are mostly NO-OPS, and are
somewhat specific to XLogReader's concerns. I mostly defer to Heikki
here.

4. Plugin allocates an “ApplyCache”. Plugin assigns some more
callbacks to “ApplyCache”. This time, they're the aforementioned 3
apply cache functions.

5. Plugin assigns this new ApplyCache to a variable within the private
state of the XLogReader (this private state is a subset of its main
state, and is opaque to XLogReader).

6. Finally, plugin calls XLogReader(main_state).

7. At some point during its magic, XLogReader calls the hook
registered in step 3, finished_record. This is all it does directly
with the plugin, which it makes minimal assumptions about.

8. finished_record (which is logically a part of the “plugin”) knows
what type the opaque private_data actually is. It casts it to an
apply_state, and calls the decoder (supplying the apply_state as an
argument to DecodeRecordIntoApplyCache()).

9. During the first call (within the first record within a call to
decode_xlog()), we allocate a snapshot reader.

10. Builds snapshot callback. This scribbles on our snapshot state,
which essentially encapsulates a snapshot. The state (and snapshot)
changes continually, once per call.

11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an
XLogRecord. Decodes based on record type. Let's assume it's an
XLOG_HEAP_INSERT.

12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We
store the tuple metadata in our ApplyCache (some ilists, somewhere,
each corresponding to an XID). We don't store the relation oid,
because we don't know it yet (only the relfilenode is known from WAL).

13. We're back in XLogReader(). It calls the only callback of interest
to us covered in step 3 (and not of interest to XLogReader()/Heikki) –
decode_change(). It does this through the apply_cache.apply_change
hook. This happens because we encounter another record, this time a
commit record (in the same codepath as discussed in step 12).

14. In decode_change(), the actual function that raises the
interesting WARNINGs within Andres' earlier example [6], showing
actual integer/varchar Datum values for tuples previously inserted.
Resolve the table oid based on relfilenode (albeit unsatisfactorily).
Using a StringInfo, tupledescs, syscache and typcache, build the
WARNING string.

So my general opinion of how all this fits together is that it isn't
quite right. Problems include:

* Why does the ReaderApplyState get magically initialised in two
stages? apply_cache is initialised in decode_xlog (or whatever
plugin). Snapstate is allocated within DecodeRecordIntoApplyCache()
(with a dependency on apply_cache). Shouldn't this all be happening
within a single function? As you yourself have pointed out, not everyone
needs to know about these snapshots.

* Maybe I've missed something, but I think you need a more
satisfactory example plugin. What happens in step 14 is plainly
unacceptable. You haven't adequately communicated to me how this is
going to be used in logical replication. Maybe I just haven't got that
far yet. I'm not impressed by the InvalidateSystemCaches() calls here
and elsewhere.

* Please break out the client code as a contrib module. That
separation would increase the readability of the patch.

That's all I have for now...

[1]: http://archives.postgresql.org/message-id/CA+U5nMLk-Wt806zab7SJ2x5X4pqC3WE-hFctONakTqSAgbqTYQ@mail.gmail.com

[2]: http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[3]: http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com

[4]: http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com

[5]: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

[6]: http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com

[7]: http://archives.postgresql.org/message-id/5056DFAB.3050707@vmware.com

[8]: http://archives.postgresql.org/message-id/CAEYLb_Uvbi9mns-uJWUW4QtHqnC27SEyyNmj1HKFY=5X5wwdgg@mail.gmail.com

#36Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#35)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:

On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:

(0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those that are unfamiliar with the BDR project,
a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made

Does this design allow replication/communication between clusters running
different major versions of Postgres?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#37Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#36)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

Does this design allow replication/communication between clusters running
different major versions of Postgres?

Just catching up on your email, hmmm?

Yes, that's part of the design 2Q presented.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#38anarazel@anarazel.de
andres@anarazel.de
In reply to: Bruce Momjian (#36)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

Bruce Momjian <bruce@momjian.us> schrieb:

On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:

On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:

(0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those that are unfamiliar with the BDR project,

a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made

Does this design allow replication/communication between clusters
running different major versions of Postgres?

This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capability to generate a correct textual format, and thus can be used to build a solution with support for cross-version/arch replication, as long as the text format of the used types is compatible.

Does that answer the question?

Andres

--- 
Please excuse the brevity and formatting - I am writing this on my mobile phone.
#39Bruce Momjian
bruce@momjian.us
In reply to: anarazel@anarazel.de (#38)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thu, Oct 11, 2012 at 01:34:58AM +0200, anarazel@anarazel.de wrote:

Bruce Momjian <bruce@momjian.us> schrieb:

On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:

On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:

(0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those that are unfamiliar with the BDR project,

a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made

Does this design allow replication/communication between clusters
running different major versions of Postgres?

This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capability to generate a correct textual format, and thus can be used to build a solution with support for cross-version/arch replication, as long as the text format of the used types is compatible.

Does that answer the question?

Yes. This was posted so long ago I couldn't remember if that was part
of the design.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#40Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#35)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining. With streaming
replication as it exists today, a transaction that modifies a huge
amount of data (such as a bulk load) can be applied on the standby as
it happens. The rows thus inserted will become visible only if and
when the transaction commits on the master and the commit record is
replayed on the standby. This has a number of important advantages,
perhaps most importantly that the lag between commit and data
visibility remains short. With the proposed system, we can't start
applying the changes until the transaction has committed and the
commit record has been replayed, so a big transaction is going to have
a lot of apply latency.

Now, I am not 100% opposed to a design that surrenders this property
in exchange for other important benefits, but I think it would be
worth thinking about whether there is any way that we can design this
that either avoids giving that property up at all, or gives it up for
the time being but allows us to potentially get back to it in a later
version. Reassembling complete transactions is surely cool and some
clients will want that, but being able to apply replicated
transactions *without* reassembling them in their entirety is even
cooler, and some clients will want that, too.

If we're going to stick with a design that reassembles transactions, I
think there are a number of issues that deserve careful thought.
First, memory usage. I don't think it's acceptable for the decoding
process to assume that it can allocate enough backend-private memory
to store all of the in-flight changes (either as WAL or in some more
decoded form). We have assiduously avoided such assumptions thus far;
you can write a terabyte of data in one transaction with just a
gigabyte of shared buffers if you so desire (and if you're patient).
Here's you making the same point in different words:

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.
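
To put a finer point on it, I'd expect accounting of roughly this shape
(purely a sketch; none of these names or fields exist in the patch
today):

/* sketch: bound decoder memory, spilling the largest transaction
 * to a temp file once a work_mem-like budget is exceeded */
static void
applycache_add_change(ApplyCache *cache, Txn *txn, Change *change)
{
    txn_append(txn, change);                  /* hypothetical helpers */
    txn->mem_used += change_size(change);
    cache->mem_used += change_size(change);

    while (cache->mem_used > cache->mem_limit)
    {
        Txn *biggest = largest_txn(cache);

        spill_changes_to_tempfile(biggest);   /* serialize, remember offsets */
        cache->mem_used -= biggest->mem_used;
        biggest->mem_used = 0;
    }
}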

Second, crash recovery. I think whatever we put in place here has to
be able to survive a crash on any node. Decoding must be able to
restart successfully after a system crash, and it has to be able to
apply exactly the set of transactions that were committed but not
applied prior to the crash. Maybe an appropriate mechanism for this
already exists or has been discussed, but I haven't seen it go by;
sorry if I have missed the boat.

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#40)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Wed, Oct 10, 2012 at 09:10:48PM -0400, Robert Haas wrote:

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.

The review was _HUGE_! :-O

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

regards, tom lane

#43Greg Stark
stark@mit.edu
In reply to: Tom Lane (#42)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

Firstly, I haven't read the code but I'm confident it doesn't make the
elementary error of assuming commit order == xid order. I assume it's
applying the reassembled transactions in commit order.

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier, but it doesn't matter when
trying to reproduce the effects using constants. The data written by
this transaction is either written or not when the commit happens and
it's all written or not at that time. Even in non-serializable mode
updates take row locks and nobody can see the data or modify it until
the transaction commits.

I have to say I was curious about Robert's point as well when I read
Peter's review. Especially because this is exactly how other logical
replication systems I've seen work too and I've always wondered about
it in those systems. Both MySQL and Oracle reassemble transactions and
don't write anything until they have the whole transaction
reassembled. To me this always struck me as a bizarre and obviously
bad thing to do though. It seems to me it would be better to create
sessions (or autonomous transactions) for each transaction seen in the
stream and issue the DML as it shows up, committing and cleaning each
up when the commit or abort (or shutdown or startup) record comes
along.

I imagine the reason lies with dealing with locking and ensuring that
you get the correct results without deadlocks when multiple
transactions try to update the same record. But it seems to me that
the original locks the source database took should protect you against
any problems. As long as you can suspend a transaction when it takes a
lock that blocks and keep processing WAL for other transactions (or an
abort for that transaction if that happened due to a deadlock or user
interruption) you should be fine.

--
greg

#44Andres Freund
andres@2ndquadrant.com
In reply to: Peter Geoghegan (#35)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

Hi,

First off: Awesome review.

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:

What follows is an initial overview of the patch (or at least my
understanding of the patch, which you may need to correct), and some
points of concern.

* applycache module which reassembles transactions from a stream of
interspersed changes

This is what the design doc [2] refers to as "4.5. TX reassembly".

This functionality is concentrated in applycache.c. As [2] notes, the
reassembly component "was previously coined ApplyCache because it was
proposed to run on replication consumers just before applying changes.
This is not the case anymore. It is still called that way in the
source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture
for now, but I must ask: Why the frequent use of malloc(),
particularly within applycache.c? The obvious answer is that it's
rough code and that that will change, but that still doesn't comport
with my idea about how rough Postgres code should look, so I have to
wonder if there's a better reason.

Several reasons, not sure how good:
- part of the code (was) supposed to be runnable on a target system without a
full postgres backend around
- all of the allocations would basically have to be in TopMemoryContext or
something equally long-lived, as we don't have a transaction context or
anything like that
- the first revision showed that allocating memory was the primary bottleneck,
so I added a small allocation cache to the critical pieces, which solved that.
After that there seems to be no point in using memory contexts for that kind
of memory anymore.
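
To illustrate, the allocation cache boils down to a free list, roughly like
this (illustrative names, not the actual patch code):

#include <stdlib.h>

typedef struct CachedChange
{
	struct CachedChange *next;	/* free-list link */
	/* ... decoded change payload would live here ... */
} CachedChange;

static CachedChange *change_freelist = NULL;

static CachedChange *
change_alloc(void)
{
	CachedChange *change = change_freelist;

	if (change != NULL)
		change_freelist = change->next;	/* reuse a cached entry */
	else
		change = malloc(sizeof(CachedChange));
	return change;
}

static void
change_free(CachedChange *change)
{
	/* put the entry back on the free list instead of free()ing it */
	change->next = change_freelist;
	change_freelist = change;
}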

applycache.c has an acute paucity of comments, which makes it really
hard to review well.

Working on that. I don't think its internals are really all that interesting
atm.

I'm going to not comment much on this here, except to say that
I think that the file should be renamed to reassembly.c or something
like that, to reflect its well-specified purpose, and not how it might
be used.

Agreed.

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

Yes.

By the way, I see that you're doing this here:

+ /* most callers don't need snapshot.h */
+ typedef struct SnapshotData *Snapshot;

Tom, Alvaro and I had a discussion about whether or not this was an
acceptable way to reduce build dependencies back in July [8] – I lost
that one. You're relying on a GNU extension/C11 feature here (that is,
typedef redefinition). If you find it intensely irritating that you
cannot do this while following the standard to the letter, you are not
alone.

Yuck :(. So I will just use struct SnapshotData directly instead of a
typedef...
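
The problem in miniature (the function in the workaround is a made-up name):

/* Repeating a typedef is a GNU extension/C11 feature; C99 rejects it: */
typedef struct SnapshotData *Snapshot;	/* in snapshot.h */
typedef struct SnapshotData *Snapshot;	/* repeated in another header: error */

/* The strictly conforming alternative: forward-declare the struct tag,
 * which may be repeated freely, and spell the pointer type out. */
struct SnapshotData;
extern void SomeDecodingFunction(struct SnapshotData *snapshot);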

* snapbuilder which builds catalog snapshots so that tuples from wal can
be understood

This component analyses xlog and builds a special kind of Snapshot.
This has been compared to the KnownAssignedXids machinery for Hot
Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
meant by this). Since decoding only has to occur within a single
backend, I guess it's sufficient that it's all within local memory (in
contrast to the KnownAssignedXids array, which is in shared memory).

I haven't found a convincing argument to share that information itself. If we
want to parallelise this I think sharing the snapshots should be sufficient.
Given the snapshot only covers system catalogs, the change rate should be fine.

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail. It's
really hard to guess what you might have meant to do, and why you
might have done it, much less verifying the code's correctness.

Will concentrate on finishing that document.

This functionality is concentrated in snapbuild.c. A comment in decode.c
notes:

+ * Its possible that the separation between decode.c and snapbuild.c is a
+ * bit too strict, in the end they just about have the same struct.

I prefer the current separation. I think it's reasonable that decode.c
is sort of minimal glue code.

Good.

Decoding means breaking up individual XLogRecord structs, and storing
them in an applycache (applycache does this, and stores them as
ApplyCacheChange records), while building a snapshot (which is needed
in advance of adding tuples from records). It can be thought of as the
small piece of glue between applycache and snapbuild that is called by
XLogReader (DecodeRecordIntoApplyCache() is the only public function,
which will be called by many xlogreader_state.finished_record-hooked
plugin functions in practice, including this example one). An example
of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.
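
As far as I can tell, the dispatch boils down to something of this shape (my
sketch of the structure, not the patch's exact code):

static void
DecodeRecord(ReaderApplyState *state, XLogRecord *record)
{
	switch (record->xl_rmid)
	{
		case RM_HEAP_ID:
			if ((record->xl_info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT)
				DecodeInsert(state, record);	/* queues an ApplyCacheChange */
			break;
		default:
			/* physical records (index changes, full-page images, ...) are
			 * of no interest to logical decoding */
			break;
	}
}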

By the way, why not just use struct assignment here?:

+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));

At the time I initially wrote the code that seemed to be the project standard.
I think this only recently changed.
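
For reference, the struct-assignment form is simply:

	/* equivalent to the memcpy(), but type-checked by the compiler */
	change->relnode = xlrec->target.node;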

* decode_xlog(lsn, lsn) debugging function

You consider this to be a throw-away function that won't ever be
committed.

Only in its current hacked-together form.

However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi). I think that such a
module could even be useful to people that were just a bit
intellectually curious about how WAL works, which is something I'd
like to encourage. Besides, simply having this code in a module will
more explicitly demarcate client code (just look at logicalfuncs.c –
it is technically client code, but that's too subtle right now).

I definitely think we need something like this, but it might look a bit
different. For reasons you found later (xmin horizon) I don't think we can run
it in the context of a normal backend for one.

I don't like this code in decode_xlog():

+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;

Why is it necessary to cast here? In other words, why is private_data
a void pointer at all?

The reason is that for a long time I hoped to keep ApplyCache generic enough
to be usable outside of this exact scenario. I am not sure if that's a
worthwhile goal anymore, especially as there are some layering violations now.

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit

These are callbacks intended to be used by third-party modules,
perhaps including a full multi-master replication implementation
(though this patch isn't directly concerned with that), or even a
speculative future version of a logical replication system like Slony.
[2] refers to these under "4.7. Output Plugin". These are the typedefs
for the hook types involved (incidentally, I don't like the names of
these types – I think you should lose the CB):

Not particularly attached to the CB, so I can lose it.

+/* XXX: were currently passing the originating subtxn. Not sure thats necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache,
ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);

So we register these callbacks like this in the patch:

+	/*
+	 * allocate an ApplyCache that will apply data using lowlevel calls
+	 * without type conversion et al. This requires binary compatibility
+	 * between both systems.
+	 * XXX: This would be the place to hook different apply methods, like
+	 * producing sql and applying it.
+	 */
+	apply_cache = ApplyCacheAllocate();
+	apply_cache->begin = decode_begin_txn;
+	apply_cache->apply_change = decode_change;
+	apply_cache->commit = decode_commit_txn;

The decode_xlog(lsn, lsn) debugging function that Andres has played
with [6] (that this patch makes available, for now) is where this code
comes from.
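
For what it's worth, given those typedefs, the begin and commit callbacks of
the debugging plugin can presumably be as trivial as this (my sketch;
decode_change() is where the interesting WARNINGs get raised):

static void
decode_begin_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(WARNING, "BEGIN");
}

static void
decode_commit_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(WARNING, "COMMIT");
}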

Whenever ApplyCache calls an "apply_change" callback for a single
change (that is, an INSERT|UPDATE|DELETE) it locally overrides the
normal SnapshotNow semantics used for catalog access with a previously
built snapshot. Behaviour should now be consistent with a normal
SnapshotNow acquired when the tuple change was originally written to
WAL.

I additionally want to pass an MVCC-ish Snapshot (should just need a change in
function signatures) so the output plugins can query the catalog manually.

Having commented specifically on modules that Andres highlighted, I'd
like to highlight one myself: tqual.c . This module has had
significant new functionality added, so it would be an omission not to
remark on it in this opening e-mail, having mentioned all other
modules with significant new pieces of functionality. The file has had
new utility functions added, that pertain to snapshot visibility
during decoding - "time travel".

For me that mentally was part of catalog timetravel, thats why I didn't
highlight it separately. But yes, this needs to be discussed.

I draw attention to this. This code is located within the new function
HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is
done for "dirty snapshots" (dirty snapshots are used with
ri_triggers.c, for example, when even uncommitted tuples should be
visible).

Not sure where you see the similarity with dirty snapshots?

Both of these functions are generally accessed through
function pointers. Anyway, here's the code:

+	/*
+	 * FIXME: The not yet existing decoding infrastructure will need to force
+	 * the xmin to stay lower than what they are currently decoding.
+	 */
+	bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished
patch. This FIXME is rather hard for me to take at face value. It
seems to me that the need to coordinate decoding with xmin horizon
itself represents a not insignificant engineering challenge. So, with
all due respect, it would be nice if I wasn't asked to make that leap
of faith. The xmin horizon prepared transaction hack needs to go.

The idea is to do the decoding inside walsenders (or something very similar).
Those already have the capability to keep the xmin horizon nailed to some
point for the hot_standby_feedback feature. There is a need for something
similar to what prepared transactions provide for restartability after a
restart, though.

Within tqual.h, shouldn't you have something like this, but for time
travel snapshots during decoding?:

#define InitDirtySnapshot(snapshotdata) \
((snapshotdata).satisfies = HeapTupleSatisfiesDirty)

Hm. Ok.
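
Presumably something along these lines, assuming the
HeapTupleSatisfiesMVCCDuringDecoding() function this patch adds:

#define InitDecodingSnapshot(snapshotdata) \
	((snapshotdata).satisfies = HeapTupleSatisfiesMVCCDuringDecoding)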

Alternatively, adding a variable to these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

Not sure where you're going with this?

My general feeling is that the code is very under-commented, and in
need of a polish, though I'm sure that you are perfectly well aware of
that.

Totally agreed.

The basic way all of these components that I have described
separately fit together is: (if others want to follow this, refer to
decode_xlog())

1. Start with some client code “output plugin” (for now, a throw-away
debugging function “decode_xlog()”)

The general idea is to integrate this into the walsender framework in a
command very roughly looking like:
START_LOGICAL_REPLICATION $plugin LSN

\ /
2. Client allocates an XLogReaderState. (This module is a black box to
me, though it's well encapsulated so that shouldn't matter much.
Heikki is reviewing this [7]. Like I said, this isn't quite an atomic
unit I'm reviewing.)

\ /
3. Plugin registers various callbacks (within logicalfuncs.c). These
callbacks, while appearing in this patch, are mostly NO-OPS, and are
somewhat specific to XLogReader's concerns. I mostly defer to Heikki
here.

\ /
4. Plugin allocates an “ApplyCache”. Plugin assigns some more
callbacks to “ApplyCache”. This time, they're the aforementioned 3
apply cache functions.

\ /
5. Plugin assigns this new ApplyCache to a variable within the private state
of the XLogReader (this private state is a subset of its main state,
and is opaque to XLogReader).

\ /
6. Finally, plugin calls XLogReader(main_state).

\ /
7. At some point during its magic, XLogReader calls the hook registered in
step 3, finished_record. This is all it does directly with the plugin, which
it makes minimal assumptions about.

\ /
8. finished_record (which is logically a part of the “plugin”) knows what
type the opaque private_data actually is. It casts it to an apply_state, and
calls the decoder (supplying the apply_state as an argument to
DecodeRecordIntoApplyCache()).

\ /
9. During the first call (within the first record within a call to
decode_xlog()), we allocate a snapshot reader.

\ /
10. Builds snapshot callback. This scribbles on our snapshot state, which
essentially encapsulates a snapshot. The state (and snapshot) changes
continually, once per call.

It changes only if there were catalog changes in the transaction and/or we
haven't yet built an initial snapshot.

\ /
11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an
XLogRecord. Decodes based on record type. Let's assume it's an
XLOG_HEAP_INSERT.

\ /
12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the
tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding
to an XID). We don't store the relation oid, because we don't know it yet
(only the relfilenode is known from WAL).

\ /
13. We're back in XLogReader(). It calls the only callback of interest to us
covered in step 3 (and not of interest to XLogReader()/Heikki) –
decode_change(). It does this through the apply_cache.apply_change hook. This
happens because we encounter another record, this time a commit record (in
the same codepath as discussed in step 12).

\ /
14. In decode_change(), the actual function that raises the interesting
WARNINGs within Andres' earlier example [6], showing actual integer/varchar
Datum values for tuples previously inserted. Resolve the table oid based on
the relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs,
syscache and typcache, build the WARNING string.

So my general opinion of how all this fits together is that it isn't
quite right. Problems include:

I think it's more understandable if you consider that normally the
initialization code wouldn't be repeated in each plugin; the plugins are just
called from a walsender.

* Why does the ReaderApplyState get magically initialised in two
stages? apply_cache is initialised in decode_xlog (or whatever
plugin). Snapstate is allocated within DecodeRecordIntoApplyCache()
(with a dependency on apply_cache). Shouldn't this all be happening
within a single function? As you yourself have pointed out, not everyone
needs to know about these snapshots.

The reason is/was that I didn't want the outside to know anything about the
snapshot building; that seems to be relevant only to decode.c (and somewhat
to applycache.c).
I can look at it though.

* Maybe I've missed something, but I think you need a more
satisfactory example plugin. What happens in step 14 is plainly
unacceptable. You haven't adequately communicated to me how this is
going to be used in logical replication. Maybe I just haven't got that
far yet.

Well, what would you like to happen instead? Does it make more sense with the
concept that the output plugin will eventually run inside walsender(-ish) and
stream changes via COPY? I just couldn't finish a POC for that in time.

I'm not impressed by the InvalidateSystemCaches() calls here
and elsewhere.

Yes, those definitely have got to go, and we have nearly all the
infrastructure for it from Hot Standby. It was noted as a deficiency in the
overview mail & the commit message though ;). I have that locally.

The rough idea here is to reuse xl_xact_commit->nmsgs to handle the cache
behaviour. There are some more complications but let me first write the
timetravel document.
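
To sketch the direction (only xl_xact_commit->nmsgs and
LocalExecuteInvalidationMessage() exist today; the function below and the
message array's placement are illustrative):

static void
ReplayCommitInvalidations(xl_xact_commit *xlrec, SharedInvalidationMessage *msgs)
{
	int			i;

	/*
	 * Apply the sinval messages the committing transaction queued, so the
	 * decoder's caches follow the catalog changes instead of being nuked
	 * wholesale by InvalidateSystemCaches().
	 */
	for (i = 0; i < xlrec->nmsgs; i++)
		LocalExecuteInvalidationMessage(&msgs[i]);
}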

* Please break-out the client code as a contrib module. That
separation would increase the readability of the patch.

Well, as I said above, the debugging function in its current form is not going
to fly; as soon as we have agreed on how to integrate this into walsender I am
definitely going to provide an example/debugging plugin, which I intend to
submit.

Thanks!

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#45Bruce Momjian
bruce@momjian.us
In reply to: Greg Stark (#43)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thu, Oct 11, 2012 at 03:16:39AM +0100, Greg Stark wrote:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

Firstly, I haven't read the code but I'm confident it doesn't make the
elementary error of assuming commit order == xid order. I assume it's
applying the reassembled transactions in commit order.

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants. The data written by
this transaction is either written or not when the commit happens and
it's all written or not at that time. Even in non-serializable mode
updates take row locks and nobody can see the data or modify it until
the transaction commits.

How does Slony write its changes without causing serialization replay
conflicts?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#43)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

Greg Stark <stark@mit.edu> writes:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants.

I would believe that argument if the "apply" operations were at a
similar logical level to our current WAL records, namely drop these bits
into that spot. Unfortunately, they're not. I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in "source" terms. Even something as
simple as a few column renames could break it, let alone anything as
esoteric as changing the meaning of datatype literals.

regards, tom lane

#47Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#40)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thursday, October 11, 2012 03:10:48 AM Robert Haas wrote:

On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining. With streaming
replication as it exists today, a transaction that modifies a huge
amount of data (such as a bulk load) can be applied on the standby as
it happens. The rows thus inserted will become visible only if and
when the transaction commits on the master and the commit record is
replayed on the standby. This has a number of important advantages,
perhaps most importantly that the lag between commit and data
visibility remains short. With the proposed system, we can't start
applying the changes until the transaction has committed and the
commit record has been replayed, so a big transaction is going to have
a lot of apply latency.

I don't think there is a fundamental problem here, just incremental ones.

The major problems are:
* transactions with DDL & DML currently need to be reassembled; it might be
possible to resolve this though, I haven't thought about it too much
* subtransactions are only assigned to toplevel transactions at commit time
* you need a variable number of backends/parallel transactions open on the
target system to apply all the transactions concurrently. You can't smash
them together because one of them might roll back.

All of those seem solvable to me, so I am not too worried about the addition
of a streaming mode somewhere down the line. I don't want to focus on it
right now though. Ok?

Here's you making the same point in different words:

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

Second, crash recovery. I think whatever we put in place here has to
be able to survive a crash on any node. Decoding must be able to
restart successfully after a system crash, and it has to be able to
apply exactly the set of transactions that were committed but not
applied prior to the crash. Maybe an appropriate mechanism for this
already exists or has been discussed, but I haven't seen it go by;
sorry if I have missed the boat.

I have discussed it privately & roughly prototyped it, but not publicly.
There are two pieces to this:
1) being restartable after a crash/disconnection/shutdown
2) picking up exactly where it stopped

Those are somewhat different because 1) is relevant on the source side and
can be solved there. 2) depends on the target system, because it needs to
ensure that it safely received the changes up to some point.

The idea for 1) is to serialize the applycache whenever we reach a checkpoint
and have that as a starting point for every confirmed flush location of 2).

Obviously 2) will need cooperation by the receiving side.
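
So roughly two positions need to be tracked, along the lines of this
hypothetical struct (nothing of the sort exists in the patch yet):

typedef struct DecodingRestartState
{
	XLogRecPtr	serialized_cache_lsn;	/* applycache state spilled at the
										 * last checkpoint */
	XLogRecPtr	confirmed_flush_lsn;	/* position the receiving side has
										 * acknowledged as safely received */
} DecodingRestartState;

Decoding can then restart from serialized_cache_lsn and resend everything
after confirmed_flush_lsn.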

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.

So we're in violent agreement here ;)

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#48Andres Freund
andres@2ndquadrant.com
In reply to: Greg Stark (#43)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thursday, October 11, 2012 04:16:39 AM Greg Stark wrote:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

Firstly, I haven't read the code but I'm confident it doesn't make the
elementary error of assuming commit order == xid order. I assume it's
applying the reassembled transactions in commit order.

Yes, it's commit order.

Imo applying in commit order is more like assuming all transactions run in
read committed (and nothing above) than assuming serializable? Or am I
missing something?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants. The data written by
this transaction is either written or not when the commit happens and
it's all written or not at that time. Even in non-serializable mode
updates take row locks and nobody can see the data or modify it until
the transaction commits.

Yes. There will be problems if you want to make serializable work across
nodes, but that seems like something fiendishly complex anyway. I don't plan
to work on it in the foreseeable future.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#49Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#46)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:

Greg Stark <stark@mit.edu> writes:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants.

I would believe that argument if the "apply" operations were at a
similar logical level to our current WAL records, namely drop these bits
into that spot. Unfortunately, they're not. I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in "source" terms. Even something as
simple as a few column renames could break it,

Not sure what you're getting at here? Are you talking about the problems on
the source side or the target side?

During decoding such problems should be handled already. As we reconstruct a
Snapshot that makes catalog access look like it did back when the tuple was
written to the WAL, we have the exact column names, data types and
everything. The locking used when making the original changes prevents the
data types and column names from being changed mid-transaction.

If you're talking about the receiving/apply side: Sure, if you do careless
DDL and you don't replicate DDL (not included here, separate project), you're
going to have problems. I don't think there is much we can do about that.

let alone anything as esoteric as changing the meaning of datatype literals.

Hm. You mean like changing the input format of a datatype? Yes, sure, that
will cause havoc.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#50Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#49)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Thursday, October 11, 2012 04:49:20 AM Andres Freund wrote:

On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:

Greg Stark <stark@mit.edu> writes:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Isn't there an even more serious problem, namely that this assumes
all transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants.

I would believe that argument if the "apply" operations were at a
similar logical level to our current WAL records, namely drop these bits
into that spot. Unfortunately, they're not. I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in "source" terms. Even something as
simple as a few column renames could break it,

Not sure what youre getting at here? Are you talking about the problems at
the source side or the target side?

During decoding such problems should be handled already

Btw, the introductory email upthread shows a trivial example. As submitted
the code cannot handle intermingled DDL/DML transactions, but I fixed that now.

There are some problems with CLUSTER/VACUUM FULL on system catalogs, but
that's going to be a separate thread...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#51Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#33)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 22.09.2012 20:00, Andres Freund wrote:

[[basic-schema]]
.Architecture Schema
["ditaa"]
------------------------------------------------------------------------------
                    Traditional Stuff

 +---------+---------+---------+---------+----+
 | Backend | Backend | Backend | Autovac | ...|
 +----+----+---+-----+----+----+----+----+-+--+
      |        |          |         |      |
      +------+ | +--------+         |      |
    +-+      | | | +----------------+      |
    |        | | | |                       |
    |        v v v v                       |
    |     +------------+                   |
    |     | WAL writer |<------------------+
    |     +------------+
    |       | | | | |
    v       v v v v v       +-------------------+
+--------+ +---------+   +->| Startup/Recovery  |
|{s}     | |{s}      |   |  +-------------------+
|Catalog | |   WAL   |---+->| SR/Hot Standby    |
|        | |         |   |  +-------------------+
+--------+ +---------+   +->| Point in Time     |
    ^          |            +-------------------+
 ---|----------|--------------------------------
    |          |       New Stuff
+---+          |
|              v            Running separately
| +----------------+  +=-------------------------+
| | Walsender  |   |  |                          |
| |            v   |  |    +-------------------+ |
| +-------------+  |  | +->| Logical Rep.      | |
| |     WAL     |  |  | |  +-------------------+ |
+-|  decoding   |  |  | +->| Multimaster       | |
| +-------------/  |  | |  +-------------------+ |
| |            |   |  | +->| Slony             | |
| |            v   |  | |  +-------------------+ |
| +-------------+  |  | +->| Auditing          | |
| |     TX      |  |  | |  +-------------------+ |
+-| reassembly  |  |  | +->| Mysql/...         | |
| +-------------/  |  | |  +-------------------+ |
| |            |   |  | +->| Custom Solutions  | |
| |            v   |  | |  +-------------------+ |
| +-------------+  |  | +->| Debugging         | |
| |   Output    |  |  | |  +-------------------+ |
+-|   Plugin    |--|--|-+->| Data Recovery     | |
  +-------------/  |  |    +-------------------+ |
  |                |  |                          |
  +----------------+  +--------------------------+
------------------------------------------------------------------------------

This diagram triggers a pet-peeve of mine: What do all the boxes and
lines mean? An architecture diagram should always include a key. I find
that when I am drawing a diagram myself, adding the key clarifies my own
thinking too.

This looks like a data-flow diagram, where the arrows indicate the data
flows between components, and the boxes seem to represent processes. But
in that case, I think the arrows pointing from the plugins in walsender
to Catalog are backwards. The catalog information flows from the Catalog
to walsender, walsender does not write to the catalogs.

Zooming out to look at the big picture, I think the elephant in the room
with this whole effort is how it fares against trigger-based
replication. You list a number of disadvantages that trigger-based
solutions have, compared to the proposed logical replication. Let's take
a closer look at them:

* essentially duplicates the amount of writes (or even more!)

True.

* synchronous replication hard or impossible to implement

I don't see any explanation of how it could be implemented in the proposed
logical replication either.

* noticeable CPU overhead
* trigger functions
* text conversion of data

Well, I'm pretty sure we could find some micro-optimizations for these
if we put in the effort. And the proposed code isn't exactly free, either.

* complex parts implemented in several solutions

Not sure what this means, but the proposed code is quite complex too.

* not in core

IMHO that's a good thing, and I'd hope this new logical replication to
live outside core as well, as much as possible. But whether or not
something is in core is just a political decision, not a reason to
implement something new.

If the only meaningful advantage is reducing the amount of WAL written,
I can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says. Sorry to be a
naysayer, but I'm pretty scared of all the new code and complexity these
patches bring into core.

PS. I'd love to see a basic Slony plugin for this, for example, to see
how much extra code on top of the posted patches you need to write in a
plugin like that to make it functional. I'm worried that it's a lot..

- Heikki

#52Hannu Krosing
hannu@krosing.net
In reply to: Tom Lane (#46)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 10/11/2012 04:31 AM, Tom Lane wrote:

Greg Stark <stark@mit.edu> writes:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants.

I would believe that argument if the "apply" operations were at a
similar logical level to our current WAL records, namely drop these bits
into that spot. Unfortunately, they're not. I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in "source" terms. Even something as
simple as a few column renames could break it, let alone anything as
esoteric as changing the meaning of datatype literals.

This is the whole reason for moving the reassembly to the source
side and having the possibility to use old snapshots to get the
catalog information.

Also, the locks that protect you from the effects of field name changes
by concurrent DDL transactions also protect the logical reassembly,
if it is done in commit order.


regards, tom lane

#53Hannu Krosing
hannu@krosing.net
In reply to: Robert Haas (#40)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 10/11/2012 03:10 AM, Robert Haas wrote:

On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

The lack of pipelining (and the ensuing complexity of applycache and
spilling to disk) is something we have discussed with Andres, and to my
understanding it is not a final design decision but just a stepping stone
in how this quite large development is structured.

The pipelining (or parallel apply, as I described it) requires either a
large number of apply backends and code to manage them, or autonomous
transactions.

It could (arguably!) be easier to implement autonomous transactions
instead of apply cache, but Andres had valid reasons to start with apply
cache and move to parallel apply later.

As I understand it, parallel apply is definitely one of the things that
will be coming, and after that the performance characteristics (fast AND
smooth) will be very similar to current physical WAL streaming.


With streaming
replication as it exists today, a transaction that modifies a huge
amount of data (such as a bulk load) can be applied on the standby as
it happens. The rows thus inserted will become visible only if and
when the transaction commits on the master and the commit record is
replayed on the standby. This has a number of important advantages,
perhaps most importantly that the lag between commit and data
visibility remains short. With the proposed system, we can't start
applying the changes until the transaction has committed and the
commit record has been replayed, so a big transaction is going to have
a lot of apply latency.

Now, I am not 100% opposed to a design that surrenders this property
in exchange for other important benefits, but I think it would be
worth thinking about whether there is any way that we can design this
that either avoids giving that property up at all, or gives it up for
the time being but allows us to potentially get back to it in a later
version. Reassembling complete transactions is surely cool and some
clients will want that, but being able to apply replicated
transactions *without* reassembling them in their entirety is even
cooler, and some clients will want that, too.

If we're going to stick with a design that reassembles transactions, I
think there are a number of issues that deserve careful thought.
First, memory usage. I don't think it's acceptable for the decoding
process to assume that it can allocate enough backend-private memory
to store all of the in-flight changes (either as WAL or in some more
decoded form). We have assiduously avoided such assumptions thus far;
you can write a terabyte of data in one transaction with just a
gigabyte of shared buffers if you so desire (and if you're patient).
Here's you making the same point in different words:

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

Second, crash recovery. I think whatever we put in place here has to
be able to survive a crash on any node. Decoding must be able to
restart successfully after a system crash, and it has to be able to
apply exactly the set of transactions that were committed but not
applied prior to the crash. Maybe an appropriate mechanism for this
already exists or has been discussed, but I haven't seen it go by;
sorry if I have missed the boat.

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.

#54Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#51)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:

On 22.09.2012 20:00, Andres Freund wrote:

[[basic-schema]]
.Architecture Schema
["ditaa"]
------------------------------------------------------------------------------
[Architecture Schema diagram quoted from the design document; its alignment
was lost in quoting. See the intact copy upthread.]
------------------------------------------------------------------------------

This diagram triggers a pet-peeve of mine: What do all the boxes and
lines mean? An architecture diagram should always include a key. I find
that when I am drawing a diagram myself, adding the key clarifies my own
thinking too.

Hm. Ok.

This looks like a data-flow diagram, where the arrows indicate the data
flows between components, and the boxes seem to represent processes. But
in that case, I think the arrows pointing from the plugins in walsender
to Catalog are backwards. The catalog information flows from the Catalog
to walsender, walsender does not write to the catalogs.

The reason I drew it that way is that it actively needs to go back to the
catalog and query it, which is somewhat different from the rest, which
basically could be seen as a unidirectional pipeline.

Zooming out to look at the big picture, I think the elephant in the room
with this whole effort is how it fares against trigger-based
replication. You list a number of disadvantages that trigger-based
solutions have, compared to the proposed logical replication. Let's take

a closer look at them:

* essentially duplicates the amount of writes (or even more!)

True.

By now I think it's essentially unfixable.

* synchronous replication hard or impossible to implement
I don't see any explanation of how it could be implemented in the proposed
logical replication either.

It's basically the same as it is for synchronous streaming replication. At
the place where SyncRepWaitForLSN() is done you instead/also wait for the
decoding to reach that LSN (it's in the WAL, so everything is decodable) and
for the other side to have confirmed reception of those changes. I think this
should be doable with only minor code modifications.
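
Roughly like this; SyncRepWaitForLSN() exists today, the logical variant is
made up:

	/* in the commit path, after the commit record has been flushed */
	XLogRecPtr	commit_lsn = XactLastRecEnd;

	/* existing physical synchronous replication wait */
	SyncRepWaitForLSN(commit_lsn);

	/* hypothetical: wait until decoding has streamed everything up to
	 * commit_lsn and the peer has confirmed reception */
	LogicalRepWaitForConfirmedLSN(commit_lsn);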

The existing support for all that is basically the reason we want to reuse the
walsender framework. (will start a thread about that soon)

* noticeable CPU overhead

* trigger functions
* text conversion of data

Well, I'm pretty sure we could find some micro-optimizations for these
if we put in the effort.

Any improvements there are a good idea independent of this proposal, but I
don't see how we can fundamentally improve on the status quo.

And the proposed code isn't exactly free, either.

If you don't have frequent DDL it's really not all that expensive. In the
version without DDL support I didn't manage to saturate the ApplyCache with
either parallel COPY in individual transactions (repeated 100MB files) or
with pgbench.
Also it's basically doing work that the trigger/queue based solutions have to
do as well, just that they do it via far less optimized SQL statements.

DDL support doesn't really change much, as the overhead for transactions
without DDL and without concurrently running DDL should be fairly minor (the
submitted version is *not* finalized there, it builds a new snapshot instead
of copying/referencing the old one).

* complex parts implemented in several solutions

Not sure what this means, but the proposed code is quite complex too.

It is, agreed.

What I mean is that significantly complex logic is buried in the encoding,
queuing and decoding/ordering logic of every trigger-based replication
solution. That's not a good thing.

* not in core

IMHO that's a good thing, and I'd hope this new logical replication to
live outside core as well, as much as possible.

I don't agree there, but I would like to keep that a separate discussion.

For now I/we only want to submit the changes that technically need in-core
support to work sensibly (this, background workers, some walsender
integration). The goal of working nearly completely without special in-core
support held the existing solutions back quite a bit imo.

But whether or not something is in core is just a political decision, not a
reason to implement something new.

Isn't it both? There are things you simply cannot do unless you're inside core.

Politically I think the external status of all those logical replication
projects grew to be an adoption barrier. I don't even want to think about how
many bad home-grown logical replication solutions I have seen out there that
implement everything from scratch.

If the only meaningful advantage is reducing the amount of WAL written,
I can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says.

You're welcome to make suggestions, but everything I could think of that
didn't fall short of reality ended up basically duplicating the amount of
writes & fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one authoritative
point where you can rely on a commit to have been successful, and that's when
the commit record has been written to the WAL. You can't send out the data to
be committed before that's written, because that could result in spuriously
committed transactions on the remote side, and you can't easily do it
afterwards because you can crash after the commit.

Sorry to be a naysayer, but I'm pretty scared of all the new code and
complexity these patches bring into core.

Understandable. I tried to keep the introduction of complexity in existing code
paths relatively minor and I think I mostly succeeded there but it still needs
to be maintained.

PS. I'd love to see a basic Slony plugin for this, for example, to see
how much extra code on top of the posted patches you need to write in a
plugin like that to make it functional. I'm worried that it's a lot..

I think before it's possible to do something like that a few more design
decisions need to be made. Mostly the walsender(ish) integration needs to be
done.

After that I can imagine writing a demo plugin that outputs changes in a
Slony-compatible format, but I would like to see some Slony/Londiste person
cooperating on receiving/applying those.

What complications are you imagining?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#55Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#45)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 10/10/12 7:26 PM, Bruce Momjian wrote:

How does Slony write its changes without causing serialization replay
conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*
b) Slony uses serializable mode internally for row replication
c) it's possible (though difficult) for creative usage to get Slony into
a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically)
to have statement-based replication without some constraints on the
statements you can run, or some replication failures. I think we should
expect 9.3's logical replication out-the-gate to have some issues and
impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of
the common failure cases (now(), random(), nextval()) suffered by pgPool
and similar tools. But remember, this feature doesn't have to be
*perfect*, it just has to be *better* than the alternatives.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#56Christopher Browne
cbbrowne@gmail.com
In reply to: Bruce Momjian (#45)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On Wed, Oct 10, 2012 at 10:26 PM, Bruce Momjian <bruce@momjian.us> wrote:

How does Slony write its changes without causing serialization replay
conflicts?

It uses a sequence to break any ordering conflicts at the time that
data is inserted into its log tables.

If there are two transactions, A and B, that were "fighting" over a
tuple on the origin, then either:

a) A went first, B went second, and the ordering in the log makes that
order clear;
b) A succeeds, then B fails, so there's no conflict;
c) A is doing its thing, and B is blocked behind it for a while, then
A fails, and B gets to go through, and there's no conflict.

Switch A and B as needed.

The sequence that is used establishes what is termed a "compatible
ordering." There are multiple possible compatible orderings; ours
happens to interleave transactions together, with the sequence
guaranteeing absence of conflict.

If we could get commit orderings, then a different but still
"compatible ordering" would be to have each transaction establish its
own internal sequence, and apply things in order based on
(commit_tx_order, sequence_within).
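
In code terms, the comparison is simply (illustrative types, not Slony
source):

typedef struct LogRow
{
	uint64		commit_tx_order;	/* position of the commit in the WAL */
	uint64		sequence_within;	/* row's sequence inside its transaction */
} LogRow;

static int
logrow_cmp(const void *a, const void *b)
{
	const LogRow *ra = (const LogRow *) a;
	const LogRow *rb = (const LogRow *) b;

	if (ra->commit_tx_order != rb->commit_tx_order)
		return (ra->commit_tx_order < rb->commit_tx_order) ? -1 : 1;
	if (ra->sequence_within != rb->sequence_within)
		return (ra->sequence_within < rb->sequence_within) ? -1 : 1;
	return 0;
}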
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#43)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 11 October 2012 03:16, Greg Stark <stark@mit.edu> wrote:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable? What happens when they aren't?
Or even just that the effective commit order is not XID order?

Firstly, I haven't read the code but I'm confident it doesn't make the
elementary error of assuming commit order == xid order. I assume it's
applying the reassembled transactions in commit order.

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants. The data written by
this transaction is either written or not when the commit happens and
it's all written or not at that time. Even in non-serializable mode
updates take row locks and nobody can see the data or modify it until
the transaction commits.

This uses Commit Serializability, which is valid, as you say.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#58Steve Singer
steve@ssinger.info
In reply to: Josh Berkus (#55)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

On 12-10-11 06:27 PM, Josh Berkus wrote:

On 10/10/12 7:26 PM, Bruce Momjian wrote:

How does Slony write its changes without causing serialization replay
conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*

True, but the proposed logical replication also would replicate rows, not
the original statements. I don't consider this proposal to be an
example of 'statement' replication like pgpool is. If the original
SQL was 'update foo set x=x+1 where id > 10', there will be a WAL
record to decode for each row the statement modified. In a million row
table I'd expect the replica will have to apply a million records
(whether they be binary heap tuples or SQL statements).

b) Slony uses serializable mode internally for row replication

Actually, recent versions of Slony apply transactions against the replica
in read committed mode. Older versions used serializable mode, but with
the SSI changes in 9.1 we found Slony tended to have serialization
conflicts with itself on its internal tables, resulting in a lot of
aborted transactions.

When Slony applies changes on a replica it does so in a single
transaction. Slony finds a set of transactions that committed on the
master in between two SYNC events. It then applies all of the rows
changed by any of those transactions as part of a single transaction on
the replica. Chris's post explains this in more detail.

Conflicts with user transactions on the replica are possible.


c) it's possible (though difficult) for creative usage to get Slony into
a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically)
to have statement-based replication without some constraints on the
statements you can run, or some replication failures. I think we should
expect 9.3's logical replication out-the-gate to have some issues and
impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of
the common failure cases (now(), random(), nextval()) suffered by pgPool
and similar tools. But remember, this feature doesn't have to be
*perfect*, it just has to be *better* than the alternatives.

#59Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#51)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

IMHO that's a good thing, and I'd hope this new logical replication to live
outside core as well, as much as possible. But whether or not something is
in core is just a political decision, not a reason to implement something
new.

If the only meaningful advantage is reducing the amount of WAL written, I
can't help thinking that we should just try to address that in the existing
solutions, even if it seems "easy to solve at a first glance, but a solution
not using a normal transactional table for its log/queue has to solve a lot
of problems", as the document says. Sorry to be a naysayer, but I'm pretty
scared of all the new code and complexity these patches bring into core.

I think what we're really missing at the moment is a decent way of
decoding WAL. There are a decent number of customers who, when
presented with replication system, start by asking whether it's
trigger-based or WAL-based. When you answer that it's trigger-based,
their interest goes... way down. If you tell them the triggers are
written in anything but C, you lose a bunch more points. Sure, some
people's concerns are overblown, but it's hard to escape the
conclusion that a WAL-based solution can be a lot more efficient than
a trigger-based solution, and EnterpriseDB has gotten comments from a
number of people who upgraded to 9.0 or 9.1 to the effect that SR was
way faster than Slony.

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core. I'm right
on board with everything up to the point where we start kicking out a
stream of decoded changes to the user... and that's about it. To pick
on Slony for the moment, as the project that has been around for the
longest and has probably the largest user base (outside of built-in
SR, perhaps), they've got a project that they have been developing for
years and years and years. What have they been doing all that time?
Maybe they are just stupid, but Chris and Jan and Steve don't strike
me that way, so I think the real answer is that they are solving
problems that we haven't even started to think about yet, especially
around control logic: how do you turn it on? how do you turn it off?
how do you handle node failures? how do you handle it when a node
gets behind? We are not going to invent good solutions to all of
those problems between now and January, or even between now and next
January.

PS. I'd love to see a basic Slony plugin for this, for example, to see how
much extra code on top of the posted patches you need to write in a plugin
like that to make it functional. I'm worried that it's a lot..

I agree. I would go so far as to say that if Slony can't integrate
with this work and use it in place of their existing change-capture
facility, that's sufficient grounds for unconditional rejection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#60Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#59)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote:

On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas

<hlinnakangas@vmware.com> wrote:

IMHO that's a good thing, and I'd hope this new logical replication to
live outside core as well, as much as possible. But whether or not
something is in core is just a political decision, not a reason to
implement something new.

If the only meaningful advantage is reducing the amount of WAL written, I
can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says. Sorry to be a
naysayer, but I'm pretty scared of all the new code and complexity these
patches bring into core.

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

Unsurprisingly, I agree.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite
productive in getting something started...

Note that at this point we only want WAL decoding, background workers and
related things to get integrated...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#61Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#60)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote:

On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote:

On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas

<hlinnakangas@vmware.com> wrote:

IMHO that's a good thing, and I'd hope this new logical replication to
live outside core as well, as much as possible. But whether or not
something is in core is just a political decision, not a reason to
implement something new.

If the only meaningful advantage is reducing the amount of WAL written, I
can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says. Sorry to be a
naysayer, but I'm pretty scared of all the new code and complexity these
patches bring into core.

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

Unsurprisingly, I agree.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite
productive in getting something started...

Note that at this point we only want WAL decoding, background workers and
related things to get integrated...

Well, TODO does have:

Move pgfoundry's xlogdump to /contrib and have it rely more closely on
the WAL backend code

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#62Andres Freund
andres@2ndquadrant.com
In reply to: Bruce Momjian (#61)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 08:19:54 PM Bruce Momjian wrote:

On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote:

On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote:

On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas

<hlinnakangas@vmware.com> wrote:

IMHO that's a good thing, and I'd hope this new logical replication
to live outside core as well, as much as possible. But whether or
not something is in core is just a political decision, not a reason
to implement something new.

If the only meaningful advantage is reducing the amount of WAL
written, I can't help thinking that we should just try to address
that in the existing solutions, even if it seems "easy to solve at a
first glance, but a solution not using a normal transactional table
for its log/queue has to solve a lot of problems", as the document
says. Sorry to be a naysayer, but I'm pretty scared of all the new
code and complexity these patches bring into core.

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

Unsurprisingly, I agree.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite
productive in getting something started...

Note that at this point we only want WAL decoding, background workers and
related things to get integrated...

Well, TODO does have:

Move pgfoundry's xlogdump to /contrib and have it rely more closely on
the WAL backend code

Uhm. How does that relate to my statement?

The xlogreader code I submitted does contain a very small POC xlogdump with
almost no code duplication. It needs some work to be really useful though.

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I agree, and I don't think I have ever said anything to the contrary. I just
don't want to be the only one working on Slony integration. I am ready to do a
good part of that, but somebody with Slony experience needs to help, especially
on consuming those changes.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#63Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andres Freund (#54)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/11/2012 01:42 PM, Andres Freund wrote:

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
...
If the only meaningful advantage is reducing the amount of WAL written,
I can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says.
You're welcome to make suggestions, but everything I could think of that didn't
fall short of reality ended up basically duplicating the amount of writes &
fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one authoritative
point where you can rely on a commit having been successful, and that's when
the commit record has been written to the WAL. You can't send out the data to
be committed before that's written, because that could result in spuriously
committed transactions on the remote side, and you can't easily do it afterwards
because you can crash after the commit.

Just curious here, but do you know how this part is solved in current sync
WAL replication? You can get "spurious" commits on the slave side if the
master dies while waiting for confirmation.

What complications are you imagining?

Hannu

#64Andres Freund
andres@2ndquadrant.com
In reply to: Hannu Krosing (#63)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote:

On 10/11/2012 01:42 PM, Andres Freund wrote:

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
...
If the only meaningful advantage is reducing the amount of WAL written,
I can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says.
You're welcome to make suggestions, but everything I could think of that
didn't fall short of reality ended up basically duplicating the amount
of writes & fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one
authoritative point where you can rely on a commit having been
successful, and that's when the commit record has been written to the
WAL. You can't send out the data to be committed before that's written,
because that could result in spuriously committed transactions on the
remote side, and you can't easily do it afterwards because you can crash
after the commit.

Just curious here, but do you know how this part is solved in current sync
WAL replication? You can get "spurious" commits on the slave side if the
master dies while waiting for confirmation.

Synchronous replication is only synchronous with respect to the COMMIT reply sent
to the user. First the commit is written to WAL locally, so it persists across
a crash (c.f. RecordTransactionCommit). Only then do we wait for the standby
(SyncRepWaitForLSN). After that has finished, the shared memory on the primary gets
updated (c.f. ProcArrayEndTransaction in CommitTransaction), and soon after that
the user gets the response to the COMMIT back.
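
To make that ordering concrete, here is a minimal, self-contained sketch of
the sequence just described. The lower-case helpers are stand-in stubs for the
backend routines named above, not their real signatures:

#include <stdio.h>

typedef unsigned long XLogRecPtr;

/* stand-in for RecordTransactionCommit: write and flush the commit record */
static XLogRecPtr record_transaction_commit(void)
{
    puts("1. commit record written and flushed to local WAL");
    return 42;
}

/* stand-in for SyncRepWaitForLSN: block until a sync standby has acked */
static void sync_rep_wait_for_lsn(XLogRecPtr lsn)
{
    printf("2. wait until the standby acknowledges LSN %lu\n", lsn);
}

/* stand-in for ProcArrayEndTransaction */
static void proc_array_end_transaction(void)
{
    puts("3. transaction marked finished in shared memory");
}

int main(void)
{
    XLogRecPtr commit_lsn = record_transaction_commit();

    sync_rep_wait_for_lsn(commit_lsn);
    proc_array_end_transaction();
    puts("4. COMMIT reply sent to the client");
    return 0;
}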

I am not really sure what you were asking for; does the above explanation
answer this?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#65Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Robert Haas (#59)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/15/2012 04:54 AM, Robert Haas wrote:

On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

IMHO that's a good thing, and I'd hope this new logical replication to live
outside core as well, as much as possible. But whether or not something is
in core is just a political decision, not a reason to implement something
new.

If the only meaningful advantage is reducing the amount of WAL written, I
can't help thinking that we should just try to address that in the existing
solutions, even if it seems "easy to solve at a first glance, but a solution
not using a normal transactional table for its log/queue has to solve a lot
of problems", as the document says. Sorry to be a naysayer, but I'm pretty
scared of all the new code and complexity these patches bring into core.

I think what we're really missing at the moment is a decent way of
decoding WAL. There are a decent number of customers who, when
presented with replication system, start by asking whether it's
trigger-based or WAL-based. When you answer that it's trigger-based,
their interest goes... way down. If you tell them the triggers are
written in anything but C, you lose a bunch more points. Sure, some
people's concerns are overblown, but it's hard to escape the
conclusion that a WAL-based solution can be a lot more efficient than
a trigger-based solution, and EnterpriseDB has gotten comments from a
number of people who upgraded to 9.0 or 9.1 to the effect that SR was
way faster than Slony.

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core. I'm right
on board with everything up to the point where we start kicking out a
stream of decoded changes to the user... and that's about it. To pick
on Slony for the moment, as the project that has been around for the
longest and has probably the largest user base (outside of built-in
SR, perhaps), they've got a project that they have been developing for
years and years and years. What have they been doing all that time?
Maybe they are just stupid, but Chris and Jan and Steve don't strike
me that way, so I think the real answer is that they are solving
problems that we haven't even started to think about yet, especially
around control logic: how do you turn it on? how do you turn it off?
how do you handle node failures? how do you handle it when a node
gets behind? We are not going to invent good solutions to all of
those problems between now and January, or even between now and next
January.

I understand that the current minimal target is to get on par with current
WAL streaming in terms of setup ease and performance, with the additional
benefit of having read-write subscribers with at least conflict detection
and logging.

Hoping that we have something by January that solves all possible
replication scenarios out of the box is unrealistic.


PS. I'd love to see a basic Slony plugin for this, for example, to see how
much extra code on top of the posted patches you need to write in a plugin
like that to make it functional. I'm worried that it's a lot..

I agree. I would go so far as to say that if Slony can't integrate
with this work and use it in place of their existing change-capture
facility, that's sufficient grounds for unconditional rejection.

#66Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Andres Freund (#64)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/15/2012 08:44 PM, Andres Freund wrote:

On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote:

On 10/11/2012 01:42 PM, Andres Freund wrote:

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
...
If the only meaningful advantage is reducing the amount of WAL written,
I can't help thinking that we should just try to address that in the
existing solutions, even if it seems "easy to solve at a first glance,
but a solution not using a normal transactional table for its log/queue
has to solve a lot of problems", as the document says.
You're welcome to make suggestions, but everything I could think of that
didn't fall short of reality ended up basically duplicating the amount
of writes & fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one
authoritative point where you can rely on a commit having been
successful, and that's when the commit record has been written to the
WAL. You can't send out the data to be committed before that's written,
because that could result in spuriously committed transactions on the
remote side, and you can't easily do it afterwards because you can crash
after the commit.

Just curious here, but do you know how this part is solved in current sync
WAL replication? You can get "spurious" commits on the slave side if the
master dies while waiting for confirmation.

Synchronous replication is only synchronous with respect to the COMMIT reply sent
to the user. First the commit is written to WAL locally, so it persists across
a crash (c.f. RecordTransactionCommit). Only then do we wait for the standby
(SyncRepWaitForLSN). After that has finished, the shared memory on the primary gets
updated (c.f. ProcArrayEndTransaction in CommitTransaction), and soon after that
the user gets the response to the COMMIT back.

I am not really sure what you were asking for; does the above explanation
answer this?

I think I mostly got it: if the master crashes before the commit confirmation
comes back, then the commit _will_ still be there after restart.

To the client it looks like it did not commit, but it is no different in this
respect from any other crash-before-confirmation, and thus the client cannot
rely on the commit not happening and has to check it.


Greetings,

Andres

#67Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#62)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Mon, Oct 15, 2012 at 08:26:08PM +0200, Andres Freund wrote:

I do not personally believe that a WAL decoding solution adequate to
drive logical replication can live outside of core, at least not
unless core exposes a whole lot more interface than we do now, and
probably not even then. Even if it could, I don't see the case for
making every replication solution reinvent that wheel. It's a big
wheel to be reinventing, and everyone needs pretty much the same
thing.

Unsurprisingly, I agree.

That having been said, I have to agree that the people working on this
project seem to be wearing rose-colored glasses when it comes to the
difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite
productive in getting something started...

Note that at this point we only want WAL decoding, background workers and
related things to get integrated...

Well, TODO does have:

Move pgfoundry's xlogdump to /contrib and have it rely more closely on
the WAL backend code

Uhm. How does that relate to my statement?

The xlogreader code I submitted does contain a very small POC xlogdump with
almost no code duplication. It needs some work to be really useful though.

I just meant that dumping xlog contents is something we want to improve.

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I agree, and I don't think I have ever said anything to the contrary. I just
don't want to be the only one working on Slony integration. I am ready to do a
good part of that, but somebody with Slony experience needs to help, especially
on consuming those changes.

Agreed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#68Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Robert Haas (#59)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/15/2012 04:54 AM, Robert Haas wrote:

PS. I'd love to see a basic Slony plugin for this, for example, to see how

much extra code on top of the posted patches you need to write in a plugin
like that to make it functional. I'm worried that it's a lot..

I agree. I would go so far as to say that if Slony can't integrate
with this work and use it in place of their existing change-capture
facility, that's sufficient grounds for unconditional rejection.

The fact that the current work starts with an "apply cache" instead of
streaming makes the semantics very close to how Londiste and Slony do this.

Therefore I don't think there will be any problem with "can't", though it may
be that there will be nobody actually doing it, at least not before January.

------------
Hannu

#69Peter Geoghegan
peter@2ndquadrant.com
In reply to: Bruce Momjian (#61)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future? That would be (with respect to the Slony
guys) a commercial risk that is fairly heavily concentrated with
Afilias. So, if you're going to attach as a condition to its
acceptance that the Slony guys be able to use it immediately (because
"can integrate" really means "will integrate", right?), you're
attaching a rather arbitrary condition that has nothing much to
do with the technical merit of the patches proposed. The fact of the
matter is that Slony was originally designed with a somewhat different
set of constraints to those that exist today, so I don't doubt that
this is something that they're going to need to integrate over time,
probably in a separate release branch, to get the upsides of in-core
logical replication, along with the great flexibility that Slony
currently offers (and that Afilias undoubtedly depend upon today).

Another way of putting this is that Postgres should go first because
we will get huge benefits even if only one of the trigger-based
logical replication systems adopts the technology. Though I hope and
expect that the Slony guys will be able to work with what we're doing,
surely a logical replication system with all the benefits implied by
being logical, but with only some subset of Slony's functionality
is still going to be of great benefit.

My view is that the only reasonable approach is to build something
solid, well-integrated and generic, in core. I'd certainly like to
hear what the Slony guys have to say here, though.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#70Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#69)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Clearly, core needs to go first. However, before we commit, I would
like to hear the Slony guys say something like this: We read the
documentation that is part of this patch and if the feature behaves as
advertised, we believe we will be able to use it in place of the
change-capture mechanism that we have now, and that it will be at
least as good as what we have now if not a whole lot better.

If they say something like "I'm not sure we have the right design for
this" or "this wouldn't be sufficient to replace this portion of what
we have now because it lacks critical feature X", I would be very
concerned about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#71Andres Freund
andres@2ndquadrant.com
In reply to: Peter Geoghegan (#69)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Well, I don't think (hope) anybody proposed making something release-worthy for
Slony, but rather a POC patch that proves the API is generic enough to be used
by them. If I (or somebody else familiar with this) work together with somebody
familiar with Slony internals, I think such a POC shouldn't be too hard to
do.
I think some more input from that side is a good idea. I plan to send out an
email to possibly interested parties in about two weeks...

Regards,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#70)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Clearly, core needs to go first. However, before we commit, I would
like to hear the Slony guys say something like this: We read the
documentation that is part of this patch and if the feature behaves as
advertised, we believe we will be able to use it in place of the
change-capture mechanism that we have now, and that it will be at
least as good as what we have now if not a whole lot better.

If they say something like "I'm not sure we have the right design for
this" or "this wouldn't be sufficient to replace this portion of what
we have now because it lacks critical feature X", I would be very
concerned about that.

The other point here is that core code without any implemented use-cases
is unlikely to be worth a tinker's damn. Regardless of what time-frame
the Slony guys are able to work on, I think we need to see working code
(of at least prototype quality) before we believe that we've got it
right. Or if not code from them, code from some other replication
project.

A possibly-useful comparison is to the FDW APIs we've been slowly
implementing over the past couple releases. Even though we *did* have
some use-cases written right off the bat, we got it wrong and had to
change it in 9.2, and I wouldn't bet against having to change it again
in 9.3 (even without considering the need for extensions for non-SELECT
operations). And, despite our very clear warnings that all that stuff
was in flux, people have been griping because the APIs changed.

So if we ship core hooks for logical replication in advance of proof
that they're actually usable by at least one (preferably more than one)
replication project, I confidently predict that they'll be wrong and
will need revision and the potential users will complain about the
API instability.

regards, tom lane

#73Christopher Browne
cbbrowne@gmail.com
In reply to: Peter Geoghegan (#69)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future? That would be (with respect to the Slony
guys) a commercial risk that is fairly heavily concentrated with
Afilias.

Yep, there's something a bit too circular there.

I'd also not be keen on reimplementing the "Slony integration" over
and over if it turns out that the API churns for a while before
stabilizing. That shouldn't be misread as "I expect horrible amounts
of churn", just that *any* churn comes at a cost. And if anything
unfortunate happens, that can easily multiply into a multiplicity of
painfulness(es?).
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#74Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#72)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 10:03:40 PM Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com>

wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Clearly, core needs to go first. However, before we commit, I would
like to hear the Slony guys say something like this: We read the
documentation that is part of this patch and if the feature behaves as
advertised, we believe we will be able to use it in place of the
change-capture mechanism that we have now, and that it will be at
least as good as what we have now if not a whole lot better.

If they say something like "I'm not sure we have the right design for
this" or "this wouldn't be sufficient to replace this portion of what
we have now because it lacks critical feature X", I would be very
concerned about that.

The other point here is that core code without any implemented use-cases
is unlikely to be worth a tinker's damn. Regardless of what time-frame
the Slony guys are able to work on, I think we need to see working code
(of at least prototype quality) before we believe that we've got it
right. Or if not code from them, code from some other replication
project.

FWIW, we (as in 2ndQuadrant), unsurprisingly, have a user of this which is in
development at the moment.

A possibly-useful comparison is to the FDW APIs we've been slowly
implementing over the past couple releases. Even though we *did* have
some use-cases written right off the bat, we got it wrong and had to
change it in 9.2, and I wouldn't bet against having to change it again
in 9.3 (even without considering the need for extensions for non-SELECT
operations). And, despite our very clear warnings that all that stuff
was in flux, people have been griping because the APIs changed.

On the other hand, I don't think we would have FDWs today at all if it hadn't
been done that way. So I really cannot see that as an argument against working
incrementally.
Obviously that's not an argument for not trying to get the API correct right off
the bat. I seriously hope the user-level API continues to be simpler than what
FDWs need.

Regards,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#75Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#72)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 15 October 2012 21:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Clearly, core needs to go first. However, before we commit, I would
like to hear the Slony guys say something like this: We read the
documentation that is part of this patch and if the feature behaves as
advertised, we believe we will be able to use it in place of the
change-capture mechanism that we have now, and that it will be at
least as good as what we have now if not a whole lot better.

If they say something like "I'm not sure we have the right design for
this" or "this wouldn't be sufficient to replace this portion of what
we have now because it lacks critical feature X", I would be very
concerned about that.

The other point here is that core code without any implemented use-cases
is unlikely to be worth a tinker's damn. Regardless of what time-frame
the Slony guys are able to work on, I think we need to see working code
(of at least prototype quality) before we believe that we've got it
right. Or if not code from them, code from some other replication
project.

A possibly-useful comparison is to the FDW APIs we've been slowly
implementing over the past couple releases. Even though we *did* have
some use-cases written right off the bat, we got it wrong and had to
change it in 9.2, and I wouldn't bet against having to change it again
in 9.3 (even without considering the need for extensions for non-SELECT
operations). And, despite our very clear warnings that all that stuff
was in flux, people have been griping because the APIs changed.

So if we ship core hooks for logical replication in advance of proof
that they're actually usable by at least one (preferably more than one)
replication project, I confidently predict that they'll be wrong and
will need revision and the potential users will complain about the
API instability.

The prototype we showed at PgCon illustrated a working system, so
we're on the right track.

We've split that in two now, specifically to allow other projects to
use what is being built. The exact API of that split is for discussion
and has been massively redesigned on community advice for the sole
purpose of including other approaches. We can't guarantee that
external open source or commercial vendors will use the API. But we
can say that in-core use cases exist for multiple approaches. We
shouldn't put the decision on that in the hands of others.

Jan spoke at length at PgCon, for all to hear, that what we are
building is a much better way than the trigger logging approach Slony
uses. I don't take that as carte blanche for approval of everything
being done, but it's going in the right direction with an open heart,
which is about as good as it gets.

There will be a working system again soon, once we have re-built
things around the new API. The longer it takes to get a stable API, the
longer we take to rebuild things around it again.

The goal of the project is to release everything open source, PGDG
copyrighted and TPL licenced and to submit to core. We are signed up
to that in various ways, not least of all our word given publicly.
Please give this your backing, so an open source outcome can be
possible.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#76Andres Freund
andres@2ndquadrant.com
In reply to: Christopher Browne (#73)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com>

wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future? That would be (with respect to the Slony
guys) a commercial risk that is fairly heavily concentrated with
Afilias.

Yep, there's something a bit too circular there.

I'd also not be keen on reimplementing the "Slony integration" over
and over if it turns out that the API churns for a while before
stabilizing. That shouldn't be misread as "I expect horrible amounts
of churn", just that *any* churn comes at a cost. And if anything
unfortunate happens, that can easily multiply into a multiplicity of
painfulness(es?).

Well, as a crosscheck, could you list your requirements?

Do you need anything more than outputting data in a format compatible with
what's stored in sl_log_*? You wouldn't have sl_actionseq; everything else
should be there (well, you would need to do lookups to get the tableid, but
that's not really much of a problem). The results would be ordered in complete
transactions, in commit order.

I guess the other tables would stay as they are, as they contain the "added
value" of Slony?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#77Christopher Browne
cbbrowne@gmail.com
In reply to: Andres Freund (#76)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com>

wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future? That would be (with respect to the Slony
guys) a commercial risk that is fairly heavily concentrated with
Afilias.

Yep, there's something a bit too circular there.

I'd also not be keen on reimplementing the "Slony integration" over
and over if it turns out that the API churns for a while before
stabilizing. That shouldn't be misread as "I expect horrible amounts
of churn", just that *any* churn comes at a cost. And if anything
unfortunate happens, that can easily multiply into a multiplicity of
painfulness(es?).

Well, as a crosscheck, could you list your requirements?

Do you need anything more than outputting data in a format compatible with
what's stored in sl_log_*? You wouldn't have sl_actionseq; everything else
should be there (well, you would need to do lookups to get the tableid, but
that's not really much of a problem). The results would be ordered in complete
transactions, in commit order.

Hmm. We need to have log data that's in a compatible ordering.

We use sl_actionseq, and can mix data from multiple transactions
together; if what you're providing is, instead, in order based on
transaction commit order followed by some sequencing within each
transaction, then that should be acceptable.

The stylized query on sl_log_* looks like...

select log_origin, log_txid, log_tableid,
log_actionseq, log_tablenspname,
log_tablerelname, log_cmdtype,
log_cmdupdncols, log_cmdargs
from %s.sl_log_%d
where log_origin = %d

How about I "quibble" about each of these:

a) log_origin - this indicates the node from which the data
originates. Presumably, this is implicit in a "chunk" of data that is
coming in.

b) log_txid - indicating the transaction ID. I presume you've got
this available. It's less important with the WAL-based scheme in that
we'd probably not be using it as a basis for querying as is the case
today with Slony.

c) log_tableid - indicating the ID of the table. Are you capturing an
OID equivalent to this? Or what?

d) log_actionseq - indicating relative sequences of updates. You
don't have this, but if you're capturing commit ordering, we don't
need it.

e) log_tablenspname, log_tablerelname - some small amount of magic
needful to get this. Or perhaps you are already capturing it?

f) log_cmdtype - I/U/D/T - indicating the action
(insert/update/delete/truncate). Hopefully you have something like
this?

g) log_cmdupdncols - for UPDATE action, the number of updated columns.
Probably not mandatory; this was a new 2.1 thing...

h) log_cmdargs - the actual data needed to do the I/U/D. The form of
this matters a fair bit. Before Slony 2.1, this was a portion of a
SQL statement, omitting the operation (provided in log_cmdtype) and
the table name (in log_tablerelname et al). In Slony 2.1, this
changes to be a text[] array that essentially consists of pairs of
[column name, column value] values.

I see one place, very notable in Slony 2.2, that would also be more
complicated, which is the handling of DDL.

In 2.1 and earlier, we handled DDL as "events", essentially out of
band. This isn't actually correct; it could mix very badly if you had
replication activity mixing with DDL requests. (More detail than you
want is in a bug on this...
<http://www.slony.info/bugzilla/show_bug.cgi?id=137>).

In Slony 2.2, we added a third "log table" where DDL gets captured.
sl_log_script has much the same schema as sl_log_{1,2}; it needs to
get "mixed in" in compatible order. What I imagine would pointedly
complicate life is if a single transaction contained both DDL and
"regular replicable activity." Slony 2.2 mixes this in using XID +
log_actionseq; how this would play out with your log capture mechanism
isn't completely clear to me. That's the place where I'd expect the
very messiest interaction.

I guess the other tables would stay as they are, as they contain the "added
value" of Slony?

A fair bit of Slony is about the "event infrastructure," and some of
that ceases to be as needful. The configuration bits probably
continue to remain interesting.

The parts that seem notably mysterious to me at the moment are:

a) How do we pull result sets (e.g. - sl_log_* data)?

b) How is the command data represented?

c) If we have a need to mix together your 'raw logs' and other
material (e.g. - our sl_log_script that captures DDL-like changes to
be mixed back in), how easy|impossible is this?
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#78Andres Freund
andres@2ndquadrant.com
In reply to: Christopher Browne (#77)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Tuesday, October 16, 2012 12:13:14 AM Christopher Browne wrote:

On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote:

On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com>

wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is
unlikely any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future? That would be (with respect to the Slony
guys) a commercial risk that is fairly heavily concentrated with
Afilias.

Yep, there's something a bit too circular there.

I'd also not be keen on reimplementing the "Slony integration" over
and over if it turns out that the API churns for a while before
stabilizing. That shouldn't be misread as "I expect horrible amounts
of churn", just that *any* churn comes at a cost. And if anything
unfortunate happens, that can easily multiply into a multiplicity of
painfulness(es?).

Well, as a crosscheck, could you list your requirements?

Do you need anything more than outputting data in a format compatible with
what's stored in sl_log_*? You wouldn't have sl_actionseq; everything
else should be there (well, you would need to do lookups to get the
tableid, but that's not really much of a problem). The results would be
ordered in complete transactions, in commit order.

Hmm. We need to have log data that's in a compatible ordering.

We use sl_actionseq, and can mix data from multiple transactions
together; if what you're providing is, instead, in order based on
transaction commit order followed by some sequencing within each
transaction, then that should be acceptable.

Inside the transaction it's sequenced by the order in which the XLogInsert
calls were made, which is the order the client sent the commands. That sounds
like it should be compatible.

The stylized query on sl_log_* looks like...

select log_origin, log_txid, log_tableid,
log_actionseq, log_tablenspname,
log_tablerelname, log_cmdtype,
log_cmdupdncols, log_cmdargs
from %s.sl_log_%d
where log_origin = %d

How about I "quibble" about each of these:

a) log_origin - this indicates the node from which the data
originates. Presumably, this is implicit in a "chunk" of data that is
coming in.

I think we can just reuse whatever method you use in Slony to get the current
node's ID in the output plugin.

b) log_txid - indicating the transaction ID. I presume you've got
this available. It's less important with the WAL-based scheme in that
we'd probably not be using it as a basis for querying as is the case
today with Slony.

It's directly available. The plugin will have to call txid_out, but that's
obviously no problem.

c) log_tableid - indicating the ID of the table. Are you capturing an
OID equivalent to this? Or what?

You get the TupleDesc of the table.

d) log_actionseq - indicating relative sequences of updates. You
don't have this, but if you're capturing commit ordering, we don't
need it.

Good.

e) log_tablenspname, log_tablerelname - some small amount of magic
needful to get this. Or perhaps you are already capturing it?

The relevant backend functions are available, so it's no problem
(RelationGetNamespace(change->new_tuple->table)).

f) log_cmdtype - I/U/D/T - indicating the action
(insert/update/delete/truncate). Hopefully you have something like
this?

Yes:
enum ApplyCacheChangeType
{
APPLY_CACHE_CHANGE_INSERT,
APPLY_CACHE_CHANGE_UPDATE,
APPLY_CACHE_CHANGE_DELETE,
..
}
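
Purely for illustration, a hypothetical mapping from those change types onto
Slony's log_cmdtype letters could then be a trivial switch. The enum is
re-declared here so the fragment stands alone, and a truncate case presumably
hides behind the '..' above:

enum ApplyCacheChangeType
{
    APPLY_CACHE_CHANGE_INSERT,
    APPLY_CACHE_CHANGE_UPDATE,
    APPLY_CACHE_CHANGE_DELETE
};

/* map an apply-cache change type to Slony's I/U/D command letter */
static char
log_cmdtype_for(enum ApplyCacheChangeType type)
{
    switch (type)
    {
        case APPLY_CACHE_CHANGE_INSERT:
            return 'I';
        case APPLY_CACHE_CHANGE_UPDATE:
            return 'U';
        case APPLY_CACHE_CHANGE_DELETE:
            return 'D';
    }
    return '?';
}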

g) log_cmdupdncols - for UPDATE action, the number of updated columns.
Probably not mandatory; this was a new 2.1 thing...

Hm, no, we don't have that. But then, you can just use normal C code there, so
it shouldn't be too hard to compute if needed.

h) log_cmdargs - the actual data needed to do the I/U/D. The form of
this matters a fair bit. Before Slony 2.1, this was a portion of a
SQL statement, omitting the operation (provided in log_cmdtype) and
the table name (in log_tablerelname et al). In Slony 2.1, this
changes to be a text[] array that essentially consists of pairs of
[column name, column value] values.

The existing C code to generate this should be copy & pasteable into this with
a relatively small number of changes.
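
As a rough illustration of the target format only (not the actual Slony code),
a stand-alone helper producing such name/value pairs might look like this;
quoting and escaping of values is deliberately elided:

#include <stdio.h>

/* print column name/value pairs in the Slony 2.1 log_cmdargs style,
   e.g. {"id","7","somedata","3"}; escaping is intentionally omitted */
static void
emit_cmdargs(const char **names, const char **values, int ncols)
{
    int i;

    putchar('{');
    for (i = 0; i < ncols; i++)
        printf("%s\"%s\",\"%s\"", i > 0 ? "," : "", names[i], values[i]);
    puts("}");
}

int main(void)
{
    const char *names[] = {"id", "somedata"};
    const char *values[] = {"7", "3"};

    emit_cmdargs(names, values, 2);
    return 0;
}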

I see one place, very notable in Slony 2.2, that would also be more
complicated, which is the handling of DDL.

In 2.1 and earlier, we handled DDL as "events", essentially out of
band. This isn't actually correct; it could mix very badly if you had
replication activity mixing with DDL requests. (More detail than you
want is in a bug on this...
<http://www.slony.info/bugzilla/show_bug.cgi?id=137>).

In Slony 2.2, we added a third "log table" where DDL gets captured.
sl_log_script has much the same schema as sl_log_{1,2}; it needs to
get "mixed in" in compatible order. What I imagine would pointedly
complicate life is if a single transaction contained both DDL and
"regular replicable activity." Slony 2.2 mixes this in using XID +
log_actionseq; how this would play out with your log capture mechanism
isn't completely clear to me. That's the place where I'd expect the
very messiest interaction.

Hm. I expect some complications here as well. But then, unless you do something
special, changes to those tables (e.g. sl_log_script) will be streamed out as
well, so you could just insert events into their respective tables on the
receiving side and deal with them there.

I guess the other tables would stay as they are, as they contain the
"added value" of Slony?

A fair bit of Slony is about the "event infrastructure," and some of
that ceases to be as needful. The configuration bits probably
continue to remain interesting.

Quite a bit of the event infrastructure seems to deal with configuration
changes and such; all of that is probably going to stay...

The parts that seem notably mysterious to me at the moment are:

a) How do we pull result sets (e.g. - sl_log_* data)?

The details of this are in a bit of flux as of now, but I hope we will nail
this down soon. You open a replication connection to the primary
('replication=1 dbname=...') and issue

START_LOGICAL_REPLICATION slony $slot_id 0/DEADBEEF

with 0/DEADBEEF being the location where you stopped getting changes the last
time. That will start streaming out changes via the COPY protocol. The contents
of what's streamed out are entirely up to you.

The first time you start replication you need to do:

INIT_LOGICAL_REPLICATION

which will return a $slot_id, a SET TRANSACTION SNAPSHOT compatible snapshot
and the initial starting LSN.

The 'slony' in START_LOGICAL_REPLICATION above denotes which output plugin is
to be used.
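
Assuming the protocol works as sketched above, a consumer using libpq might
look roughly like the following. The slot id and start LSN are placeholders
for the values INIT_LOGICAL_REPLICATION returned, and I'm not sure whether the
prototype answers with COPY OUT or COPY BOTH, so the sketch accepts either:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn     *conn;
    PGresult   *res;
    char       *buf;
    int         len;

    conn = PQconnectdb("replication=1 dbname=mydb");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* slot id (1) and LSN are placeholders from INIT_LOGICAL_REPLICATION */
    res = PQexec(conn, "START_LOGICAL_REPLICATION slony 1 0/DEADBEEF");
    if (PQresultStatus(res) != PGRES_COPY_OUT &&
        PQresultStatus(res) != PGRES_COPY_BOTH)
    {
        fprintf(stderr, "could not start streaming: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    /* each row is one change, in whatever format the output plugin chose */
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        fwrite(buf, 1, len, stdout);
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}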

b) How is the command data represented?

Command data is the old/new tuple? That's up to the output plugin. You get a
HeapTuple with the old/new tuple, and compatible TupleDescs. You could simply
stream out your current format, for example.

c) If we have a need to mix together your 'raw logs' and other
material (e.g. - our sl_log_script that captures DDL-like changes to
be mixed back in), how easy|impossible is this?

As described above, in general that seems easy enough. Just insert data into
e.g. sl_log_script and, when you receive the changes on the other side, decide
into which table to redirect those.

Where I see a bit of a problem is the handling of replication sets,
configuration and similar.

Currently there is a dichotomy between 'catalog tables' and 'data tables'. The
former are not replicated but can be queried in an output plugin (thats the
timetravel part). The latter are replicated but cannot be queried. All system
catalog tables are in the 'catalog' category by their nature, but I have played
with a system column that allows other tables to be treated as catalog tables
as well.

If you want to filter data on the source - which probably makes sense - you
currently would need to have such an additional catalog table, which is not
replicated but can be queried by the output plugin. But I guess the contents of
that table would also need to be replicated...

I wonder if we should replicate changes to such user-defined catalog tables
as well; that should be relatively easy, and if it's not wanted the output
plugin can easily filter that out (if (class_form->relusercat)).
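
Plugin-side filtering could then be a couple of lines; a sketch, with the
struct stubbed out since the experimental relusercat column is not part of the
real catalogs:

/* stand-in for the relevant part of Form_pg_class with the
   experimental column; not the real catalog struct */
typedef struct
{
    int relusercat;     /* table is a user-defined catalog table? */
} ClassFormSketch;

/* decide in the output plugin whether to forward a change */
static int
should_stream_change(const ClassFormSketch *class_form)
{
    if (class_form->relusercat)
        return 0;       /* queryable by the plugin, but filtered out */
    return 1;
}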

Does that clear things up?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#79Tatsuo Ishii
ishii@postgresql.org
In reply to: Josh Berkus (#55)
Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

The design Andres and Simon have advanced already eliminates a lot of
the common failure cases (now(), random(), nextval()) suffered by pgPool
and similar tools. But remember, this feature doesn't have to be

Well, pgpool-II already solved the now() case by using a query rewriting
technique. The technique could be applied to random() as well, but I'm not
convinced it is worth the trouble. nextval() would be a little harder because
pgpool needs assistance from the PostgreSQL core.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#80Steve Singer
steve@ssinger.info
In reply to: Andres Freund (#76)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 12-10-15 04:51 PM, Andres Freund wrote:

Well, as a crosscheck, could you list your requirements?

Do you need anything more than outputting data in a format compatible with
what's stored in sl_log_*? You wouldn't have sl_actionseq; everything else
should be there (well, you would need to do lookups to get the tableid, but
that's not really much of a problem). The results would be ordered in complete
transactions, in commit order.

I guess the other tables would stay as they are, as they contain the "added
value" of Slony?

Greetings,

I actually had spent some time a few weeks ago looking over the
documents and code. I never did get around to writing a review as
elegant as Peter's. I have not seen any red flags that make me think
that what you're proposing wouldn't be suitable for Slony, but sometimes
you don't see details until you start implementing something.

My initial approach to modifying slony to work with this might be
something like:

* Leave sl_event as-is for non-SYNC events; slon would still generate
SYNC events in sl_event.
* We would modify the remote_worker thread in slon so that instead of
selecting from sl_event it would get the next 'committed'
transaction from your apply cache. For each ApplyChange record we
would check to see if it is an insert into sl_event; if so, we would
trigger our existing event processing logic based on the contents of the
ev_type column (a rough sketch of this loop follows below the list).
* If the change involves a insert/update/delete/truncate to a replicated
table we would translate that change into SQL and apply it on the
replica, we would not commit changes on the replica until we encounter
a SYNC being added to sl_event for the current origin.
* SQL will be applied in a slightly different order than Slony applies it
today. Today, if two concurrent transactions insert into the same
replicated table and commit one after the other, there is a good
chance that the apply order on the replica will also be intermixed
(assuming both commits fell between two SYNC events). My thinking is
that we would just replay them one after the other on the replica in
commit order. (Slony doesn't use commit order because we don't have it,
not because we don't like it.) This would mean we do away with tracking
the action id.

* If a node is configured as a 'forwarder' node, it would store the
processed output of each ApplyChange record in a table on the replica.
If a slon is pulling data from a non-origin (i.e. if remoteWorkerThread_1
is pulling data from node 2), then it would need to query this table
instead of calling the functions that process the ApplyCache contents.

* To subscribe a node we would generate a SYNC event on the provider and
do the copy_set. We would keep track of that SYNC event. The remote
worker would then ignore any data that comes before that SYNC event
when it starts pulling data from the apply cache.
* DDL events in 2.2+ go into sl_ddl_script (or something like that); when
we see INSERT commands to that table we would know to then apply the DDL
on the node.

* We would need to continue to populate sl_confirm, because knowing which
SYNC events have already been processed by a node is pretty important in
a MOVE SET or FAILOVER. It is possible that we might still need to
track the xip lists of each SYNC for MOVE SET/FAILOVER, but I'm not sure
either way.

This is all easier said than implemented.
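
A rough sketch of that modified remote_worker loop, just to show the control
flow (every helper name below is hypothetical):

for (;;)
{
    /* next decoded change, delivered in commit order */
    ApplyChange *change = get_next_committed_change(provider);

    if (is_insert_into(change, "sl_event"))
        process_event(change);            /* existing ev_type-based logic */
    else if (is_replicated_table(change))
        apply_as_sql(replica, change);    /* translate the change into SQL */

    /* only commit on the replica once a SYNC for this origin appears */
    if (is_sync_for_origin(change, current_origin))
        commit_replica_txn(replica);
}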

Steve


#81Andres Freund
andres@2ndquadrant.com
In reply to: Peter Geoghegan (#35)
1 attachment(s)
First draft of snapshot building design document

Hi All,

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail.

Here's the first version of the promised document. I hope it answers most of
the questions.

Input welcome!

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

README.SNAPBUILD.txt (text/plain; charset=UTF-8)
#82Jan Wieck
JanWieck@Yahoo.com
In reply to: Simon Riggs (#75)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/15/2012 4:43 PM, Simon Riggs wrote:

Jan spoke at length at PgCon, for all to hear, that what we are
building is a much better way than the trigger logging approach Slony
uses. I don't take that as carte blanche for approval of everything
being done, but its going in the right direction with an open heart,
which is about as good as it gets.

The mechanism you are building for capturing changes is certainly a lot
better than what Bucardo, Londiste and Slony are doing today. That much
is true.

The flip side of the coin however is that all of today's logical
replication systems are designed Postgres version agnostic to a degree.
This means that the transition time from the existing, trigger based
approach to the new WAL based mechanism will see both technologies in
parallel, which is no small thing to support. And that transition time
may last for a good while. We still have people installing Slony 1.2
because 2.0 (3 years old by now) requires Postgres 8.3 minimum.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#83Jan Wieck
JanWieck@Yahoo.com
In reply to: Andres Freund (#71)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 10/15/2012 3:25 PM, Andres Freund wrote:

On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote:

On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:

I think Robert is right that if Slony can't use the API, it is unlikely
any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?

Well. I don't think (hope) anybody proposed making something release-worthy for
Slony, but rather a POC patch that proves the API is generic enough to be used
by them. If I (or somebody else familiar with this) work together with somebody
familiar with Slony internals, I think such a POC shouldn't be too hard to
do.
I think some more input from that side is a good idea. I plan to send out an
email to possibly interested parties in about two weeks...

What Slony essentially sends to the receiver node is a COPY stream in
the format Christopher described. That stream is directly copied into
the receiving node's sl_log_N table and picked up there by a BEFORE
INSERT apply trigger, which performs the corresponding
INSERT/UPDATE/DELETE operation via prepared plans on the user tables.

For a POC I think it is sufficient to demonstrate that this copy stream
can be generated out of the WAL decoding.

Note that Slony today does not touch columns in an UPDATE that have not
changed in the original UPDATE on the origin. Sending toasted column
values that haven't changed would be a substantial change to the
storage efficiency on the replica. The consequence of this is that the
number of columns that need to be in the UPDATE's SET clause varies.
log_cmdupdncols is there to separate the new column/value pairs from the
column/key pairs of the updated row. The old row "key" in Slony is based
on a unique index (preferably a PK, but any unique key will do). That
makes cmdupdncols simply the number of column/value pairs minus the
number of key columns (for example, an UPDATE setting two changed columns
against a two-column key ships four pairs with cmdupdncols = 2). So it
isn't too hard to figure out.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#84Peter Geoghegan
peter@2ndquadrant.com
In reply to: Jan Wieck (#82)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote:

This means that the transition time from the existing, trigger based
approach to the new WAL based mechanism will see both technologies in
parallel, which is no small thing to support.

So, you're talking about a shim between the two in order to usefully
support inter-version replication, or are you just thinking about
making a clean break in compatibility for Postgres versions prior to
9.3 in a new release branch?

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#85Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#81)
Re: First draft of snapshot building design document

On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail.

Here's the first version of the promised document. I hope it answers most of
the questions.

Input welcome!

I haven't grokked all of this in its entirety, but I'm kind of
uncomfortable with the relfilenode -> OID mapping stuff. I'm
wondering if we should, when logical replication is enabled, find a
way to cram the table OID into the XLOG record. It seems like that
would simplify things.

If we don't choose to do that, it's worth noting that you actually
need 16 bytes of data to generate a unique identifier for a relation,
as in database OID + tablespace OID + relfilenode# + backend ID.
Backend ID might be ignorable because WAL-based logical replication is
going to ignore temporary relations anyway, but you definitely need
the other two. There's nothing, for example, to keep you from having
two relations with the same value in pg_class.relfilenode in the same
database but in different tablespaces. It's unlikely to happen,
because for new relations we set OID = relfilenode, but a subsequent
rewrite can bring it about if the stars align just right. (Such
situations are, of course, a breeding ground for bugs, which might
make you question whether our current scheme for assigning
relfilenodes has much of anything to recommend it.)
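
For reference, this is the shape core already gives that identifier (see
src/include/storage/relfilenode.h); three OIDs plus a BackendId is where the
16 bytes come from:

typedef struct RelFileNode
{
    Oid         spcNode;    /* tablespace */
    Oid         dbNode;     /* database */
    Oid         relNode;    /* relation (pg_class.relfilenode) */
} RelFileNode;

typedef struct RelFileNodeBackend
{
    RelFileNode node;
    BackendId   backend;    /* InvalidBackendId unless a temp relation */
} RelFileNodeBackend;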

Another thing to think about is that, like catalog snapshots,
relfilenode mappings have to be time-relativized; that is, you need to
know what the mapping was at the proper point in the WAL sequence, not
what it is now. In practice, the risk here seems to be minimal,
because it takes a while to churn through 4 billion OIDs. However, I
suspect it pays to think about this fairly carefully because if we do
ever run into a situation where the OID counter wraps during a time
period comparable to the replication lag, the bugs will be extremely
difficult to debug.

Anyhow, adding the table OID to the WAL header would chew up a few
more bytes of WAL space, but it seems like it might be worth it to
avoid having to think very hard about all of these issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#86Christopher Browne
cbbrowne@gmail.com
In reply to: Peter Geoghegan (#84)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On Thu, Oct 18, 2012 at 9:49 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote:

This means that the transition time from the existing, trigger based
approach to the new WAL based mechanism will see both technologies in
parallel, which is no small thing to support.

So, you're talking about a shim between the two in order to usefully
support inter-version replication, or are you just thinking about
making a clean break in compatibility for Postgres versions prior to
9.3 in a new release branch?

It's too early to assume either.

In Slony 2.0, we accepted that we were breaking compatibility with
versions of Postgres before 8.3; we accepted that because there were
considerable 'manageability' benefits (e.g. - system catalogues no
longer hacked up, so pg_dump works against all nodes, and some
dramatically reduced locking).

But that had the attendant cost that we have had to continue fixing
bugs on 1.2, to a degree, even until now, because people on Postgres
versions earlier than 8.3 have no way to use version 2.0.

Those merits and demerits apply pretty clearly to this.

It would be somewhat attractive for a "version 2.3" (or, more likely, to
indicate the break from earlier versions, "3.0") to make the clean
break to the new-in-PG-9.3 facilities. It is attractive in that we
could:
a) Safely remove the trigger-based log capture apparatus (or, at
least, I'm assuming so), and
b) Consciously upgrade to take advantage of all the latest cool stuff
found in Postgres 9.3. (I haven't got any particular features in
mind; perhaps we add RANGE comparators for xid to 9.3, and make
extensive use of xid_range types? That would be something that
couldn't reasonably get hacked to work in anything before 9.2...)
c) Drop out any special cases having to do with support of versions
8.3, 8.4, 9.0, 9.1, and 9.2.

But, of course, we'd be leaving everyone running 8.3 thru 9.2 behind
if we did so, and would correspondingly shackle ourselves to supporting
the 2.x branches for still longer. And this would mean that
this Slony "3.0" would expressly NOT support one of our intended use
cases, namely upgrading from elder versions of Postgres.

A "shim" adds complexity, but retains the "upgrade across versions"
use case, and reduces the need to keep supporting elder versions of
Slony.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#87Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#85)
Re: First draft of snapshot building design document

On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote:

On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail.

Here's the first version of the promised document. I hope it answers most
of the questions.

Input welcome!

I haven't grokked all of this in its entirety, but I'm kind of
uncomfortable with the relfilenode -> OID mapping stuff. I'm
wondering if we should, when logical replication is enabled, find a
way to cram the table OID into the XLOG record. It seems like that
would simplify things.

If we don't choose to do that, it's worth noting that you actually
need 16 bytes of data to generate a unique identifier for a relation,
as in database OID + tablespace OID + relfilenode# + backend ID.
Backend ID might be ignorable because WAL-based logical replication is
going to ignore temporary relations anyway, but you definitely need
the other two. ...

Hm. I should take a look at the way temporary tables are represented. As you
say, it is not going to matter for WAL decoding, but still...

Another thing to think about is that, like catalog snapshots,
relfilenode mappings have to be time-relativized; that is, you need to
know what the mapping was at the proper point in the WAL sequence, not
what it is now. In practice, the risk here seems to be minimal,
because it takes a while to churn through 4 billion OIDs. However, I
suspect it pays to think about this fairly carefully because if we do
ever run into a situation where the OID counter wraps during a time
period comparable to the replication lag, the bugs will be extremely
difficult to debug.

I think with rollbacks + restarts we might even be able to see the same
relfilenode reused sooner than that.

Anyhow, adding the table OID to the WAL header would chew up a few
more bytes of WAL space, but it seems like it might be worth it to
avoid having to think very hard about all of these issues.

I don't think it's necessary to change WAL logging here. The relfilenode mapping
is now looked up under the timetravel snapshot we've built, using (spcNode,
relNode) as the key, so the time-relativized lookup is "builtin". If we screw
that up, far more is broken anyway.
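
Roughly like this, as a sketch (RELFILENODE is this patch's new syscache id,
not something in core; shared and mapped relations are resolved via the
relmapper before ever getting here):

HeapTuple
LookupTableByRelFileNode(RelFileNode *relfilenode)
{
    /* pg_class stores InvalidOid for the default tablespace */
    Oid spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
        InvalidOid : relfilenode->spcNode;

    /*
     * Looked up under the timetravel snapshot, so the answer is correct
     * as of the WAL location currently being decoded.
     */
    return SearchSysCache2(RELFILENODE,
                           ObjectIdGetDatum(spc),
                           ObjectIdGetDatum(relfilenode->relNode));
}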

Two problems are left:

1) (reltablespace, relfilenode) is not unique in pg_class, because InvalidOid is
stored for relfilenode if it's a shared or nailed table. That's not a problem
for the lookup, because we've already checked the relmapper before that, so we
never look those up anyway. But it violates the documented requirements of
syscache.c. Even after some looking I haven't found any problem that could
cause.

2) We need to decide whether a HEAP[1-2]_* record made catalog changes when
building/updating snapshots. Unfortunately we also need to do this *before* we
have built the first snapshot. For now, treating all tables as catalog-modifying
before we have built the snapshot seems to work fine.
I think encoding the oid in the xlog header wouldn't help all that much here,
because I am pretty sure we want the set of "catalog tables" to be
extensible at some point...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#88Peter Geoghegan
peter@2ndquadrant.com
In reply to: Christopher Browne (#86)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

On 18 October 2012 16:18, Christopher Browne <cbbrowne@gmail.com> wrote:

A "shim" adds complexity, but retains the "upgrade across versions"
use case, and reduces the need to keep supporting elder versions of
Slony.

Right. Upgrading across major versions is likely to continue to remain
a very important use-case for Slony.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#89Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#27)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

This patch doesn't seem to be going anywhere, sadly. Since we're a bit
late in the commitfest and this patch hasn't seen any activity for a
long time, I'll mark it as returned-with-feedback. I hope one or both
versions are resubmitted (with additional fixes?) for the next
commitfest, and that the discussion continues to determine which of the
two approaches is the best.

Thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#90Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#87)
Re: First draft of snapshot building design document

On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote:

On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail.

Here's the first version of the promised document. I hope it answers most
of the questions.

Input welcome!

I haven't grokked all of this in its entirety, but I'm kind of
uncomfortable with the relfilenode -> OID mapping stuff. I'm
wondering if we should, when logical replication is enabled, find a
way to cram the table OID into the XLOG record. It seems like that
would simplify things.

If we don't choose to do that, it's worth noting that you actually
need 16 bytes of data to generate a unique identifier for a relation,
as in database OID + tablespace OID + relfilenode# + backend ID.
Backend ID might be ignorable because WAL-based logical replication is
going to ignore temporary relations anyway, but you definitely need
the other two. ...

Hm. I should take a look at the way temporary tables are represented. As you
say, it is not going to matter for WAL decoding, but still...

Another thing to think about is that, like catalog snapshots,
relfilenode mappings have to be time-relativized; that is, you need to
know what the mapping was at the proper point in the WAL sequence, not
what it is now. In practice, the risk here seems to be minimal,
because it takes a while to churn through 4 billion OIDs. However, I
suspect it pays to think about this fairly carefully because if we do
ever run into a situation where the OID counter wraps during a time
period comparable to the replication lag, the bugs will be extremely
difficult to debug.

I think with rollbacks + restarts we might even be able to see the same
relfilenode reused sooner than that.

Anyhow, adding the table OID to the WAL header would chew up a few
more bytes of WAL space, but it seems like it might be worth it to
avoid having to think very hard about all of these issues.

I don't think it's necessary to change WAL logging here. The relfilenode mapping
is now looked up under the timetravel snapshot we've built, using (spcNode,
relNode) as the key, so the time-relativized lookup is "builtin". If we screw
that up, far more is broken anyway.

Two problems are left:

1) (reltablespace, relfilenode) is not unique in pg_class, because InvalidOid is
stored for relfilenode if it's a shared or nailed table. That's not a problem
for the lookup, because we've already checked the relmapper before that, so we
never look those up anyway. But it violates the documented requirements of
syscache.c. Even after some looking I haven't found any problem that could
cause.

2) We need to decide whether a HEAP[1-2]_* record made catalog changes when
building/updating snapshots. Unfortunately we also need to do this *before* we
have built the first snapshot. For now, treating all tables as catalog-modifying
before we have built the snapshot seems to work fine.
I think encoding the oid in the xlog header wouldn't help all that much here,
because I am pretty sure we want the set of "catalog tables" to be
extensible at some point...

I don't like catalog-only snapshots at all. I think that's just a
recipe for subtle or not-so-subtle breakage down the road...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#91Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#90)
Re: First draft of snapshot building design document

On Friday, October 19, 2012 06:38:30 PM Robert Haas wrote:

On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com>

wrote:

2) We need to decide whether a HEAP[1-2]_* record did catalog changes
when building/updating snapshots. Unfortunately we also need to do this
*before* we built the first snapshot. For now treating all tables as
catalog modifying before we built the snapshot seems to work fine.
I think encoding the oid in the xlog header wouln't help all that much
here, because I am pretty sure we want to have the set of "catalog
tables" to be extensible at some point...

I don't like catalog-only snapshots at all. I think that's just a
recipe for subtle or not-so-subtle breakage down the road...

I really don't see this changing for now. At some point we need to draw the
line, otherwise we will never get anywhere. Naively building a snapshot
that can access normal tables is just too expensive, because it changes all the
time.

Sure, obvious breakage will be there if you have a datatype that accesses
other user-tables during decoding (as we talked about previously). But subtle
breakage should be easily catchable by simply not allowing user relations to
be opened.
If an extension needs this, it will have to mark such tables as catalog tables
for now. I find this to be a reasonable restriction.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#92Peter Geoghegan
peter@2ndquadrant.com
In reply to: Andres Freund (#81)
Re: First draft of snapshot building design document

On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote:

Here's the first version of the promised document. I hope it answers most of
the questions.

This makes for interesting reading.

So, I've taken a closer look at the snapshot building code in light of
this information. What follows are my thoughts on that aspect of the
patch - in particular, the merits of snapshot time-travel as a method
of solving the general problem of making sense of WAL during what I'll
loosely call "logical recovery", as discussed in one design document [1].

You can think of this review as a critical exposition of what happens
during certain steps of my high-level schematic of "how this all fits
together", which covers this entire patchset (this comes under my
remit as the reviewer of one of the most important and complex patches
in that patchset,
"0008-Introduce-wal-decoding-via-catalog-timetravel.patch"), as
described in my original, high-level review [2].

To recap on what those steps are:

***SNIP***
(from within plugin, currently about to decode a record for the first time)
|
\ /
9. During the first call (within the first record within a call
to decode_xlog()), we allocate a snapshot reader.
|
\ /
10. Builds snapshot callback. This scribbles on our snapshot
state, which essentially encapsulates a snapshot.
The state (and snapshot) changes continually, once per call.
|
\ /
11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at
an XLogRecord. Decodes based on record type.
Let's assume it's an XLOG_HEAP_INSERT.
|
\ /
12. DecodeInsert() called. This in turn calls
DecodeXLogTuple(). We store the tuple metadata in our
ApplyCache. (some ilists, somewhere, each corresponding
to an XID). We don't store the relation oid, because we
don't know it yet (only relfilenode is known from WAL).
/
/
\ /
13. We're back in XLogReader(). It calls the only callback of interest to
us covered in step 3 (and not of interest to
XlogReader()/Heikki) – decode_change(). It does this through the
apply_cache.apply_change hook. This happens because we
encounter another record, this time a
commit record (in the same codepath as discussed in step 12).
|
\ /
14. In decode_change(), the actual function that raises the
interesting WARNINGs within Andres'
earlier example [3], showing actual integer/varchar Datum values
for tuples previously inserted.
Resolve table oid based on relfilenode (albeit unsatisfactorily).
Using a StringInfo, tupledescs, syscache and typcache, build
WARNING string.

(No further steps. Aside: If I've done a good job of writing my
initial review [2], I should be able to continually refer back to this
as I drill down on other steps in later reviews.)

It's fairly obvious why steps 9 and 10 are interesting to us here.
Step 11 (the first call of SnapBuildCallback() - this is a bit of a
misnomer, since the function isn't ever called through a pointer) is
where the visibility-wise decision to decode a particular
XLogRecord/XLogRecordBuffer occurs.

Here is how the patch describes our reasons for needing this call
(curiously, this comment appears not above SnapBuildCallback() itself,
but above the decode.c call of said function, which may hint at a
lapse in separation of concerns):

+ 	/*---------
+ 	 * Call the snapshot builder. It needs to be called before we analyze
+ 	 * tuples for two reasons:
+ 	 *
+ 	 * * Only in the snapshot building logic we know whether we have enough
+ 	 *   information to decode a particular tuple
+ 	 *
+ 	 * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+ 	 *   added to the ApplyCache before we add tuples using them
+ 	 *---------
+ 	 */

Step 12 is a step that I'm not going into for this review. That's all
decoding related. Step 13 is not really worthy of separate
consideration here, because it just covers what happens when we call
DecodeCommit() within decode.c , where text representations of tuples
are ultimately printed in simple elog WARNINGs, as in Andres' original
example [3], due to the apply_cache.apply_change hook being called.

Step 14 *is* kind-of relevant, because this is one place where we
resolve relation OID based on relfilenode, which is part of
snapbuild.c, since it has a LookupTableByRelFileNode() call (the only
other such call is within snapbuild.c itself, as part of checking if a
particular relation + xid have had catalogue changes, which can be a
concern due to DDL, which is the basic problem that snapbuild.c seeks
to solve). Assuming that it truly is necessary to have a
LookupTableByRelFileNode() function, I don't think your plugin (which
will soon actually be a contrib module, right?) has any business
calling it. Rather, this should all be part of some higher-level
abstraction, probably within applycache, that spoonfeeds your example
contrib module changesets without it having to care about system cache
invalidation and what-not.

As I've already noted, snapbuild.c (plus one or two other isolated
places) have rather heavy-handed and out of place
InvalidateSystemCaches() calls like these:

+ HeapTuple
+ LookupTableByRelFileNode(RelFileNode *relfilenode)
+ {
+ 	Oid spc;
+
+ 	InvalidateSystemCaches();

However, since you've privately told me that your next revision will
address the need to execute sinval messages granularly, rather than
using this heavy-handed kludge, I won't once again stress the need to
do better. If I've understood correctly, you've indicated that this
can be done by processing the shared invalidation messages in each
xl_xact_commit (i.e. each XLOG_XACT_COMMIT entry) during decoding
(which I guess means that decoding's responsibilities have been widened
a bit, but that's still the place to do it - within the decoding
"dispatcher"/glue code). Apparently we can expect this revision in the
next couple of days. Thankfully, I *think* that this implies that
plugins don't need to directly concern themselves with cache
invalidation.
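
If it ends up working that way, the decoding side presumably boils down to
replaying each commit record's queued messages locally; a minimal sketch
using the existing LocalExecuteInvalidationMessage() from inval.c (the
surrounding plumbing is assumed):

/* replay the sinval messages carried by a decoded commit record */
static void
execute_commit_invalidations(SharedInvalidationMessage *msgs, int nmsgs)
{
    int     i;

    for (i = 0; i < nmsgs; i++)
        LocalExecuteInvalidationMessage(&msgs[i]);
}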

By the way, shouldn't this use InvalidOid?:

+ 	/*
+ 	 * relations in the default tablespace are stored with a reltablespace = 0
+ 	 * for some reason.
+ 	 */
+ 	spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+ 		0 : relfilenode->spcNode;

The objectionable thing about having your “plugin” handle cache
invalidation, apart from the fact that, as we all agree, the way
you're doing it is just not acceptable, is that your plugin is doing
it directly *at all*. You need to analyse the requirements of the
universe of possible logical changeset subscriber plugins, and whittle
them down to a lowest common denominator interface that likely
generalises cache invalidation, and ideally doesn't require plugin
authors to even know what a relfilenode or syscache is – some might
say this shouldn't be a priority, but I take the view that it should.
Exposing innards like this to these plugins is wrongheaded, and to do
so will likely paint us into a corner. Now, I grant that catcache +
relcache can only be considered private innards in a relative sense
with Postgres, and indeed a few contrib modules do use syscache
directly for simple stuff, like hstore, which needs syscache +
typcache for generating text representations in a way that is not
completely unlike what you have here. Still, my feeling is that all
the heavy lifting should be encapsulated elsewhere, in core. I think
that you could easily justify adding another module/translation unit
that is tasked with massaging this data in a form amenable to serving
the needs of client code/plugins. If you don't get things quite right,
plugin authors can still do it all for themselves just as well.

I previously complained about having to take a leap of faith in
respect of xmin horizon handling [2]. This new document also tries to
allay some of those concerns. It says:

== xmin Horizon Handling ==

Reusing MVCC for timetravel access has one obvious major problem:
VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
obviously, we want autovacuum/manual vacuum to work as before.

The idea here is to reuse the infrastructure built for hot_standby_feedback,
which allows us to keep the xmin horizon of a walsender backend artificially
low. We keep it low enough that we can restart decoding from the last location
the client has confirmed to be safely received. That means we keep it low
enough to contain the last checkpoint's oldestXid value.

These ideas are still underdeveloped. For one thing, it seems kind of
weak to me that we're obliged to have vacuum grind to a halt across
the cluster because some speculative plugin has its proc's xmin pegged
to some value due to a logical standby still needing it for reading
catalogues only. Think of the failure modes – what happens when the
standby participating in a plugin-based logical replication system
croaks for indeterminate reasons? Doing this with the WAL sender as
part of hot_standby_feedback is clearly much less hazardous, because
there *isn't* a WAL sender when streaming replication isn't active in
respect of some corresponding standby, and hot_standby_feedback need
only support deferring vacuum for the streaming replication standby
whose active snapshot's xmin is furthest in the past. If a standby is
taken out of commission for an hour, it can probably catch up without
difficulty (difficulty like needing a base-backup), and it has no
repercussions for the master or anyone else. However, I believe that
logical change records would be rendered meaningless in the same
scenario, because VACUUM, having not seen a “pegged” xmin horizon due
to the standby's unavailability, goes ahead and vacuums past a point
still needed to keep the standby in a consistent state.

Maybe you can invent a new type of xmin horizon that applies only to
catalogues so this isn't so bad, and I see you've suggested as much in
follow-up mail to Robert, but it might be unsatisfactory to have that
impact the PGAXT performance optimisation introduced in commit
ed0b409d, or obfuscate that code. You could have the xmin be a special
xmin, say through PGAXT.vacuumFlags indicating this, but wouldn't that
necessarily preclude the backend from having a non-special notion of
its xmin? How does that bode for this actually becoming maximally
generic infrastructure?

You do have some ideas about how to re-sync a speculative in-core
logical replication system standby, and indeed you talk about this in
the design document posted a few weeks back [1] (4.8. Setup of
replication nodes). This process is indeed similar to a base backup,
and we'd hope to avoid having to do it for similar reasons - it would
be undesirable to have to do it much more often in practice due to
these types of concerns.

You go on to say:

That also means we need to make that value persist across restarts/crashes in a
very similar manner to twophase.c's. That infrastructure is actually also
useful for making hot_standby_feedback work properly across primary restarts.

So here you are anticipating my criticism above about needing to peg the
xmin horizon. That's fine, but what I still don't know is how you
propose to make this work in a reasonable way, free from all the usual
caveats about leaving prepared transactions open for an indefinitely
long time (what happens when we approach XID wraparound? how does the
need to hold a transaction open interfere with a given plugin
backend's ability to execute regular queries? etc.). Furthermore, I
don't know how it's going to be independently useful to make
hot_standby_feedback work across primary restarts, because the primary
cannot then generate VACUUM cleanup WAL records that the standby might
replay, resulting in a hard conflict. Maybe there's something
incredibly obvious that I'm missing, but doesn't that problem almost
take care of itself? Granted, those cleanup records aren't the only
reason for a hard conflict, but they've got to be by far the most
important, and are documented as such. Either the cleanup record
already exists and you're usually already out of luck anyway, or it
doesn't and you're not. Are you thinking about race conditions?

You talk about the relfilenode/oid mapping problem some:

== Catalog/User Table Detection ==

To detect whether a record/transaction does catalog modifications - which we
need to do for memory/performance reasons - we need to resolve the
RelFileNodes in xlog records back to the original tables. Unfortunately
RelFileNodes only contain the table's relfilenode, not its table oid. We can
only do catalog access once we have reached FULL_SNAPSHOT; before that we can
use some heuristics, but otherwise we have to assume that every record changes
the catalog.

What exactly are the implications of having to assume catalogue
changes? I see that right now, you're just skipping actual decoding
once you've taken care of snapshot building chores within
SnapBuildCallback():

+ 	if (snapstate->state == SNAPBUILD_START)
+ 		return SNAPBUILD_SKIP;

The heuristics we can use are:
* relfilenode->spcNode == GLOBALTABLESPACE_OID
* relfilenode->relNode <= FirstNormalObjectId
* RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid

Those detect some catalog tables but not all (think VACUUM FULL), but if they
detect one they are correct.

If the heuristics aren't going to be completely reliable, why is that
acceptable? You've said it "seems to work fine", but I don't quite
follow.

I see this within SnapBuildCallback() (the function whose name I said
was a misnomer).

After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
a table might not be a catalog table. For that we use the new RELFILENODE
syscache with (spcNode, relNode).

I share some of Robert's concerns here. The fact that relfilenode
mappings have to be time-relativised may tip the scales against this
approach. As Robert has said, it may be that simply adding the table
OID to the WAL header is the way forward. It's not as if we can't
optimise that later. One compromise might be to only do that when we
haven't yet reached FULL_SNAPSHOT.

On the subject of making decoding restartable, you say:

== Restartable Decoding ==

As we want to generate a consistent stream of changes we need to have the
ability to start from a previously decoded location without going through the
whole multi-phase setup, because that would make it very hard to calculate up
to where we need to keep information.

Indeed, it would.

To make that easier, every time a decoding process finds an online checkpoint
record it exclusively takes a global lwlock, checks whether visibility
information has already been written out for that checkpoint, and writes it
out if not. We only need to do that once, as visibility information is the
same between all decoding backends.

Where and how would that visibility information be represented? So,
typically you'd expect it to say “no catalogue changes for entire
checkpoint” much of the time?

The patch we've seen
(0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't
address the question of transactions that contain DDL:

+ 			/* FIXME: For now skip transactions with catalog changes entirely */
+ 			if (ent && ent->does_timetravel)
+ 				does_timetravel = true;
+ 			else
+ 				does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode);

but you do address this question (or a closely related question,
which, I gather is the crux of the issue: How to mix DDL and DML in
transactions?) in the new doc (README.SNAPBUILD.txt [6]):

== mixed DDL/DML transaction handling ==

When a transaction uses DDL and DML in the same transaction, things get a bit
more complicated, because we need to handle CommandIds and ComboCids so that
we use the correct version of the catalog when decoding the individual tuples.

Right, so it becomes necessary to think about time-travelling not just
to a particular transaction, but to a particular point in a particular
transaction – the exact point at which the catalogue showed a
structure consistent with sanely interpreting logical WAL records
created during that window after the last DDL command (if any), but
before the next (if any). This intelligence is only actually needed
when decoding tuples created in that actual transaction, because only
those tuples have their format change throughout a single transaction.

CommandId handling itself is relatively simple: we can figure out the current
CommandId relatively easily by looking at the one currently used in
changes. The problematic part is that those CommandIds frequently will not be
actual cmin or cmax values but ComboCids. Those are used to minimize space in
the heap. During normal operation cmin/cmax values are only used within the
backend emitting those rows, and only during one toplevel transaction, so
instead of storing cmin/cmax only a reference to an in-memory value is stored
that contains both. Whenever we see a new CommandId we call
ApplyCacheAddNewCommandId.

Right. So in general, transaction A doesn't have to concern itself
with the order that other transactions had tuples become visible or
invisible (cmin and cmax); transaction A need only concern itself with
whether they're visible or invisible based on if relevant transactions
(xids) committed, its own xid, plus each tuple's xmin and xmax. It is
this property of cmin/cmax that enabled the combocid optimisation in
8.3, which introduces an array in *backend local* memory, to map a
single HeapTupleHeader field (where previously there were 2 – cmin and
cmax) to an entry in that array, under the theory that it's unusual
for a tuple to be created and then deleted in the same transaction.
Most of the time, that one HeapTupleHeader field wouldn't have a
mapping to the local array – rather, it would simply have a cmin or a
cmax. That's how we save heap space.

To resolve this problem, during heap_*, whenever we generate a new combocid
(detected via a new parameter to HeapTupleHeaderAdjustCmax) in a catalog table
we log a new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During
decoding this ComboCid is added to the applycache
(ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a
single transaction, so we cannot simply set them all up globally.

This seems more or less reasonable. The fact that the combocid
optimisation uses a local memory array isn't actually an important
property of combocids as a performance optimisation – it's just that
there was no need for the actual cmin and cmax values to be visible to
other backends, until now. Comments in combocids.c go on at length
about how infrequently actual combocids are actually used in practice.
For ease of implementation, but also because real combocids are
expected to be needed infrequently, I suggest that rather than logging
the mapping, you log the values directly in a record (i.e. The full
cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You
could easily exhaust the combocid space otherwise, and besides, you
cannot do anything with the mapping from outside of the backend that
originated the combocid for that transaction (you don't have the local
array, or the local hashtable used for combocids).
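
In other words, log something shaped like this (a hypothetical layout, not
the patch's actual record):

typedef struct xl_heap_new_combocid
{
    RelFileNode     node;       /* catalog relation */
    ItemPointerData tid;        /* ctid of the catalog tuple */
    TransactionId   top_xid;    /* toplevel xact owning the mapping */
    CommandId       cmin;       /* real cmin behind the combocid */
    CommandId       cmax;       /* real cmax behind the combocid */
} xl_heap_new_combocid;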

Before calling the output plugin, ComboCids are temporarily set up, and torn
down afterwards.

How? You're using a combocid-like array + hashtable local to the
plugin backend?

Anyway, for now, this is unimplemented, which is perhaps the biggest
concern about it:

+     /* check if its one of our txids, toplevel is also in there */
+ 	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+ 	{
+ 		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+ 		/* no support for that yet */
+ 		if (tuple->t_infomask & HEAP_COMBOCID){
+ 			elog(WARNING, "combocids not yet supported");
+ 			return false;
+ 		}
+ 		if (cmin >= snapshot->curcid)
+ 			return false;	/* inserted after scan started */
+ 	}

Above, you aren't taking this into account (code from HeapTupleHeaderGetCmax()):

/* We do not store cmax when locking a tuple */
Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it
applies (isn't this just a sanity check?), but actually, based on this
it seems to me that the current sane representation of cmin needs to
be obtained in a more concurrency aware fashion - having the backend
local data-structures that originate the tuple isn't even good enough.
The paucity of other callers to HeapTupleHeaderGetRawCommandId() seems
to support this. Apart from contrib/pageinspect, there is only this
one other direct caller, within heap_getsysattr():

/*
* cmin and cmax are now both aliases for the same field, which
* can in fact also be a combo command id. XXX perhaps we should
* return the "real" cmin or cmax if possible, that is if we are
* inside the originating transaction?
*/
result = CommandIdGetDatum(HeapTupleHeaderGetRawCommandId(tup->t_data));

So these few direct call-sites that inspect CommandId outside of their
originating backend are all non-critical observers of the CommandId,
that are roughly speaking allowed to be wrong. Callers to the higher
level abstractions (HeapTupleHeaderGetCmin()/HeapTupleHeaderGetCmax())
know that the CommandId must be a cmin or cmax respectively, because
they have as their contract that
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)) and
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup))
respectively.

I guess this can all happen when you write that
XLOG_HEAP2_NEW_COMBOCID record within the originating transaction
(that is, when the xmin is that of the tuple in the originating
transaction, you are indeed dealing with a cmin), but these are
hazards that you need to consider in addition to the more obvious
ComboCid hazard. I have an idea that the
HeapTupleHeaderGetRawCommandId(tuple) call in your code could well be
bogus even when (t_infomask & HEAP_COMBOCID) == 0.

I look forward to seeing your revision that addressed the issue with
sinval messages.

[1]: http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[2]: http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com

[3]: http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com

[4]: http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com

[5]: http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com

[6]: http://archives.postgresql.org/message-id/201210161330.37967.andres@2ndquadrant.com

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#93Andres Freund
andres@2ndquadrant.com
In reply to: Peter Geoghegan (#92)
Re: First draft of snapshot building design document

Hi,

On Friday, October 19, 2012 10:53:06 PM Peter Geoghegan wrote:

On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote:

Here's the first version of the promised document. I hope it answers most
of the questions.

This makes for interesting reading.

Thanks.

Step 14 *is* kind-of relevant, because this is one place where we
resolve relation OID based on relfilenode, which is part of
snapbuild.c, since it has a LookupTableByRelFileNode() call (the only
other such call is within snapbuild.c itself, as part of checking if a
particular relation + xid have had catalogue changes, which can be a
concern due to DDL, which is the basic problem that snapbuild.c seeks
to solve). Assuming that it truly is necessary to have a
LookupTableByRelFileNode() function, I don't think your plugin (which
will soon actually be a contrib module, right?) has any business
calling it.

That shouldn't be needed anymore, that was just because I didn't finish some
loose ends quickly enough. The callback now gets passed the TupleDesc.

Thankfully, I *think* that this implies that
plugins don't need to directly concern themselves with cache
invalidation.

Correct. They don't need to know anything about it; it's all handled
transparently now.

By the way, shouldn't this use InvalidOid?:

Yes. Fixed.

allay some of those concerns. It says:

== xmin Horizon Handling ==

Reusing MVCC for timetravel access has one obvious major problem:
VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
obviously, we want autovacuum/manual vacuum to work as before.

The idea here is to reuse the infrastructure built for
hot_standby_feedback, which allows us to keep the xmin horizon of a
walsender backend artificially low. We keep it low enough that we can
restart decoding from the last location the client has confirmed to be
safely received. That means we keep it low enough to contain the
last checkpoint's oldestXid value.

These ideas are still underdeveloped. For one thing, it seems kind of
weak to me that we're obliged to have vacuum grind to a halt across
the cluster because some speculative plugin has its proc's xmin pegged
to some value due to a logical standby still needing it for reading
catalogues only. Think of the failure modes – what happens when the
standby participating in a plugin-based logical replication system
croaks for indeterminate reasons? Doing this with the WAL sender as
part of hot_standby_feedback is clearly much less hazardous, because
there *isn't* a WAL sender when streaming replication isn't active in
respect of some corresponding standby, and hot_standby_feedback need
only support deferring vacuum for the streaming replication standby
whose active snapshot's xmin is furthest in the past.

If a standby is taken out of commission for an hour, it can probably catch
up without difficulty (difficulty like needing a base-backup), and it has no
repercussions for the master or anyone else.

Only if you have set up an archive_command with sufficient retention. Otherwise
it all depends on the wal_keep_segments value.

However, I believe that logical change records would be rendered meaningless
in the same scenario, because VACUUM, having not seen a “pegged” xmin
horizon due to the standby's unavailability, goes ahead and vacuums past a
point still needed to keep the standby in a consistent state.

Well, that's why I want to persist them, similar to twophase.c.

Maybe you can invent a new type of xmin horizon that applies only to
catalogues so this isn't so bad

Yes, I thought about this. It seems like a very sensible optimization, but I
really, really would like to get the simpler version finished.

and I see you've suggested as much in follow-up mail to Robert

but it might be unsatisfactory to have that impact the PGAXT performance
optimisation introduced in commit ed0b409d, or obfuscate that code.

Hm. I don't foresee any need to pessimize that. But I guess the burden of proof
lies on me writing up a patch for this.

You could have the xmin be a special xmin, say through PGAXT.vacuumFlags
indicating this, but wouldn't that necessarily preclude the backend from
having a non-special notion of its xmin? How does that bode for this
actually becoming maximally generic infrastructure?

Uhm. Why should the decoding backend *ever* need its own xmin? It will *never*
be allowed to do writes itself, or similar.

...
You go on to say:

That also means we need to make that value persist across
restarts/crashes in a very similar manner to twophase.c's. That
infrastructure is actually also useful for making hot_standby_feedback work
properly across primary restarts.

So here you are anticipating my criticism above about needing to peg the
xmin horizon. That's fine, but what I still don't know is how you
propose to make this work in a reasonable way, free from all the usual
caveats about leaving prepared transactions open for an indefinitely
long time (what happens when we approach XID wraparound? how does the
need to hold a transaction open interfere with a given plugin
backend's ability to execute regular queries? etc.).

Well, it's not like it's introducing a problem that wasn't there before. I
really don't see this as something all that problematic.

If people protest, we can add some tiny bit of code to autovacuum that throws
everything away that's older than some autovacuum_vacuum_freeze_max_age or
such. We could even do the same for prepared transactions.

But in the end the answer is that you need to monitor *any* replication system
carefully.

Furthermore, I don't know how it's going to be independently useful to make
hot_standby_feedback work across primary restarts, because the primary
cannot then generate VACUUM cleanup WAL records that the standby might
replay, resulting in a hard conflict. Maybe there's something
incredibly obvious that I'm missing, but doesn't that problem almost
take care of itself?

Autovacuum starts immediately after a restart. Once it has started some
workers, a reconnecting standby cannot lower the xmin again, so you have a high
likelihood of conflicts. I have seen that problem in the field (ironically
"fixed" by creating a prepared xact before restarting ...).

You talk about the relfilenode/oid mapping problem some:

== Catalog/User Table Detection ==

To detect whether a record/transaction does catalog modifications - which
we need to do for memory/performance reasons - we need to resolve the
RelFileNodes in xlog records back to the original tables. Unfortunately
RelFileNodes only contain the table's relfilenode, not its table oid.
We can only do catalog access once we have reached FULL_SNAPSHOT; before that
we can use some heuristics, but otherwise we have to assume that every
record changes the catalog.

What exactly are the implications of having to assume catalogue
changes?

Higher cpu/storage overhead, nothing else.

I see that right now, you're just skipping actual decoding
once you've taken care of snapshot building chores within
SnapBuildCallback():

+ 	if (snapstate->state == SNAPBUILD_START)
+ 		return SNAPBUILD_SKIP;

When we're in SNAPBUILD_START state we don't have *any* knowledge about the
system yet, so we can't do anything with collected records anyway (very likely
the record we just read was part of an already started transaction). Once we're
in SNAPBUILD_FULL_SNAPSHOT state we can start collecting changes if the
respective transaction started *after* we have reached
SNAPBUILD_FULL_SNAPSHOT.

The heuristics we can use are:
* relfilenode->spcNode == GLOBALTABLESPACE_OID
* relfilenode->relNode <= FirstNormalObjectId
* RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid

Those detect some catalog tables but not all (think VACUUM FULL), but if
they detect one they are correct.

If the heuristics aren't going to be completely reliable, why is that
acceptable? You've said it "seems to work fine", but I don't quite
follow.

Should have left that out; it's a small internal optimization... If the above
heuristics detect that a relfilenode relates to a catalog table, that's
guaranteed to be correct. They cannot detect all catalog changes though, so you
can only use them to skip doing work for catalog tables, not for skipping work
if !catalog.
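
Spelled out, the heuristics amount to something like this (a sketch;
RelationMapFilenodeToOid() is the reverse relmapper lookup referred to above):

/*
 * true means "definitely a catalog table"; false means "unknown" - it
 * might still be a catalog table, e.g. after a VACUUM FULL assigned it
 * a new relfilenode.
 */
static bool
relfilenode_is_definitely_catalog(RelFileNode *rfn)
{
    if (rfn->spcNode == GLOBALTABLESPACE_OID)
        return true;    /* shared catalogs live in pg_global */
    if (rfn->relNode <= FirstNormalObjectId)
        return true;    /* relfilenode assigned at initdb time */
    if (RelationMapFilenodeToOid(rfn->relNode, false) != InvalidOid)
        return true;    /* a mapped (nailed) relation */
    return false;
}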

After reaching FULL_SNAPSHOT we can do catalog access if our heuristics
tell us a table might not be a catalog table. For that we use the new
RELFILENODE syscache with (spcNode, relNode).

I share some of Robert's concerns here. The fact that relfilenode
mappings have to be time-relativised may tip the scales against this
approach.

What's the problem with the time-relativized access?

As Robert has said, it may be that simply adding the table
OID to the WAL header is the way forward. It's not as if we can't
optimise that later. One compromise might be to only do that when we
haven't yet reached FULL_SNAPSHOT

When writing the WAL we don't have any knowledge about what state some
potential decoding process could be in, so it's an all-or-nothing thing.

I don't have a problem with writing the table oid into the records somewhere;
I just think it's not required. One reason for storing it in there, independent
of this patchset/feature, is debugging. I have wished for that in the past.

We need to build the snapshot to access the catalog anyway, so it's not like
doing the relfilenode lookup time-relativized incurs any additional costs.
Also, we need to do the table-oid lookup time-relativized as well, because
table oids can be reused.

To make that easier, every time a decoding process finds an online
checkpoint record it exclusively takes a global lwlock, checks whether
visibility information has already been written out for that
checkpoint, and writes it out if not. We only need to do that once, as
visibility information is the same between all decoding backends.

Where and how would that visibility information be represented?

Some extra pg_* directory, like pg_decode/$LSN_OF_CHECKPOINT.

So, typically you'd expect it to say “no catalogue changes for entire
checkpoint“ much of the time?

No, not really. It will probably look similar to the files ExportSnapshot
currently produces. Even if no catalog changes happened we still need to keep
knowledge about committed transactions and such.

Btw, I doubt all that many checkpoint<->checkpoint windows will have
absolutely no catalog changes. At least some pg_class.relfrozenxid,
pg_class.reltuples changes are to be expected.
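
For concreteness, a hypothetical layout for such a
pg_decode/$LSN_OF_CHECKPOINT file (all field names invented here, loosely
following what ExportSnapshot writes):

typedef struct SnapBuildOnDisk
{
	XLogRecPtr	checkpoint_lsn;		/* checkpoint this state belongs to */
	TransactionId xmin;				/* oldest xid still relevant */
	TransactionId xmax;				/* first as-yet-unseen xid */
	uint32		committed_cnt;		/* # of committed xids that follow */
	/* TransactionId committed[committed_cnt] follows on disk,
	 * terminated by a CRC over the whole file */
} SnapBuildOnDisk;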

The patch we've seen
(0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't
address the question of transactions that contain DDL:
...
but you do address this question (or a closely related question,
which, I gather is the crux of the issue: How to mix DDL and DML in

transactions?) in the new doc (README.SNAPBUILD.txt [6]):

== mixed DDL/DML transaction handling ==

When a transaction uses DDL and DML in the same transaction, things get a
bit more complicated, because we need to handle CommandIds and ComboCids
so that we use the correct version of the catalog when decoding the
individual tuples.

Right, so it becomes necessary to think about time-travelling not just
to a particular transaction, but to a particular point in a particular
transaction – the exact point at which the catalogue showed a
structure consistent with sanely interpreting logical WAL records
created during that window after the last DDL command (if any), but
before the next (if any). This intelligence is only actually needed
when decoding tuples created in that actual transaction, because only
those tuples can have their format change within a single transaction.

Exactly.

CommandId handling itself is relatively simple: we can figure out the
current CommandId easily by looking at the one currently used in
changes. The problematic part is that those CommandIds frequently will
not be actual cmin or cmax values but ComboCids. Those are used to
minimize space in the heap. During normal operation cmin/cmax values are
only used within the backend emitting those rows and only during one
toplevel transaction, so instead of storing cmin/cmax separately, only a
reference to an in-memory value containing both is stored. Whenever we
see a new CommandId we call ApplyCacheAddNewCommandId.

Right. So in general, transaction A doesn't have to concern itself
with the order that other transactions had tuples become visible or
invisible (cmin and cmax); transaction A need only concern itself with
whether they're visible or invisible based on if relevant transactions
(xids) committed, its own xid, plus each tuple's xmin and xmax. It is
this property of cmin/cmax that enabled the combocid optimisation in
8.3, which introduces an array in *backend local* memory, to map a
single HeapTupleHeader field (where previously there were 2 – cmin and
cmax) to an entry in that array, under the theory that it's unusual
for a tuple to be created and then deleted in the same transaction.
Most of the time, that one HeapTupleHeader field wouldn't have a
mapping to the local array – rather, it would simply have a cmin or a
cmax. That's how we save heap space.

Yes. The whole handling here is nearly completely analogous to the normal
handling of CommandIds.
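
For reference, the backend-local mapping described above boils down to
roughly this (modeled on the backend's combocid.c; details elided):

/* an index into this backend-local array is what actually gets stored
 * in the tuple header when HEAP_COMBOCID is set */
typedef struct ComboCidKeyData
{
	CommandId	cmin;		/* inserting command */
	CommandId	cmax;		/* deleting command */
} ComboCidKeyData;

static ComboCidKeyData *comboCids;	/* grows as needed */
static int	usedComboCids;			/* number of entries in use */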

To resolve this problem, whenever heap_* generates a new combocid
(detected via a new parameter to HeapTupleHeaderAdjustCmax) in a catalog
table, we log a new XLOG_HEAP2_NEW_COMBOCID record containing the
mapping. During decoding this ComboCid is added to the applycache
(ApplyCacheAddNewComboCid). They are only guaranteed to be valid within
a single transaction, so we cannot simply set them all up globally.
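
A sketch of what such a record could carry (field names are assumptions
derived from the description above, not the patch's actual definition):

typedef struct xl_heap_new_combocid
{
	TransactionId top_xid;		/* toplevel xid the mapping belongs to */
	CommandId	combocid;		/* value found in the heap tuple */
	CommandId	cmin;			/* real inserting command id */
	CommandId	cmax;			/* real deleting command id */
	RelFileNode node;			/* catalog relation affected */
	ItemPointerData tid;		/* tuple the mapping applies to */
} xl_heap_new_combocid;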

This seems more or less reasonable. The fact that the combocid
optimisation uses a local memory array isn't actually an important
property of combocids as a performance optimisation

It is an important property here, because ComboCids from different toplevel
transactions conflict with each other, which means we have to deal with them
on a per-toplevel-xid basis.

For ease of implementation, but also because real combocids are
expected to be needed infrequently, I suggest that rather than logging
the mapping, you log the values directly in a record (i.e. the full
cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You
could easily exhaust the combocid space otherwise, and besides, you
cannot do anything with the mapping from outside of the backend that
originated the combocid for that transaction (you don't have the local
array, or the local hashtable used for combocids).

I can't really follow here. Obviously we need to generate the
XLOG_HEAP2_NEW_COMBOCID locally in the transaction/backend that generated the
change?

Before calling the output plugin, ComboCids are temporarily set up and torn
down afterwards.

How? You're using a combocid-like array + hashtable local to the
plugin backend?

I added
extern void PutComboCommandId(CommandId combocid, CommandId cmin, CommandId cmax);
which in combination with the existing
extern void AtEOXact_ComboCid(void);
is enough.
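
Put together, the setup/teardown around an output plugin invocation would
then look roughly like this (the loop is illustrative; ApplyCacheComboCid,
txn->combocids and the apply_change arguments are assumed names):

	ListCell   *lc;

	/* install every combocid mapping logged for this toplevel xact */
	foreach(lc, txn->combocids)
	{
		ApplyCacheComboCid *cc = (ApplyCacheComboCid *) lfirst(lc);

		PutComboCommandId(cc->combocid, cc->cmin, cc->cmax);
	}

	/* the output plugin can now resolve cmin/cmax during catalog access */
	apply_change(cache, txn, change);

	/* tear the per-transaction mappings down again */
	AtEOXact_ComboCid();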

Anyway, for now, this is unimplemented, which is perhaps the biggest
concern about it:

+	/* check if its one of our txids, toplevel is also in there */
+	else if (TransactionIdInArray(xmin, snapshot->subxip,
+								  snapshot->subxcnt))
+	{
+		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID)
+		{
+			elog(WARNING, "combocids not yet supported");
+			return false;
+		}
+		if (cmin >= snapshot->curcid)
+			return false;	/* inserted after scan started */
+	}

Above, you aren't taking this into account (code from
HeapTupleHeaderGetCmax()):

/* We do not store cmax when locking a tuple */
Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it
applies (isn't this just a sanity check?), but actually, based on this
it seems to me that the current sane representation of cmin needs to
be obtained in a more concurrency aware fashion - having the backend
local data-structures that originate the tuple isn't even good enough.

You completely lost me here and in the following paragraphs. The infomask is
available for everyone, and we only read/write cmin|cmax|combocid when we're
inside the transaction or when we have already logged a HEAP2_NEW_COMBOCID
and thus have the necessary information?
Which concurrency concerns are you referring to?

I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your
code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

Ah? The other .satisfies routines do HeapTupleHeaderGetCmin(tuple) which
returns exactly that if !(tup->t_infomask & HEAP_COMBOCID).

But anyway, with the new combocid handling the code uses the usual
HeapTupleHeaderGetCmin/Cmax calls, so it looks even more like the normal
routines.

Thanks for the extensive review! I am pretty sure this is a lot to take in ;).

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#94Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: [RFC][PATCH] wal decoding, attempt #2

Comments about the approach or even the general direction of the
implementation? Questions?

This patch series has gotten a serious amount of discussion and useful
feedback; even some parts of it have been committed. I imagine lots
more feedback, discussion and spawning of new ideas will take place in
Prague. I am marking it as Returned with Feedback for now. Updated,
rebased, modified versions are expected for the next commitfest.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#95Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#5)
Re: [PATCH 4/8] add simple xlogdump tool

After some fooling around to provide the discussed backend functionality
to xlogdump (StringInfo atop PQExpBuffer and elog_start/elog_finish),
the following items still need work:

1. rmgr tables
We're linking rmgr.c so that we can obtain the appropriate rm_desc
function pointer for each rmgr. However the table also includes the
rm_redo, startup, etc. function pointers, which the linker wants resolved
at xlogdump link time. The idea I have to handle this is to use a macro
similar to PG_KEYWORD: at compile time we define it differently for
xlogdump than for the backend, so that the symbols we don't want are hidden.

2. ereport() functionality
Currently the xlogreader.c I'm using (the latest version posted by
Andres) has both elog() calls and ereport(). I have provided trivial
elog_start and elog_finish implementations, which covers the first. I
am not really sure about implementing the whole errstart/errfinish
stack, because that'd be pretty duplicative, though I haven't tried.
The other alternative suggested elsewhere is to avoid elog/ereport
entirely in xlogreader.c and instead pass a function pointer for error
reportage (see the sketch after this list). The backend would normally use
ereport(), but xlogdump could do something simple with fprintf. I think
that would end up being cleaner overall.

3. timestamptz_to_str
xact_desc uses this, which involves a couple of messy backend files
(because there's palloc in them, among other problems). Alternatively
we could tweak xact_desc to use EncodeDateTime (probably through some
simple wrapper); given the constraints imposed on the values, that might
be simpler, and we can provide a simple implementation of EncodeDateTime
or of its hypothetical wrapper in xlogdump.

4. relpathbackend and pfree of its return value
This is messy. Maybe we should use a caller-supplied buffer instead of
palloc to solve this.
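
Regarding item 2, here is a minimal sketch of the function-pointer idea
(type and field names are assumptions, not a settled API):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* error reporting callback installed into the reader state */
typedef void (*XLogReaderReportError) (const char *fmt,...);

typedef struct XLogReaderState
{
	/* ... existing reader fields ... */
	XLogReaderReportError report_error;	/* must not be NULL */
} XLogReaderState;

/* what a frontend like xlogdump could plug in */
static void
fatal_error(const char *fmt,...)
{
	va_list		args;

	fprintf(stderr, "xlogdump: ");
	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
	fputc('\n', stderr);
	exit(EXIT_FAILURE);
}

The backend would instead install a thin wrapper around ereport().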

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#96Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#25)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Heikki Linnakangas wrote:

Hmm. I was thinking that making this work in a non-backend context
would be too hard, so I didn't give that much thought, but I guess
there aren't many dependencies on backend functions after all.
palloc/pfree are straightforward to replace with malloc/free. That's
what we could easily do with the error messages too, just malloc a
suitably sized buffer.

How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?

It links the object file into the src/bin/ subdir or whatever. I don't
think a shared library is warranted at this point.

One further (relatively minor) problem with what you propose here is
that the xlogreader.c code is using emode_for_corrupt_record(), which
not only lives in xlog.c but also depends on readSource. I guess we
still want the message-suppressing abilities of that function, so some
more thinking is required in this area.

I think you may have converted some malloc() calls from Andres' patch
into palloc() -- because you have some palloc() calls which are later
checked for NULL results, which obviously doesn't make sense. At the
same time, if we're going to use malloc() instead of palloc(), we need
to check for NULL return value in XLogReaderAllocate() callers. This
seems easy to fix at first glance, but what is the correct response if
it fails during StartupXLOG()? Should we just elog(FATAL) and hope it
never happens in practice?

Andres commented elsewhere about reading xlog records, processing them
as they came in, and do a running CRC while we're still reading it. I
think this is a mistake; we shouldn't do anything with a record until
the CRC has been verified. Otherwise we risk reading arbitrarily
corrupt data.

Overall, I think I like Heikki's minimal patch better than Andres'
original proposal, but I'll keep looking at both.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#97Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#96)
Re: Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I think you may have converted some malloc() calls from Andres' patch
into palloc() -- because you have some palloc() calls which are later
checked for NULL results, which obviously doesn't make sense. At the
same time, if we're going to use malloc() instead of palloc(), we need
to check for NULL return value in XLogReaderAllocate() callers. This
seems easy to fix at first glance, but what is the correct response if
it fails during StartupXLOG()? Should we just elog(FATAL) and hope it
never happens in practice?

Um, surely we can still let those functions use palloc? It should
just be #define'd as pg_malloc() (ie something with an error exit)
in non-backend contexts.
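
A minimal sketch of that mapping (pg_malloc() as used by the existing
frontend utilities; the exact guard is an assumption):

#ifdef FRONTEND
#define palloc(sz)	pg_malloc(sz)	/* pg_malloc() exits on OOM */
#define pfree(ptr)	free(ptr)
#endif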

regards, tom lane

#98Andres Freund
andres@2ndquadrant.com
In reply to: Alvaro Herrera (#96)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Hmm. I was thinking that making this work in a non-backend context
would be too hard, so I didn't give that much thought, but I guess
there aren't many dependencies on backend functions after all.
palloc/pfree are straightforward to replace with malloc/free. That's
what we could easily do with the error messages too, just malloc a
suitably sized buffer.

How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?

Andres commented elsewhere about reading xlog records, processing them
as they came in, and do a running CRC while we're still reading it. I
think this is a mistake; we shouldn't do anything with a record until
the CRC has been verified. Otherwise we risk reading arbitrarily
corrupt data.

Uhm. xlog.c does just the same. It reads the header and if it looks valid it
uses its length information to read the full record and only computes the CRC
at the end.

Overall, I think I like Heikki's minimal patch better than Andres'
original proposal, but I'll keep looking at both.

I'll defer to you two on that, no point in fighting that many experienced
people ;)

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#99Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#98)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Andres Freund wrote:

On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Andres commented elsewhere about reading xlog records, processing them
as they came in, and do a running CRC while we're still reading it. I
think this is a mistake; we shouldn't do anything with a record until
the CRC has been verified. Otherwise we risk reading arbitrarily
corrupt data.

Uhm. xlog.c does just the same. It reads the header and if it looks valid it
uses its length information to read the full record and only computes the CRC
at the end.

Uh. Correct.

Am I the only one who finds this rather bizarre? Maybe this was okay
when xlog data would only come from WAL files stored in the data
directory at recovery, but if we're now receiving these from a remote
sender over the network I wonder if we should be protecting against
malicious senders. (This is not related to this patch anyway.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#100Andres Freund
andres@2ndquadrant.com
In reply to: Alvaro Herrera (#99)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote:

Andres Freund wrote:

On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Andres commented elsewhere about reading xlog records, processing them
as they came in, and do a running CRC while we're still reading it. I
think this is a mistake; we shouldn't do anything with a record until
the CRC has been verified. Otherwise we risk reading arbitrarily
corrupt data.

Uhm. xlog.c does just the same. It reads the header and if it looks valid
it uses its length information to read the full record and only computes
the CRC at the end.

Uh. Correct.

Am I the only one who finds this rather bizarre? Maybe this was okay
when xlog data would only come from WAL files stored in the data
directory at recovery, but if we're now receiving these from a remote
sender over the network I wonder if we should be protecting against
malicious senders. (This is not related to this patch anyway.)

How should this work otherwise? The CRC is over the whole data, so we
obviously need to read the whole data to compute the CRC? Would you prefer
protecting the header with a separate CRC?
You can't use a CRC against malicious users anyway; it's not
cryptographically secure in any meaning of the word, and it's trivial to
generate different content resulting in the same CRC. The biggest user of
the CRC checking code we have is making sure we're not reading beyond the
end of the WAL...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#101Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#100)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Andres Freund <andres@2ndquadrant.com> writes:

On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote:

Am I the only one who finds this rather bizarre? Maybe this was okay
when xlog data would only come from WAL files stored in the data
directory at recovery, but if we're now receiving these from a remote
sender over the network I wonder if we should be protecting against
malicious senders. (This is not related to this patch anyway.)

You can't use a CRC against malicious users anyway, it's not cryptographically
secure in any meaning of the word,

More to the point, if a bad guy has got control of your WAL stream,
crashing the startup process with a ridiculous length word is several
orders of magnitude less than the worst thing he can find to do to you.
So the above argument seems completely nonsensical to me. Anybody who's
worried about that type of scenario is better advised to be setting up
SSL to ensure that the stream is coming from the server they think it's
coming from.

regards, tom lane

#102Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#101)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote:

Am I the only one who finds this rather bizarre? Maybe this was okay
when xlog data would only come from WAL files stored in the data
directory at recovery, but if we're now receiving these from a remote
sender over the network I wonder if we should be protecting against
malicious senders. (This is not related to this patch anyway.)

You can't use a CRC against malicious users anyway, it's not cryptographically
secure in any meaning of the word,

More to the point, if a bad guy has got control of your WAL stream,
crashing the startup process with a ridiculous length word is several
orders of magnitude less than the worst thing he can find to do to you.
So the above argument seems completely nonsensical to me.

Well, I wasn't talking just about the record length, but about the
record in general. The length just came up because it's what I noticed.

And yeah, I was thinking of one sum for the header and another one for
the data. If we're using CRC to detect end of WAL, what sense does it
make to have to read the whole record if we can detect the end by just
looking at the header? (I obviously see that we need to checksum the
data as well; and having a second CRC field would bloat the record.)

Anybody who's worried about that type of scenario is better advised to
be setting up SSL to ensure that the stream is coming from the server
they think it's coming from.

Okay.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#103Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#102)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

And yeah, I was thinking of one sum for the header and another one for
the data.

I don't think it's worth the space.

If we're using CRC to detect end of WAL, what sense does it
make to have to read the whole record if we can detect the end by just
looking at the header?

Er, what? The typical case where CRC shows us it's end of WAL is that
the overall CRC doesn't match, ie, torn record. Record header
corruption as such is just about nonexistent AFAIK.

regards, tom lane

#104Andres Freund
andres@2ndquadrant.com
In reply to: Alvaro Herrera (#102)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Tuesday, October 30, 2012 04:24:21 PM Alvaro Herrera wrote:

Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote:

Am I the only one who finds this rather bizarre? Maybe this was okay
when xlog data would only come from WAL files stored in the data
directory at recovery, but if we're now receiving these from a remote
sender over the network I wonder if we should be protecting against
malicious senders. (This is not related to this patch anyway.)

You can't use a CRC against malicious users anyway, it's not
cryptographically secure in any meaning of the word,

More to the point, if a bad guy has got control of your WAL stream,
crashing the startup process with a ridiculous length word is several
orders of magnitude less than the worst thing he can find to do to you.
So the above argument seems completely nonsensical to me.

Well, I wasn't talking just about the record length, but about the
record in general. The length just came up because it's what I noticed.

And yeah, I was thinking of one sum for the header and another one for
the data. If we're using CRC to detect end of WAL, what sense does it
make to have to read the whole record if we can detect the end by just
looking at the header? (I obviously see that we need to checksum the
data as well; and having a second CRC field would bloat the record.)

Well, the header is written first. In the header we can detect somewhat
accurately that we're beyond the current end-of-WAL by looking at ->xl_prev
and doing some validity checks, but that's not applicable for the data. A
valid-looking header doesn't mean that the whole, potentially
multi-megabyte, record data is valid; we could have crashed while writing
the data.
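
The header-only checks alluded to above are cheap; condensed into a sketch
(modeled on xlog.c's record-header validation, details elided):

static bool
header_looks_sane(XLogRecord *record, XLogRecPtr expected_prev)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
		return false;		/* impossibly short */
	if (record->xl_rmid > RM_MAX_ID)
		return false;		/* unknown resource manager */
	if (record->xl_prev != expected_prev)
		return false;		/* back-pointer doesn't chain up */
	return true;			/* plausible, but the data is unverified */
}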

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#105Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#95)
1 attachment(s)
Enabling frontend-only xlog "desc" routines

I mentioned the remaining issues in a previous email (see
message-id 20121025161751.GE6442@alvh.no-ip.org). Attached is a
patch that enables xlogdump to #include xlog_internal.h by way of
removing that file's inclusion of fmgr.h, which is problematic. I don't
think this should be too contentious.

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable
RMGR_REDO_FUNC.
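
Either way, the table entries would end up looking roughly like this
(entries abbreviated; the startup/cleanup pointers would presumably need
the same treatment via analogous macros):

const RmgrData RmgrTable[RM_MAX_ID + 1] = {
	{"XLOG", RMGR_REDO_FUNC(xlog_redo), xlog_desc, NULL, NULL, NULL},
	{"Transaction", RMGR_REDO_FUNC(xact_redo), xact_desc, NULL, NULL, NULL},
	{"Heap", RMGR_REDO_FUNC(heap_redo), heap_desc, NULL, NULL, NULL},
	/* ... */
};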

Opinions please?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

xlog-fn.patch (text/x-diff; charset=us-ascii)
commit 74986720979c1c2bb39a58133563fd8da82c301b
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Nov 27 15:33:18 2012 -0300

    Split SQL-callable function declarations into xlog_fn.h
    
    This lets xlog_internal.h go without including fmgr.h, which is useful
    to let it compile in a frontend-only environment.

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index d345761..40c0bd6 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "access/xlog.h"
+#include "access/xlog_fn.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
diff --git a/src/include/access/xlog_fn.h b/src/include/access/xlog_fn.h
new file mode 100644
index 0000000..65376fe
--- /dev/null
+++ b/src/include/access/xlog_fn.h
@@ -0,0 +1,35 @@
+/*
+ * xlog_fn.h
+ *
+ * PostgreSQL transaction log SQL-callable function declarations
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/xlog_fn.h
+ */
+#ifndef XLOG_FN_H
+#define XLOG_FN_H
+
+#include "fmgr.h"
+
+extern Datum pg_start_backup(PG_FUNCTION_ARGS);
+extern Datum pg_stop_backup(PG_FUNCTION_ARGS);
+extern Datum pg_switch_xlog(PG_FUNCTION_ARGS);
+extern Datum pg_create_restore_point(PG_FUNCTION_ARGS);
+extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
+extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
+extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
+extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
+extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
+extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS);
+extern Datum pg_xlog_replay_pause(PG_FUNCTION_ARGS);
+extern Datum pg_xlog_replay_resume(PG_FUNCTION_ARGS);
+extern Datum pg_is_xlog_replay_paused(PG_FUNCTION_ARGS);
+extern Datum pg_xlog_location_diff(PG_FUNCTION_ARGS);
+extern Datum pg_is_in_backup(PG_FUNCTION_ARGS);
+extern Datum pg_backup_start_time(PG_FUNCTION_ARGS);
+
+#endif   /* XLOG_FN_H */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index b70a620..5802d1d 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -17,7 +17,6 @@
 #define XLOG_INTERNAL_H
 
 #include "access/xlog.h"
-#include "fmgr.h"
 #include "pgtime.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
@@ -253,26 +252,4 @@ extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
 extern void XLogArchiveCleanup(const char *xlog);
 
-/*
- * These aren't in xlog.h because I'd rather not include fmgr.h there.
- */
-extern Datum pg_start_backup(PG_FUNCTION_ARGS);
-extern Datum pg_stop_backup(PG_FUNCTION_ARGS);
-extern Datum pg_switch_xlog(PG_FUNCTION_ARGS);
-extern Datum pg_create_restore_point(PG_FUNCTION_ARGS);
-extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
-extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
-extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
-extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
-extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
-extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
-extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
-extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS);
-extern Datum pg_xlog_replay_pause(PG_FUNCTION_ARGS);
-extern Datum pg_xlog_replay_resume(PG_FUNCTION_ARGS);
-extern Datum pg_is_xlog_replay_paused(PG_FUNCTION_ARGS);
-extern Datum pg_xlog_location_diff(PG_FUNCTION_ARGS);
-extern Datum pg_is_in_backup(PG_FUNCTION_ARGS);
-extern Datum pg_backup_start_time(PG_FUNCTION_ARGS);
-
 #endif   /* XLOG_INTERNAL_H */
#106Amit Kapila
amit.kapila@huawei.com
In reply to: Alvaro Herrera (#105)
Re: Enabling frontend-only xlog "desc" routines

On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote:

I mentioned the remaining issues in a previous email (see message-id
20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables
xlogdump to #include xlog_internal.h by way of removing that file's
inclusion of fmgr.h, which is problematic. I don't think this should be
too contentious.

I have tried to go through the xlogdump patch provided in the mail chain of
the message-id above. It seems there are no appropriate file/function
headers, which makes it a little difficult to understand the purpose.
This is just a suggestion and not related to this mail.

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable
RMGR_REDO_FUNC.

What I understood is that since xlogdump only needs the rm_desc function, we
want to separate it out so that it can be used easily.
In Approach-1, the other functions (redo, startup, cleanup, ...) are defined
as NULL for frontends, so the corresponding symbols become hidden.
In Approach-2, the frontend (in this case xlogdump) needs to define a value
for that macro using -D in its makefile.

If my understanding is right, then I think Approach-2 might be better, as
the frontend will have more flexibility if it wants to use some other
functionality of rmgr.

With Regards,
Amit Kapila.


#107Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#106)
Re: Enabling frontend-only xlog "desc" routines

On 2012-11-28 18:58:45 +0530, Amit Kapila wrote:

On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote:

I mentioned the remaining issues in a previous email (see message-id
20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables
xlogdump to #include xlog_internal.h by way of removing that file's
inclusion of fmgr.h, which is problematic. I don't think this should be
too contentious.

I have tried to go through the xlogdump patch provided in the mail chain of
the message-id above. It seems there are no appropriate file/function
headers, which makes it a little difficult to understand the purpose.
This is just a suggestion and not related to this mail.

An updated version of xlogdump with some initial documentation, sensible
building, and some more is available at
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlogreader_v3

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable
RMGR_REDO_FUNC.

What I understood is that since xlogdump only needs the rm_desc function, we
want to separate it out so that it can be used easily.
In Approach-1, the other functions (redo, startup, cleanup, ...) are defined
as NULL for frontends, so the corresponding symbols become hidden.
In Approach-2, the frontend (in this case xlogdump) needs to define a value
for that macro using -D in its makefile.

If my understanding is right, then I think Approach-2 might be better, as
the frontend will have more flexibility if it wants to use some other
functionality of rmgr.

I personally favor approach-1 because I cannot see any other potential
use. You basically need to have the full backend code available just to
successfully link the other functions. Running is even more complex, and
there's no real point in doing that standalone anyway, so, what for?

It's not like that's something that cannot be changed should an actual
use case emerge.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#108Amit Kapila
amit.kapila@huawei.com
In reply to: Andres Freund (#107)
Re: Enabling frontend-only xlog "desc" routines

On Wednesday, November 28, 2012 7:07 PM Andres Freund wrote:

On 2012-11-28 18:58:45 +0530, Amit Kapila wrote:

On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote:

I mentioned the remaining issues in a previous email (see message-id
20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables
xlogdump to #include xlog_internal.h by way of removing that file's
inclusion of fmgr.h, which is problematic. I don't think this should be
too contentious.

I have tried to go through the xlogdump patch provided in the mail chain of
the message-id above. It seems there are no appropriate file/function
headers, which makes it a little difficult to understand the purpose.
This is just a suggestion and not related to this mail.

An updated version of xlogdump with some initial documentation, sensible
building, and some more is available at
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlogreader_v3

Oops.. looked at the wrong version.

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable
RMGR_REDO_FUNC.

What I understood is that since xlogdump only needs the rm_desc function, we
want to separate it out so that it can be used easily.
In Approach-1, the other functions (redo, startup, cleanup, ...) are defined
as NULL for frontends, so the corresponding symbols become hidden.
In Approach-2, the frontend (in this case xlogdump) needs to define a value
for that macro using -D in its makefile.

If my understanding is right, then I think Approach-2 might be better, as
the frontend will have more flexibility if it wants to use some other
functionality of rmgr.

I personally favor approach-1 because I cannot see any other potential
use. You basically need to have the full backend code available just to
successfully link the other functions. Running is even more complex, and
there's no real point in doing that standalone anyway, so, what for?

Such functionality might be used if somebody wants to write an independent
test for the storage engine, but I am not sure whether such a thing
(Approach-2) would be helpful.

As I see it, Approach-1 has the advantage that there is no dependency in the
makefiles for exposing the rm_desc functionality. And for Approach-2, it is
unnecessary for the makefile to define a value if there is actually no other
relevant use case for it.

Can you think of any other pros/cons of the two approaches, or do you think
we already have sufficient reasoning to conclude on Approach-1?

With Regards,
Amit Kapila.


#109Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#105)
Re: Enabling frontend-only xlog "desc" routines

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

Why try to link rmgr.c into frontend versions at all? Just make
a new table file that only includes the desc function pointers.
Yeah, then there would be two table files to maintain, but it's
not clear to me that it's uglier than these proposals ...
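
For illustration, such a desc-only table might look like this (a sketch;
struct and table names are assumptions):

typedef struct RmgrDescData
{
	const char *rm_name;
	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
} RmgrDescData;

static const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
	{"XLOG", xlog_desc},
	{"Transaction", xact_desc},
	/* ... */
};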

regards, tom lane


#110Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#109)
Re: Enabling frontend-only xlog "desc" routines

On 2012-11-29 15:03:48 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

Why try to link rmgr.c into frontend versions at all? Just make
a new table file that only includes the desc function pointers.
Yeah, then there would be two table files to maintain, but it's
not clear to me that it's uglier than these proposals ...

Seems more likely to get out of sync. But then, rmgr.c isn't changing
that heavily...
I still prefer the -DFRONTEND solutions, but once more, it's only a
slight preference.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#111Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#109)
Re: Enabling frontend-only xlog "desc" routines

On Thu, Nov 29, 2012 at 3:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

The other interesting question remaining is what to do about the rm_desc
function in rmgr.c. We're split between these two ideas:

Why try to link rmgr.c into frontend versions at all? Just make
a new table file that only includes the desc function pointers.
Yeah, then there would be two table files to maintain, but it's
not clear to me that it's uglier than these proposals ...

+1. It's not likely to get updated very often, we can add comments to
remind people to keep them all in sync, and if you manage to screw it
up without noticing then you are adding recovery code that you have
not tested in any way whatsoever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
